100+ datasets found

Data Collection And Labeling Report
datainsightsmarket.com
doc, pdf, ppt
Updated Nov 17, 2025
Cite
Data Insights Market (2025). Data Collection And Labeling Report [Dataset]. https://www.datainsightsmarket.com/reports/data-collection-and-labeling-1945059
Available download formats: ppt, doc, pdf
Dataset updated
Nov 17, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Explore the booming data collection and labeling market, driven by AI advancements. Discover key growth drivers, market trends, and forecasts for 2025-2033, essential for AI development across IT, automotive, and healthcare.
Sentiment Datasets for Online Learning Platforms
kaggle.com
zip
Updated Jul 28, 2025
Cite
ARVIKRIZ (2025). Sentiment Datasets for Online Learning Platforms [Dataset]. https://www.kaggle.com/datasets/arvikriz/sentiment-datasets-for-online-learning-platforms
Available download formats: zip (583753 bytes)
Dataset updated
Jul 28, 2025
Authors
ARVIKRIZ
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains synthetic review data collected from popular online learning platforms such as Coursera, Udemy, and RateMyProfessors. It is designed to support sentiment analysis research by providing structured review content labeled with sentiment classifications.

📌 Purpose

The dataset aims to facilitate Natural Language Processing (NLP) tasks, especially in the context of educational feedback analysis, by enabling users to:

Train and evaluate sentiment classification models.

Analyze learner satisfaction across platforms.

Visualize sentiment trends in online education.

📂 Dataset Composition

The dataset is synthetically generated and includes review texts with associated sentiment labels. It may include:

Review text: A learner's comment or review.

Sentiment label: Categories like positive, neutral, or negative.

Source indicator: Platform such as Coursera, Udemy, or RateMyProfessors.

🔍 Potential Applications

Sentiment classification using machine learning (e.g., Logistic Regression, SVM, BERT, VADER).

Topic modeling to extract key concerns or highlights from reviews.

Dashboards for educational insights and user experience monitoring.

✅ Notes

This dataset is synthetic and intended for academic and research purposes only.

No personally identifiable information (PII) is included.

Labeling is consistent with typical sentiment classification tasks.
Data Use in Academia Dataset
datacatalog.worldbank.org
csv, utf-8
Updated Nov 27, 2023
Cite
Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
Available download formats: utf-8, csv
Dataset updated
Nov 27, 2023
Dataset provided by
Semantic Scholar Open Research Corpus (S2ORC)
Brian William Stacy
License
https://datacatalog.worldbank.org/public-licenses?fragment=cc
Description
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.

Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.

We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.

Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.

The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.

To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled in a non-standard way.
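
The regular expression approach can be illustrated with a minimal Python sketch. The name list, the function name `find_countries`, and the sample title and abstract are all hypothetical; the study describes a full set of ISO 3166 country names, and its actual patterns are not given here. The example also shows the exclusion error mentioned above: a non-standard spelling is missed.

```python
import re

# Hypothetical, abbreviated name list; the study compiles a full ISO 3166 set.
COUNTRY_NAMES = ["Kenya", "Brazil", "Viet Nam", "India"]

# One alternation wrapped in word boundaries, so "India" does not
# match inside "Indiana".
COUNTRY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COUNTRY_NAMES) + r")\b",
    flags=re.IGNORECASE,
)


def find_countries(*fields):
    """Return the sorted list of listed country names found in any text field."""
    canonical = {name.lower(): name for name in COUNTRY_NAMES}
    found = set()
    for text in fields:
        for match in COUNTRY_PATTERN.finditer(text or ""):
            found.add(canonical[match.group(1).lower()])
    return sorted(found)


# Hypothetical article fields.
title = "Household survey evidence from Kenya and Brazil"
abstract = "We compare outcomes with a sample from Vietnam and from Indiana."

# "Vietnam" is missed because the list only has the spelling "Viet Nam".
print(find_countries(title, abstract))  # ['Brazil', 'Kenya']
```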

The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.

The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:

Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study uses data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.

The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to MTurk workers.
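
The time and cost figures above can be reproduced with a short calculation. This sketch assumes that "years of human work time" means continuous time (24 hours a day, 365 days a year), which the text does not state explicitly; the variable names are illustrative.

```python
ARTICLES = 1_037_748        # sampled corpus size, from the text
MEDIAN_MINUTES = 25.4       # median time per article, from the text
COST_PER_ARTICLE_USD = 3    # payment per article, from the text

total_minutes = ARTICLES * MEDIAN_MINUTES

# Assumption: continuous time, i.e. 60 * 24 * 365 minutes per year.
total_years = total_minutes / (60 * 24 * 365)
total_cost = ARTICLES * COST_PER_ARTICLE_USD

print(f"{total_years:.1f} years")  # 50.1 years
print(f"${total_cost:,}")          # $3,113,244
```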

A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model. 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
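
The 90% decision rule can be sketched as follows. The text does not say how confidence is computed; this sketch assumes it is the softmax probability of the "uses data" class from a two-class output, and the logit ordering and function names are hypothetical.

```python
import math

CONFIDENCE_THRESHOLD = 0.90  # from the text


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uses_data(logits, threshold=CONFIDENCE_THRESHOLD):
    """Label an article as using data only if the model is confident enough.

    Assumed (hypothetical) ordering: logits = [score_no_data, score_uses_data].
    """
    p_uses_data = softmax(logits)[1]
    return p_uses_data >= threshold


print(uses_data([0.0, 3.0]))  # True: probability is about 0.95
print(uses_data([0.0, 1.0]))  # False: probability is about 0.73
```

A high threshold like this trades recall for precision: articles the model is only moderately sure about are not counted as using data.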

The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.

The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
3A2M+ dataset structure.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jan 28, 2025
Cite
Nazmus Sakib; G. M. Shahariar; Md. Mohsinul Kabir; Md. Kamrul Hasan; Hasan Mahmud (2025). 3A2M+ dataset structure. [Dataset]. http://doi.org/10.1371/journal.pone.0317697.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317697.t005
Dataset updated
Jan 28, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Nazmus Sakib; G. M. Shahariar; Md. Mohsinul Kabir; Md. Kamrul Hasan; Hasan Mahmud
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sharing cooking recipes is a great way to exchange culinary ideas and provide instructions for food preparation. However, categorizing raw recipes found online into appropriate food genres can be challenging due to a lack of adequate labeled data. In this study, we present a dataset named the “Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Recipe Dataset” that contains two million culinary recipes labeled in respective categories with extended named entities extracted from recipe descriptions. This collection of data includes various features such as title, NER, directions, and extended NER, as well as nine different labels representing genres including bakery, drinks, non-veg, vegetables, fast food, cereals, meals, sides, and fusions. The proposed pipeline named 3A2M+ extends the size of the Named Entity Recognition (NER) list to address missing named entities like heat, time or process from the recipe directions using two NER extraction tools. 3A2M+ dataset provides a comprehensive solution to the various challenging recipe-related tasks, including classification, named entity recognition, and recipe generation. Furthermore, we have demonstrated traditional machine learning, deep learning and pre-trained language models to classify the recipes into their corresponding genre and achieved an overall accuracy of 98.6%. Our investigation indicates that the title feature played a more significant role in classifying the genre.
MAAD : Multi-Label Arabic Articles Dataset
data.mendeley.com
Updated Oct 27, 2025
Cite
Marwah Yahya Al-Nahari (2025). MAAD : Multi-Label Arabic Articles Dataset [Dataset]. http://doi.org/10.17632/hbfc9j8hj8.2
Unique identifier
https://doi.org/10.17632/hbfc9j8hj8.2
Dataset updated
Oct 27, 2025
Authors
Marwah Yahya Al-Nahari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The MAAD dataset represents a comprehensive collection of Arabic news articles that may be employed across a diverse array of Arabic Natural Language Processing (NLP) applications, including but not limited to classification, text generation, summarization, and various other tasks. The dataset was diligently assembled through the application of specifically designed Python scripts that targeted six prominent news platforms: Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, in conjunction with regional and local media outlets, ultimately resulting in a total of 602,792 articles. This dataset exhibits a total word count of 29,371,439, with the number of unique words totaling 296,518; the average word length has been determined to be 6.36 characters, while the mean article length is calculated at 736.09 characters. This extensive dataset is categorized into ten distinct classifications: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. The data fields are categorized into five distinct types: Title, Article, Summary, Category, and Published_ Date. The MAAD dataset is structured into six files, each named after the corresponding news outlets from which the data was sourced; within each directory, text files are provided, containing the number of categories represented in a single file, formatted in txt to accommodate all news articles. This dataset serves as an expansive standard resource designed for utilization within the context of our research endeavors.
Learning Privacy from Visual Entities - Curated data sets and pre-computed...
zenodo.org
zip
Updated May 7, 2025
Cite
Alessio Xompero; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15348506
Dataset updated
May 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alessio Xompero; Andrea Cavallaro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.

Curated image privacy data sets

In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. created with a stratified K-Fold cross-validation procedure. As with the original datasets (PicAlert and PrivacyAlert), we provide the links to the images in bash scripts that download them. Another bash script re-organises the images into sub-folders with at most 1000 images in each folder.
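
The sub-folder re-organisation can be pictured with a small Python sketch. The repository does this with a bash script, so this is only an illustration of the idea; the folder naming (`batch0`, `batch1`, ...), the sorting by filename, and the function names are assumptions, not the repository's actual layout.

```python
MAX_IMAGES_PER_FOLDER = 1000  # from the text


def batch_folder(index, max_per_folder=MAX_IMAGES_PER_FOLDER):
    """Map a zero-based image index to a sub-folder name (hypothetical naming)."""
    return f"batch{index // max_per_folder}"


def plan_layout(filenames):
    """Group filenames into sub-folders holding at most 1000 images each."""
    layout = {}
    for index, name in enumerate(sorted(filenames)):
        layout.setdefault(batch_folder(index), []).append(name)
    return layout


# Hypothetical file names, only to show the grouping.
files = [f"img{i:05d}.jpg" for i in range(2500)]
layout = plan_layout(files)
print({folder: len(names) for folder, names in layout.items()})
# {'batch0': 1000, 'batch1': 1000, 'batch2': 500}
```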

Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, private events. Images were annotated with a binary label denoting if the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets have therefore a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images already limited in PicAlert. Further details are in our corresponding publication (https://doi.org/10.48550/arXiv.2503.12464).

List of datasets and their original source:

PicAlert [Images occupy 2.4 GB]

VISPR [Images occupy 49.7 GB]

PrivacyAlert [Images occupy 1 GB]

Notes:

For PicAlert and PrivacyAlert, only URLs to the original locations in Flickr are available in the Zenodo record

The collectors and authors of the PrivacyAlert dataset selected the images from Flickr under a Public Domain license

Owners of the photos on Flickr could have removed the photos from the social media platform

Running the bash scripts to download the images can trigger the "429 Too Many Requests" status code
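
One common way to cope with "429 Too Many Requests" responses is to retry with exponentially growing waits. The sketch below is not part of the repository's bash scripts; the function name `download_with_backoff`, its parameters, and the delay values are hypothetical, and the `fetch` and `sleep` arguments exist only so the retry logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def download_with_backoff(url, fetch=None, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Fetch a URL, retrying on HTTP 429 with exponentially growing waits."""
    if fetch is None:
        # Default fetcher: plain HTTP GET returning the response body as bytes.
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()

    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Retry only on 429; re-raise other errors and the final failure.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits of 1 s, 2 s, 4 s, ...
```

Pausing between downloads in the first place, rather than only after a 429, is usually gentler on the server.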

Pre-computed visual entities

Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models on these features without re-computing them, either up front or at every epoch during training (faster training).

For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.

Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

Enquiries, questions and comments

If you have any enquiries, questions, or comments, or would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.
AI in Semi-supervised Learning Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Cite
Research Intelo (2025). AI in Semi-supervised Learning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-semi-supervised-learning-market
Available download formats: pdf, csv, pptx
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Semi-supervised Learning Market Outlook

According to our latest research, the AI in Semi-supervised Learning market size reached USD 1.82 billion in 2024 globally, driven by rapid advancements in artificial intelligence and machine learning applications across diverse industries. The market is expected to expand at a robust CAGR of 28.1% from 2025 to 2033, reaching a projected value of USD 17.17 billion by 2033. This exponential growth is primarily fueled by the increasing need for efficient data labeling, the proliferation of unstructured data, and the growing adoption of AI-driven solutions in both large enterprises and small and medium businesses. As per the latest research, the surging demand for automation, accuracy, and cost-efficiency in data processing is significantly accelerating the adoption of semi-supervised learning models worldwide.
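
The growth figures can be sanity-checked with the standard compound annual growth rate formula, value_end = value_start × (1 + CAGR)^periods. The report does not state which year the rate compounds from; this sketch assumes nine annual periods from the 2024 figure to 2033, and the function names are illustrative.

```python
def project(value, cagr, periods):
    """Compound a starting value forward at a constant annual growth rate."""
    return value * (1 + cagr) ** periods


def implied_cagr(start, end, periods):
    """Annual growth rate that takes `start` to `end` over `periods` years."""
    return (end / start) ** (1 / periods) - 1


# Figures from the text, in USD billions. Assumption: 9 periods (2024 to 2033).
projected_2033 = project(1.82, 0.281, 9)
rate = implied_cagr(1.82, 17.17, 9)

print(f"projected 2033 value: {projected_2033:.2f} billion")
print(f"implied CAGR: {rate:.1%}")
```

Under this assumption the projection comes out at roughly USD 16.9 billion and the implied rate at roughly 28.3%, close to the stated USD 17.17 billion and 28.1%. The small gap may come from rounding or from a different base year.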

One of the most significant growth factors for the AI in Semi-supervised Learning market is the explosive increase in data generation across industries such as healthcare, finance, retail, and automotive. Organizations are continually collecting vast amounts of structured and unstructured data, but the process of labeling this data for supervised learning remains time-consuming and expensive. Semi-supervised learning offers a compelling solution by leveraging small amounts of labeled data alongside large volumes of unlabeled data, thus reducing the dependency on extensive manual annotation. This approach not only accelerates the deployment of AI models but also enhances their accuracy and scalability, making it highly attractive for enterprises seeking to maximize the value of their data assets while minimizing operational costs.

Another critical driver propelling the growth of the AI in Semi-supervised Learning market is the increasing sophistication of AI algorithms and the integration of advanced technologies such as deep learning, natural language processing, and computer vision. These advancements have enabled semi-supervised learning models to achieve remarkable performance in complex tasks like image and speech recognition, medical diagnostics, and fraud detection. The ability to process and interpret vast datasets with minimal supervision is particularly valuable in sectors where labeled data is scarce or expensive to obtain. Furthermore, the ongoing investments in research and development by leading technology companies and academic institutions are fostering innovation, resulting in more robust and scalable semi-supervised learning frameworks that can be seamlessly integrated into enterprise workflows.

The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud environments are also contributing significantly to the expansion of the AI in Semi-supervised Learning market. Cloud-based deployment offers unparalleled scalability, flexibility, and cost-efficiency, allowing organizations of all sizes to access cutting-edge AI tools and infrastructure without the need for substantial upfront investments. This democratization of AI technology is empowering small and medium enterprises to leverage semi-supervised learning for competitive advantage, driving widespread adoption across regions and industries. Additionally, the emergence of AI-as-a-Service (AIaaS) platforms is further simplifying the integration and management of semi-supervised learning models, enabling businesses to accelerate their digital transformation initiatives and unlock new growth opportunities.

From a regional perspective, North America currently dominates the AI in Semi-supervised Learning market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading AI vendors, robust technological infrastructure, and high investments in AI research and development are key factors driving market growth in these regions. Asia Pacific is expected to witness the fastest CAGR during the forecast period, fueled by rapid digitalization, expanding IT infrastructure, and increasing government initiatives to promote AI adoption. Meanwhile, Latin America and the Middle East & Africa are also showing promising growth potential, supported by rising awareness of AI benefits and growing investments in digital transformation projects across various sectors.

Component Analysis

The component segment of the AI in Semi-supervised Learning market is divided into software, hardware, and services, each playing a pivotal role in the adoption and implementation of semi-s
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
Cite
Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anonymous authors
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
ChatAlign-FeedbackSet
kaggle.com
zip
Updated May 14, 2025
Cite
Alan Jafari (2025). ChatAlign-FeedbackSet [Dataset]. https://www.kaggle.com/datasets/alanjafari/chatalign-feedbackset
Available download formats: zip (141998521 bytes)
Dataset updated
May 14, 2025
Authors
Alan Jafari
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
🧠 Dataset Title: Human-AI Preference Alignment (Inspired by Anthropic’s HH-RLHF)

📘 Overview: This dataset presents a curated collection of human-AI interaction samples designed to support cutting-edge research in Reinforcement Learning from Human Feedback (RLHF), ethical AI development, and model alignment. It follows the structure and spirit of the original hh-rlhf, making it a high-impact resource for fine-tuning and evaluating Large Language Models (LLMs).

Whether you're working on alignment, instruction-following behavior, safety, or human preference modeling, this dataset provides a strong foundation for experimentation and development.

🧩 What’s Inside?

✅ Thousands of preference-labeled response pairs, where annotators select the more aligned AI reply

✅ Multi-turn conversations between human prompts and assistant completions

✅ Designed for reward model training, RLHF pipelines, and supervised fine-tuning

✅ Structured in a way that supports both transformer-based and reinforcement learning models

✅ Covers a wide range of topics, from factual QA to ethical dilemmas and role-play
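
Preference-labeled pairs like these are commonly used to train a reward model with a pairwise (Bradley-Terry style) loss, -log(sigmoid(r_chosen - r_rejected)). The dataset description does not specify a training objective, so the following is only a minimal sketch of that common technique; the function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Computed as softplus(-margin), with a branch to avoid overflow
    when the margin is large in magnitude.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one preference pair.
print(pairwise_preference_loss(2.0, 0.0))  # small loss: chosen reply scores higher
print(pairwise_preference_loss(0.0, 2.0))  # larger loss: ranking is inverted
```

The loss shrinks toward zero as the reward model scores the annotator-preferred reply further above the rejected one.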

🎯 Use Cases:

🔹 Train reward models for instruction-following AI (e.g., InstructGPT, Claude, ChatGPT-style agents)

🔹 Evaluate LLM alignment with human values like helpfulness, harmlessness, and honesty (HHH)

🔹 Fine-tune open-source models (e.g., LLaMA, Mistral, Falcon, Gemma) using RLHF pipelines

🔹 Build preference-based datasets for safe and interpretable AI systems

🔹 Use in comparative learning tasks, conversational modeling, or safety benchmarking

🌍 Why This Dataset Matters: As AI systems become more capable, aligning their behavior with human ethical preferences becomes critically important. Human feedback is at the core of building AI that can reason, act safely, and respond meaningfully. This dataset contributes to that mission by offering high-quality, human-labeled data that reflects real-world human expectations in AI responses.

By enabling fine-tuning of models with reinforcement learning from actual human judgments, this dataset brings us one step closer to building trustworthy AI.

🧪 Inspirations & References:

Anthropic’s HH-RLHF

OpenAI’s InstructGPT

Constitutional AI & Ethical Alignment techniques

Reward Modeling in Reinforcement Learning

📌 Tags / Keywords: AI Alignment • RLHF • Large Language Models • Reward Modeling • Preference Comparison • Ethical AI • Human Feedback • Open-source Fine-tuning

💬 Citation & Credit: If you use this dataset in your research, demos, or fine-tuning workflows, please cite the original HH-RLHF dataset and acknowledge this Kaggle version as an adapted resource for open-access experimentation.
Neural sequential transfer learning for relation extraction
resodate.org
Updated Jan 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christoph Benedikt Alt (2021). Neural sequential transfer learning for relation extraction [Dataset]. http://doi.org/10.14279/depositonce-11154
Explore at:
Unique identifier
https://doi.org/10.14279/depositonce-11154
Dataset updated
Jan 20, 2021
Dataset provided by
DepositOnce
Technische Universität Berlin
Authors
Christoph Benedikt Alt
Description
Relation extraction (RE) is concerned with developing methods and models that automatically detect and retrieve relational information from unstructured data. It is crucial to information extraction (IE) applications that aim to leverage the vast amount of knowledge contained in unstructured natural language text, for example, in web pages, online news, and social media; and simultaneously require the powerful and clean semantics of structured databases instead of searching, querying, and analyzing unstructured text directly. In practical applications, however, relation extraction is often characterized by limited availability of labeled data, due to the cost of annotation or scarcity of domain-specific resources. In such scenarios it is difficult to create models that perform well on the task. It therefore is desired to develop methods that learn more efficiently from limited labeled data and also exhibit better overall relation extraction performance, especially in domains with complex relational structure. In this thesis, I propose to use transfer learning to address this problem, i.e., to reuse knowledge from related tasks to improve models, in particular, their performance and efficiency to learn from limited labeled data. I show how sequential transfer learning, specifically unsupervised language model pre-training, can improve performance and sample efficiency in supervised and distantly supervised relation extraction. In the light of improved modeling abilities, I observe that better understanding neural network-based relation extraction methods is crucial to gain insights that further improve their performance. I therefore present an approach to uncover the linguistic features of the input that neural RE models encode and use for relation prediction. I further complement this with a semi-automated analysis approach focused on model errors, datasets, and annotations. 
It effectively highlights controversial examples in the data for manual evaluation and allows error hypotheses to be specified and verified automatically. Together, the researched approaches allow us to build better-performing, more sample-efficient relation extraction models and advance our understanding of them despite their complexity. Further, they facilitate more comprehensive analyses of model errors and datasets in the future.
The comparison of the median of the binary classification measurement...
plos.figshare.com
bin
Updated Jun 13, 2023
Cite
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). The comparison of the median of the binary classification measurement results on the synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t004
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0274569.t004
Dataset updated
Jun 13, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The comparison of the median of the binary classification measurement results on the synthetic data.
Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and...
zenodo.org
data.niaid.nih.gov
bin
Updated Aug 30, 2023
Cite
Jannik Stebani; Martin Blaimer; Tilman Neun; Daniël M. Pelt; Kristen Rak (2023). Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks [Dataset]. http://doi.org/10.5281/zenodo.8277159
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.8277159
Dataset updated
Aug 30, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jannik Stebani; Martin Blaimer; Tilman Neun; Daniël M. Pelt; Kristen Rak
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The provided dataset comprises 43 instances of temporal bone volume CT scans. The scans were performed on human cadaveric specimens with a resulting isotropic voxel size of $99 \times 99 \times 99 \, \mathrm{\mu m}^3$. Voxel-wise image labels of the fluid space of the bony labyrinth, subdivided into the three semantic classes cochlear volume, vestibular volume and semicircular canal volume, are provided. In addition, each dataset contains JSON-like descriptor data defining the voxel coordinates of the anatomical landmarks: (1) apex of the cochlea, (2) oval window and (3) round window. The dataset can be used to train and evaluate machine learning models for automated inner ear analysis in the context of the supervised learning paradigm.

Usage Notes

The datasets are formatted in the HDF5 format developed by the HDF5 Group. We utilized and thus recommend the usage of Python bindings pyHDF to handle the datasets.

The flat-panel volume CT raw data, labels and landmarks are saved in the HDF5-internal file structure using the respective group and datasets:

raw/raw-0
label/label-0
landmark/landmark-0
landmark/landmark-1
landmark/landmark-2

Array raw and label data can be read from the file by indexing into an opened h5py file handle, for example as numpy.ndarray. Further metadata is contained in the attribute dictionaries of the raw and label datasets.

Landmark coordinate data is available as an attribute dict and contains the coordinate system (LPS or RAS), IJK voxel coordinates and label information. The helicotrema or cochlea top is globally saved in landmark 0, the oval window in landmark 1 and the round window in landmark 2. Read as a Python dictionary, exemplary landmark information for a dataset may read as follows:

{'coordsys': 'LPS', 'id': 1, 'ijk_position': array([181, 188, 100]), 'label': 'CochleaTop', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 44.21109689, -139.38058589, -183.48249736])}

{'coordsys': 'LPS', 'id': 2, 'ijk_position': array([222, 182, 145]), 'label': 'OvalWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.27890112, -139.95991131, -179.04103763])}

{'coordsys': 'LPS', 'id': 3, 'ijk_position': array([223, 209, 147]), 'label': 'RoundWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.33120126, -137.27135678, -178.8665465 ])}
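
As a minimal sketch of the usage notes above: the hypothetical helper `load_case` reads the listed HDF5 paths with h5py (it is not run here, and the file path is whatever the user supplies), and `landmark_distance_mm` converts the IJK positions from the example dictionaries into a physical distance. It assumes that the landmark attributes live at the paths shown above and that the stated 99 µm isotropic voxel size applies to the IJK grid.

```python
import math

VOXEL_SIZE_UM = 99.0  # isotropic voxel edge length stated in the description


def load_case(path):
    """Read one HDF5 file using the internal paths listed above (assumed layout)."""
    import h5py  # imported lazily so the helper below works without h5py installed

    with h5py.File(path, "r") as f:
        raw = f["raw/raw-0"][()]        # full CT volume as numpy.ndarray
        label = f["label/label-0"][()]  # voxel-wise class labels
        # Landmark information is stored in the attribute dictionaries.
        landmarks = [dict(f[f"landmark/landmark-{i}"].attrs) for i in range(3)]
    return raw, label, landmarks


def landmark_distance_mm(ijk_a, ijk_b, voxel_size_um=VOXEL_SIZE_UM):
    """Euclidean distance between two IJK voxel positions, in millimetres."""
    voxels = math.dist(ijk_a, ijk_b)
    return voxels * voxel_size_um / 1000.0


# IJK positions copied from the example dictionaries above.
cochlea_top = [181, 188, 100]
oval_window = [222, 182, 145]
print(round(landmark_distance_mm(cochlea_top, oval_window), 2))
```

The distance computed from voxel indices can be cross-checked against the `xyz_position` entries of the same landmarks.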
Details of methods implementation.
plos.figshare.com
xls
Updated Apr 17, 2024
Cite
Malek Senoussi; Thierry Artieres; Paul Villoutreix (2024). Details of methods implementation. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012006.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1012006.t001
Dataset updated
Apr 17, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Malek Senoussi; Thierry Artieres; Paul Villoutreix
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Single-cell RNA sequencing (scRNASeq) data plays a major role in advancing our understanding of developmental biology. An important current question is how to classify transcriptomic profiles obtained from scRNASeq experiments into the various cell types and identify the lineage relationship for individual cells. Because of the fast accumulation of datasets and the high dimensionality of the data, it has become challenging to explore and annotate single-cell transcriptomic profiles by hand. To overcome this challenge, automated classification methods are needed. Classical approaches rely on supervised training datasets. However, due to the difficulty of obtaining data annotated at single-cell resolution, we propose instead to take advantage of partial annotations. The partial label learning framework assumes that we can obtain a set of candidate labels containing the correct one for each data point, a simpler setting than requiring a fully supervised training dataset. We study and extend when needed state-of-the-art multi-class classification methods, such as SVM, kNN, prototype-based, logistic regression and ensemble methods, to the partial label learning framework. Moreover, we study the effect of incorporating the structure of the label set into the methods. We focus particularly on the hierarchical structure of the labels, as commonly observed in developmental processes. We show, on simulated and real datasets, that these extensions enable to learn from partially labeled data, and perform predictions with high accuracy, particularly with a nonlinear prototype-based method. We demonstrate that the performances of our methods trained with partially annotated data reach the same performance as fully supervised data. Finally, we study the level of uncertainty present in the partially annotated data, and derive some prescriptive results on the effect of this uncertainty on the accuracy of the partial label learning methods. 
Overall, our findings show how hierarchical and non-hierarchical partial label learning strategies can help solve the problem of automated classification of single-cell transcriptomic profiles. Interestingly, these methods rely on a much less stringent type of annotated dataset than fully supervised learning methods.
Telecom Data Labeling Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Telecom Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/telecom-data-labeling-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Telecom Data Labeling Market Outlook

According to our latest research, the global Telecom Data Labeling market size reached USD 1.42 billion in 2024, driven by the exponential growth in data generation, increasing adoption of AI and machine learning in telecom operations, and the rising complexity of communication networks. The market is forecasted to expand at a robust CAGR of 22.8% from 2025 to 2033, reaching an estimated USD 10.09 billion by 2033. This strong momentum is underpinned by the escalating demand for high-quality labeled datasets to power advanced analytics and automation in the telecom sector.
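
The relationship between a starting value, a compound annual growth rate and a forecast can be sketched as below. This is an illustration only: the function names are hypothetical, and the report does not say from which base year the 22.8% rate is compounded, so the number of periods (9) is an assumption and the printed values need not match the report's figures exactly.

```python
def project_value(start_value, annual_rate, years):
    """Compound start_value at annual_rate for the given number of years."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that turns start_value into end_value over years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the paragraph above (USD billion); 9 periods is an assumption.
projected = project_value(1.42, 0.228, 9)
rate = implied_cagr(1.42, 10.09, 9)
print(f"projected value after 9 periods: {projected:.2f}")
print(f"rate implied by the two endpoints: {rate:.1%}")
```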

The growth trajectory of the Telecom Data Labeling market is fundamentally propelled by the surging data volumes generated by telecom networks worldwide. With the proliferation of 5G, IoT devices, and cloud-based services, telecom operators are inundated with massive streams of structured and unstructured data. Efficient data labeling is essential to transform raw data into actionable insights, fueling AI-driven solutions for network optimization, predictive maintenance, and fraud detection. Additionally, the mounting pressure on telecom companies to enhance customer experience and operational efficiency is prompting significant investments in data labeling infrastructure and services, further accelerating market expansion.

Another critical growth factor is the rapid evolution of artificial intelligence and machine learning applications within the telecommunications industry. AI-powered tools depend on vast quantities of accurately labeled data to deliver reliable predictions and automation. As telecom companies strive to automate network management, detect anomalies, and personalize user experiences, the demand for high-quality labeled datasets has surged. The emergence of advanced labeling techniques, including semi-automated and automated labeling methods, is enabling telecom enterprises to keep pace with the growing data complexity and volume, thus fostering faster and more scalable AI deployments.

Furthermore, regulatory compliance and data privacy concerns are shaping the landscape of the Telecom Data Labeling market. As governments worldwide tighten data protection regulations, telecom operators are compelled to ensure that data used for AI and analytics is accurately labeled and anonymized. This necessity is driving the adoption of robust data labeling solutions that not only facilitate compliance but also enhance data quality and integrity. The integration of secure, privacy-centric labeling platforms is becoming a competitive differentiator, especially in regions with stringent data governance frameworks. This trend is expected to persist, reinforcing the market's upward trajectory.

AI-Powered Product Labeling is revolutionizing the telecom industry by providing more efficient and accurate data annotation processes. This technology leverages artificial intelligence to automate the labeling of large datasets, reducing the time and costs associated with manual labeling. By utilizing AI algorithms, telecom operators can ensure that their data is consistently labeled with high precision, which is crucial for training machine learning models. This advancement not only enhances the quality of labeled data but also accelerates the deployment of AI-driven solutions across various applications, such as network optimization and customer experience management. As AI-Powered Product Labeling continues to evolve, it is expected to play a pivotal role in the telecom sector's digital transformation journey, enabling operators to harness the full potential of their data assets.

From a regional perspective, Asia Pacific is emerging as a powerhouse in the Telecom Data Labeling market, fueled by rapid digitalization, expanding telecom infrastructure, and the early adoption of 5G technologies. North America remains a significant contributor, owing to its mature telecom ecosystem and high investments in AI research and development. Europe is also witnessing steady growth, driven by regulatory mandates and increasing focus on data-driven network management. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with investments in digital transformation and telecom modernization initiatives providing new growth avenues. These regional dynamics collectively underscore the global nature
Human Faces and Objects Mix Image Dataset
data.mendeley.com
Updated Mar 13, 2025
+ more versions
Cite
Bindu Garg (2025). Human Faces and Objects Mix Image Dataset [Dataset]. http://doi.org/10.17632/nzwvnrmwp3.1
Explore at:
Unique identifier
https://doi.org/10.17632/nzwvnrmwp3.1
Dataset updated
Mar 13, 2025
Authors
Bindu Garg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description: Human Faces and Objects Dataset (HFO-5000) The Human Faces and Objects Dataset (HFO-5000) is a curated collection of 5,000 images, categorized into three distinct classes: male faces (1,500), female faces (1,500), and objects (2,000). This dataset is designed for machine learning and computer vision applications, including image classification, face detection, and object recognition. The dataset provides high-quality, labeled images with a structured CSV file for seamless integration into deep learning pipelines.

Column Description: The dataset is accompanied by a CSV file that contains essential metadata for each image. The CSV file includes the following columns:
file_name: The name of the image file (e.g., image_001.jpg).
label: The category of the image, with three possible values: "male" (for male face images), "female" (for female face images), "object" (for images of various objects).
file_path: The full or relative path to the image file within the dataset directory.
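
A minimal sketch of reading such a metadata file and checking the class balance follows. The sample rows and the `images/` directory are invented for illustration; only the three column names and the three label values come from the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real CSV ships with the dataset.
SAMPLE_CSV = """file_name,label,file_path
image_001.jpg,male,images/image_001.jpg
image_002.jpg,female,images/image_002.jpg
image_003.jpg,object,images/image_003.jpg
image_004.jpg,object,images/image_004.jpg
"""

VALID_LABELS = {"male", "female", "object"}


def count_labels(handle):
    """Count images per label and reject labels outside the documented set."""
    counts = Counter()
    for row in csv.DictReader(handle):
        label = row["label"]
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label!r} for {row['file_name']}")
        counts[label] += 1
    return counts


counts = count_labels(io.StringIO(SAMPLE_CSV))
print(dict(counts))
```

To run it against the real file, pass an opened file handle (for example `open(path, newline="")`) instead of the in-memory sample.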

Uniqueness and Key Features: 1) Balanced Distribution: The dataset maintains an even distribution of human faces (male and female) to minimize bias in classification tasks. 2) Diverse Object Selection: The object category consists of a wide variety of items, ensuring robustness in distinguishing between human and non-human entities. 3) High-Quality Images: The dataset consists of clear and well-defined images, suitable for both training and testing AI models. 4) Structured Annotations: The CSV file simplifies dataset management and integration into machine learning workflows. 5) Potential Use Cases: This dataset can be used for tasks such as gender classification, facial recognition benchmarking, human-object differentiation, and transfer learning applications.

Conclusion: The HFO-5000 dataset provides a well-structured, diverse, and high-quality set of labeled images that can be used for various computer vision tasks. Its balanced distribution of human faces and objects ensures fairness in training AI models, making it a valuable resource for researchers and developers. By offering structured metadata and a wide range of images, this dataset facilitates advancements in deep learning applications related to facial recognition and object classification.
SCoRe-LFC: Platform data on crowd collaboration in higher education
data.niaid.nih.gov
Updated Feb 17, 2022
Cite
Allert, Heidrun; Bussian, Christine; Raffel, Lars-Arne; Reichelt, Norma; Richter, Christoph (2022). SCoRe-LFC: Platform data on crowd collaboration in higher education [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6109105
Explore at:
Dataset updated
Feb 17, 2022
Dataset provided by
Christian-Albrechts-Universität zu Kiel
Authors
Allert, Heidrun; Bussian, Christine; Raffel, Lars-Arne; Reichelt, Norma; Richter, Christoph
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SCoRe (short for Student Crowd Research) was a joint research project between the Universities of Bremen (UB), Hamburg (UHH) and Kiel (CAU), the Macromedia University of Applied Sciences (HMM) and the Ghostthinker GmbH (GT). The overall aim of the project was to develop a digital learning and research environment as well as didactic scenarios that foster collaborative processes of research-based learning in large groups of students (crowd). The main subject area was research for sustainable development. Towards this end, the project consortium drew on the partners’ expertise in advanced video technologies (HMM), virtual collaboration in interdisciplinary and large-scale groups (CAU), research-based learning (UHH) and education and research for sustainable development (UB). To achieve its goals, the project adopted a design-based research approach. The project started in Oct. 2018 and was funded for 3.5 years by the Federal Ministry of Education and Research (BMBF) in a funding scheme on digital higher education.

The work in the department of media-pedagogy and educational computer sciences at Kiel University was focused on the sub-project „SCoRe - learning and researching in the crowd“. The sub-project was aimed at the development, implementation and evaluation of pedagogical and organizational measures for the seeding, coordination and orchestration of collaborative research and learning processes in crowd scenarios. Particular emphasis was placed on crowd-specific characteristics of productive knowledge work in large and interdisciplinary groups.

This dataset contains interaction data as well as textual content data. Ongoing development of the software platform led to the continuous integration of new features into the platform itself as well as changes to the data collection functions, making this an evolving dataset. Some inconsistencies exist due to software bugs.

Platform data structure diagram

Further readings to gain an understanding of the platform and its interaction possibilities (in German):

Design Report Prototype 2

Design Report Prototype 3

Contained Files

Filename | Description
annotations.csv | Annotations (comment and/or drawings on the video) of video files
content.csv | Content of sections
events.csv | All events triggered by user interaction
media.csv | Uploaded images and videos
messages.csv | Chat messages
sequences.csv | Sequences of video files

Columns

(Not all columns are present in each file. 0, “null” or “none” might mean not applicable.)

Column name | Description | Format
Index (empty column name) | Unique identifier of the corresponding event in the original dataset | UUID (int on rare occasions)
Actor-Name | Unique identifier of an actor – “MA” identifies project staff | string
Annotation-ID | Unique identifier of an annotation | int
Annotation-Text | Label of an annotation | string
Version-ID | Unique identifier of a version of an auditable object (e.g. a section) | int
Version-Changelog | Changelog message on saving a new version of a section | string
Case-ID | Unique identifier of a case (if applicable, coded by research team) | string
Media-Caption | Title of a media file (image, video) | string
Media-ID | Unique identifier of a media file (image, video) | int
Media-Timestamp | Timestamp in a video | int
Message-ID | Unique identifier of a chat message | int
Message-Text | Content of a chat message | string
Object-Type | Type of an object an action refers to | string
Project-ID | Unique identifier of a project | int
Research-Task-Type | Type of research task (if applicable, coded by research team, see table below) | string
Section-Content | Content of a section (in a specific version) | string
Section-Outline-Level | Outline level of a section (in a specific version) | int
Section-ID | Unique identifier of a section | int
Section-Index | Position of a section in the project (in a specific version) | int
Section-Status | Status of a section | int
Section-Title | Title of a section (in a specific version) | string
Sequence-Description | Description of a video sequence | string
Sequence-Duration | Length of a video sequence | int
Sequence-ID | Unique identifier of a sequence | int
Sequence-Timestamp | Timestamp of the start of a sequence in a video | int
timestamp | Timestamp of an event | datetime
Verb | Action type of an event (see table below) | string

Verbs

Value | Description
canceled editing of | Actor canceled editing of a section
clicked | Actor clicked a link
collapsed | Actor collapsed a section (hides its content from being viewed)
compared versions of | Actor compared two versions of a section
created | Actor created a new section, video sequence, video annotation, video playback command, project or news
deleted | Actor deleted a section, video sequence, video annotation, video playback command, project or news
ended | Actor played a video hitting its end
expanded | Actor expanded a collapsed section
inserted | Actor inserted a video comment (on occasions instead of created)
left | Actor left a context (e.g. a project, a chat window) by e.g. closing it using platform functions, changing a browser tab, etc.
mentioned | Actor mentioned another actor in a chat message
opened | Actor opened a context (e.g. a project, a chat window) by e.g. accessing it using platform functions or changing a browser tab
paused | Actor paused a video
played | Actor played a video
read | Actor read an activity message or news
read all messages and activities of | Actor used switch to mark all chat and activity messages read
restored | Actor restored a deleted section
reverted | Actor restored a deleted section
reverted version of | Actor reverted a section to an earlier version
seeked | Actor seeked on a video timeline
sent | Actor sent a chat message
started editing of | Actor started editing of a section
switched | Actor switched chat focus between project and section chat
typed | Actor typed into the chat
updated | Actor updated an existing section (changing content, heading, heading-depth or status), video sequence, video annotation, video playback command, project or news
uploaded | Actor uploaded an image or video
viewed | Actor viewed an entity (had it on screen for 5 seconds), e.g. a section or video comment
viewed history of | Actor viewed history of a section
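
The column and verb descriptions above suggest a simple first analysis of events.csv: tallying how often each verb occurs. The sketch below is an illustration under assumptions. The sample rows, actor names, object types and timestamp format are invented, and treating an Actor-Name that starts with "MA" as project staff is a guess at how the “MA” marker is encoded.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; the first header is empty, as described for the index column.
SAMPLE_EVENTS = """,Actor-Name,Verb,Object-Type,timestamp
0,S01,opened,project,2021-04-01 10:00:00
1,S01,played,video,2021-04-01 10:01:00
2,MA02,sent,message,2021-04-01 10:02:00
3,S02,played,video,2021-04-01 10:03:00
"""


def verb_counts(handle, skip_staff=True):
    """Count events per verb, optionally ignoring project staff actors."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Assumption: staff identifiers begin with the "MA" marker.
        if skip_staff and row["Actor-Name"].startswith("MA"):
            continue
        counts[row["Verb"]] += 1
    return counts


student_counts = verb_counts(io.StringIO(SAMPLE_EVENTS))
print(student_counts.most_common())
```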

Project-ID

Project-ID | Case-IDs | Title | Type | Period
2 | a1-a* | Urbane Grünflächen | Research project | 1.11.20-31.3.21
4 | b1-b* | Nachhaltiger Verkehr | Research project | 1.11.20-31.3.21
166 | c1-c* | UGF - Urbane Grünflächen | Research project | 1.4.21-30.09.21
168 | | LGS - Foyer | Onboarding of students in LGS Projects | 1.4.21-30.09.21
188 | | LGS - Reflexionsraum | Reflection project for students in LGS Projects | 1.4.21-30.09.21
207 | | LGS - Nachhaltiger Konsum | Research project | 1.4.21-30.09.21
210 | | LGS - Bildungsangebote für nachhaltige Entwicklung | Research project | 1.4.21-30.09.21
264 | | LGS - Fahrradmobilität in Städten | Research project | 1.4.21-30.09.21
271 | | Fahrradmobilität in Städten | Research project | 1.10.21-30.11.21
269 | e1-e* | Kaufentscheidung vs. Nachhaltigkeit | Research
Replication Data for: Automatic Collective Behaviour Recognition
dataverse.harvard.edu
dataone.org
Updated Nov 14, 2022
Cite
Shadi Abpeikar (2022). Replication Data for: Automatic Collective Behaviour Recognition [Dataset]. http://doi.org/10.7910/DVN/S1YJOX
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/S1YJOX
Dataset updated
Nov 14, 2022
Dataset provided by
Harvard Dataverse
Authors
Shadi Abpeikar
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Collective behaviour, such as flocks of birds and schools of fish, has inspired computer-based systems and is widely used in agent formation. Humans can easily recognise these behaviours; however, it is hard for a computer system to do so. Ground truth data on human perception of collective behaviour could therefore enable machine learning methods to mimic this human perception. Such ground truth data has been collected by running an online survey. The collective motions considered in this survey include 16 structured and unstructured behaviours. The structured collective motions include boids’ movements with an identifiable embedded pattern. Unstructured collective motions consist of random movement of boids with no patterns. The participants have diverse levels of knowledge, come from all over the world, and are over 18 years old. Each question contains a short video (around 10 seconds), captured from one of the 16 simulated movements. The videos are shown to the participants in a randomized order. Participants were then asked to label each structured motion of boids as ‘flocking’, ‘aligned’, or ‘grouped’ and others as ‘not flocking’, ‘not aligned’, or ‘not grouped’. By averaging human perceptions, three binary labelled datasets of these motions are created. Machine learning methods can be trained on this data, enabling them to automatically recognise collective behaviour.
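
One plausible reading of "averaging human perceptions" is sketched below: each participant's answer for a video is coded as 1 or 0, the answers are averaged, and the mean is thresholded to obtain a binary label. The 0.5 threshold, the tie handling and the sample votes are assumptions for illustration, not details taken from the dataset description.

```python
from statistics import mean


def binary_label(votes, threshold=0.5):
    """Turn per-participant 0/1 votes into one binary label by averaging."""
    if not votes:
        raise ValueError("at least one vote is required")
    # Assumption: a mean at or above the threshold counts as a positive label.
    return 1 if mean(votes) >= threshold else 0


# Hypothetical votes for the 'flocking' question on three videos.
votes_per_video = {
    "video_01": [1, 1, 0, 1],
    "video_02": [0, 0, 1, 0],
    "video_03": [1, 0],
}
labels = {video: binary_label(votes) for video, votes in votes_per_video.items()}
print(labels)
```

Repeating this for the 'aligned' and 'grouped' questions would yield three binary labelled datasets over the same set of videos.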
The algorithms ranking.
plos.figshare.com
bin
Updated Jun 16, 2023
+ more versions
Cite
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). The algorithms ranking. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t006
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0274569.t006
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The algorithms ranking.
3D Microvascular Image Data and Labels for Machine Learning
rdr.ucl.ac.uk
datasetcatalog.nlm.nih.gov
bin
Updated Apr 30, 2024
Cite
Natalie Holroyd; Claire Walsh; Emmeline Brown; Emma Brown; Yuxin Zhang; Carles Bosch Pinol; Simon Walker-Samuel (2024). 3D Microvascular Image Data and Labels for Machine Learning [Dataset]. http://doi.org/10.5522/04/25715604.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5522/04/25715604.v1
Dataset updated
Apr 30, 2024
Dataset provided by
University College London
Authors
Natalie Holroyd; Claire Walsh; Emmeline Brown; Emma Brown; Yuxin Zhang; Carles Bosch Pinol; Simon Walker-Samuel
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. This data was used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.

Filenames are structured as follows:
Data - [Modality]_[species Organ]_[resolution].tif
Labels - [Modality]_[species Organ]_[resolution]_labels.tif
Sub-volumes of larger dataset - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif

Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).

Training data:
opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging it using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.
CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.
RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background (Brown et al., 2019).
OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).

Test data:
MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house.
MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model was imaged in house using multi-fluorescence HREM (MF-HREM), with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data has been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).
MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight 647 conjugated lectin, was acquired in house (Walsh et al., 2021). The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).
2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, was kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).

References:
Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1–16. https://doi.org/10.1038/s41467-022-30199-6
Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636
Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110
Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235–249. https://doi.org/10.1007/978-3-030-52791-4_19
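The filename convention described for this dataset can be unpacked programmatically. The following is a minimal sketch, not code shipped with the dataset: the names `VolumeName` and `parse_volume_name` are hypothetical, and the regular expression is an assumption inferred from the three filename patterns and the example filenames listed in the description.

```python
import re
from typing import NamedTuple, Optional, Tuple


class VolumeName(NamedTuple):
    """Fields recovered from a filename (hypothetical helper type)."""
    modality: str
    sample: str                                # the "[species Organ]" part
    resolution: Optional[Tuple[float, ...]]    # voxel size per axis, if given
    unit: Optional[str]                        # e.g. "um" or "mm"
    subvolume: Optional[Tuple[int, ...]]       # dimensions in pixels, if given
    is_label: bool


# Assumed pattern: [Modality]_[species Organ]_[resolution | subvolume<dims>][_labels].tif
_PATTERN = re.compile(
    r"^(?P<modality>[^_]+)_(?P<sample>[^_]+)_"
    r"(?:subvolume(?P<dims>\d+(?:x\d+)+)"
    r"|(?P<res>\d+(?:\.\d+)?(?:x\d+(?:\.\d+)?)+)(?P<unit>[a-z]+))"
    r"(?P<labels>_labels)?\.tif$"
)


def parse_volume_name(filename: str) -> Optional[VolumeName]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    m = _PATTERN.match(filename)
    if m is None:
        return None
    res = tuple(float(v) for v in m["res"].split("x")) if m["res"] else None
    dims = tuple(int(v) for v in m["dims"].split("x")) if m["dims"] else None
    return VolumeName(
        modality=m["modality"],
        sample=m["sample"],
        resolution=res,
        unit=m["unit"],
        subvolume=dims,
        is_label=m["labels"] is not None,
    )


if __name__ == "__main__":
    for name in (
        "opticalHREM_murineLiver_2.26x2.26x1.75um.tif",
        "MFHREM_murineTumourLectin_subvolume480x480x640.tif",
    ):
        print(parse_volume_name(name))
```

Names that deviate from the stated convention, such as MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif, return None under this sketch and would need to be handled separately.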
Data from: DeepLabCut: markerless pose estimation of user-defined body parts...
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 13, 2023
Cite
Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge (2023). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning [Dataset]. http://doi.org/10.5281/zenodo.4008504
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4008504
Dataset updated
Oct 13, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data entry contains a public release of annotated mouse data from the DeepLabCut Nature Neuroscience paper. The trail-tracking behavior is part of an investigation into odor-guided navigation, where one or multiple wildtype (C57BL/6J) mice run on a paper spool and follow odor trails. These experiments were carried out by Alexander Mathis & Mackenzie Mathis in the Murthy lab at Harvard University.

Data were recorded at 30 Hz by two different cameras: at 640×480 pixels with a Point Grey Firefly (FMVU-03MTM-CS), and at approximately 1,700×1,200 pixels with a Grasshopper 3 4.1MP Mono USB3 Vision (CMOSIS CMV4000-3E12). The latter images were cropped around the mice to generate images of approximately 800×800 pixels.

Here we share 1,066 frames from multiple experimental sessions observing 7 different mice. Pranav Mamidanna labeled the snout, the tips of the left and right ears, and the base of the tail in the example images. The data is organized in the DeepLabCut 2.0 project structure, with images and annotations in the labeled-data folder. The names are pseudocodes indicating mouse ID and session ID, e.g. m4s1 = mouse 4, session 1.
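The pseudocode naming scheme can be decoded with a few lines of Python. This is an illustrative sketch only: `parse_session_name` and `group_by_mouse` are hypothetical helpers, not part of DeepLabCut, and apart from m4s1 the folder names in the usage example are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed form of the pseudocodes: "m<mouse id>s<session id>", e.g. "m4s1".
_SESSION = re.compile(r"^m(?P<mouse>\d+)s(?P<session>\d+)$")


def parse_session_name(name: str) -> Tuple[int, int]:
    """Split a pseudocode such as 'm4s1' into (mouse id, session id)."""
    m = _SESSION.match(name)
    if m is None:
        raise ValueError(f"not a mouse/session pseudocode: {name!r}")
    return int(m["mouse"]), int(m["session"])


def group_by_mouse(names: Iterable[str]) -> Dict[int, List[int]]:
    """Map each mouse id to the sorted list of its session ids."""
    sessions: Dict[int, List[int]] = defaultdict(list)
    for name in names:
        mouse, session = parse_session_name(name)
        sessions[mouse].append(session)
    return {mouse: sorted(ids) for mouse, ids in sessions.items()}


if __name__ == "__main__":
    # Only "m4s1" appears in the description; the other names are hypothetical.
    print(group_by_mouse(["m4s1", "m4s2", "m1s1"]))
```

Grouping by mouse ID in this way makes it straightforward to hold out all frames from one animal when evaluating a trained network, so that test frames do not come from a mouse seen during training.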

Code for loading, visualizing and training deep neural networks is available at https://github.com/DeepLabCut/DeepLabCut.
