https://www.nist.gov/open/license
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrating Materials and Manufacturing Innovation. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a log-likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase-mapping algorithms. We propose a materials data challenge centered on the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
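For readers who want to experiment with the evaluation idea before downloading the full code, here is a minimal, hypothetical Python sketch of scoring model predictions against a Gaussian summary of the consensus expert labels and their variance; the function and the example numbers are illustrative assumptions, not the authors' implementation.

import numpy as np

def log_likelihood_score(predictions, expert_means, expert_vars):
    # Gaussian log-density of each prediction under the consensus
    # label (mean) and the expert disagreement (variance); higher
    # totals indicate closer agreement with the consensus.
    predictions = np.asarray(predictions, dtype=float)
    expert_means = np.asarray(expert_means, dtype=float)
    expert_vars = np.asarray(expert_vars, dtype=float)
    ll = -0.5 * (np.log(2 * np.pi * expert_vars)
                 + (predictions - expert_means) ** 2 / expert_vars)
    return ll.sum()

# Rank two hypothetical models against the same consensus labels.
consensus_mean = [310.0, 452.5]   # e.g., transformation start/end points
consensus_var = [4.0, 25.0]       # variance of the expert labels
print(log_likelihood_score([311.0, 450.0], consensus_mean, consensus_var))
print(log_likelihood_score([320.0, 440.0], consensus_mean, consensus_var))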
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images collected by the Mars Science Laboratory (MSL) Curiosity Rover with three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera (Mastcam) Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1-948; validation set images were randomly sampled from sol range 949-1920; test set images were randomly sampled from sol range 1921-2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
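As a small illustration of the preprocessing described above, the following Python sketch resizes an image to 227 x 227 pixels without preserving the aspect ratio (the file name is hypothetical; Pillow is assumed).

from PIL import Image

# Resize to 227 x 227, ignoring the original aspect ratio,
# as described for this data set.
img = Image.open("example-mahli-image.jpg")
img.resize((227, 227)).save("example-mahli-image-227.jpg")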
Directory Contents
images - contains all 6,820 images
class_map.csv - string-integer class mappings
train-set-v2.1.txt - label file for the training set
val-set-v2.1.txt - label file for the validation set
test-set-v2.1.txt - label file for the test set
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see the Contributor list). The final labels were determined using the following process:
If all three labels agree with each other, then use the label as the final label.
If the three labels do not agree with each other, then we manually review the labels and decide the final label.
We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.
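The unanimity check in the first step is straightforward to automate; a minimal sketch (with the manual-review case of the second step represented by a None flag) might look like this.

def final_label(votes):
    # Unanimous vote: use the label; otherwise flag for manual review.
    return votes[0] if len(set(votes)) == 1 else None

assert final_label([3, 3, 3]) == 3      # all three labels agree
assert final_label([3, 4, 3]) is None   # disagreement: manual review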
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, string-integer mappings, and distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images: 3,920 of the 5,920 training images are augmented versions of the remaining 2,000 original images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using horizontal flipping only, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set-v2.1.txt file to obtain the set of non-augmented images (see the sketch after this list).
90 degrees clockwise rotation (file name ends with -r90.jpg)
180 degrees clockwise rotation (file name ends with -r180.jpg)
270 degrees clockwise rotation (file name ends with -r270.jpg)
Horizontal flip (file name ends with -fh.jpg)
Vertical flip (file name ends with -fv.jpg)
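Based on the suffixes above, a filtering sketch (assuming the label-file layout described earlier) could look like this.

# Drop augmented images by their file-name suffixes to recover
# the original (non-augmented) training images.
AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

with open("train-set-v2.1.txt") as f:
    originals = [line.split()[0] for line in f
                 if not line.split()[0].endswith(AUG_SUFFIXES)]

print(len(originals), "non-augmented training images")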
Acknowledgment
The authors would like to thank the volunteers (listed as Contributors) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters seldom can tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.
Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.
Method: We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used word embedding techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these embeddings to fit five classes of ML methods and evaluate them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields, and we evaluated whether considering only EH keywords is enough to achieve high predictive performance.
Results: Our results show that a pre-trained DistilBERT with a linear layer trained on our proposed dataset can label EH bugs reasonably well, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall of up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change based on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).
Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatically labeling EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, is promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmark, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.
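To make the traditional NLP+ML pipeline concrete, here is a hedged scikit-learn sketch of the TF-IDF + classifier + ROC-AUC setup; the toy reports and the logistic-regression choice are illustrative assumptions, not the study's exact configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for bug-report text fields (1 = EH bug, 0 = non-EH).
texts = [
    "exception swallowed in empty catch block causes silent data loss",
    "button label is misaligned on the settings page",
    "finally block rethrows the wrong exception type",
    "typo in the install guide documentation",
] * 25
labels = [1, 0, 1, 0] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

scores = clf.predict_proba(vec.transform(X_test))[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, scores))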
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT. GIANT is a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the citation-style language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
https://www.archivemarketresearch.com/privacy-policy
The Data Labeling Solutions and Services market is experiencing robust growth, driven by the escalating demand for high-quality training data in the artificial intelligence (AI) and machine learning (ML) sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $75 billion by 2033. This expansion is fueled by several key factors. Firstly, the increasing adoption of AI across diverse industries, including automotive, healthcare, and finance, necessitates vast amounts of accurately labeled data for model training and improvement. Secondly, advancements in deep learning algorithms and the emergence of sophisticated data annotation tools are streamlining the labeling process, boosting efficiency and reducing costs. Finally, the growing availability of diverse data sources, coupled with the rise of specialized data labeling companies, is further contributing to market growth. Despite these positive trends, the market faces some challenges. The high cost associated with data annotation, particularly for complex datasets requiring specialized expertise, can be a barrier for smaller businesses. Ensuring data quality and consistency across large-scale projects remains a critical concern, necessitating robust quality control measures. Furthermore, addressing data privacy and security issues is essential to maintain ethical standards and build trust within the market. The market segmentation by type (text, image/video, audio) and application (automotive, government, healthcare, financial services, etc.) presents significant opportunities for specialized service providers catering to niche needs. Competition is expected to intensify as new players enter the market, focusing on innovative solutions and specialized services.
Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining, BioMed Central Ltd, London, UK, 7 (2022).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming for conducting systematic reviews (SRs). Supervised machine learning can help fully automate it, but a lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, an approach that is often too expensive for the scientific community to apply. Depending on the specific SR question, PICO criteria are extended to PICOC (C-Context), PICOT (T-Timeframe), and PIBOSO (B-Background, S-Study design, O-Other), meaning the static hand-labelled corpora need to undergo costly re-annotation to meet downstream requirements. We aim to test the feasibility of designing a weak supervision system to extract these entities without hand-labelled data.
Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly-supervised discriminative model, and we observe the resulting performance changes. We also explore mistakes in the currently available PICO corpus that could have led to inaccurate evaluation of several automation methods.
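The simple majority-voting aggregation mentioned above can be sketched as follows (a minimal illustration; the generative-modelling alternative is not shown, and -1 as an abstain marker is an assumption).

import numpy as np

def majority_vote(votes, n_classes, abstain=-1):
    # votes: (n_items, n_sources) integer array of noisy labels,
    # with `abstain` marking sources that did not fire on an item.
    out = np.full(votes.shape[0], abstain)
    for i, row in enumerate(votes):
        cast = row[row != abstain]
        if cast.size:
            out[i] = np.bincount(cast, minlength=n_classes).argmax()
    return out

# Three labeling sources vote on four items.
votes = np.array([[0, 0, 1],
                  [1, -1, 1],
                  [2, 2, 2],
                  [-1, -1, -1]])
print(majority_vote(votes, n_classes=3))   # [0 1 2 -1]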
Results: We present Weak-PICO, a weakly-supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries and expert-generated rules. Our approach does not use hand-labelled data.
Conclusion: Weak supervision using Weak-PICO for PICO entity recognition shows encouraging results, and the approach can potentially extend readily to more clinical entities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
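A minimal pandas sketch of such a binning step (the bin edges are illustrative assumptions, not the ones used in the scripts):

import pandas as pd

# Bin a continuous regression target into ordinal class labels.
y = pd.Series([0.2, 1.4, 2.7, 3.9, 4.1, 0.8])
class_label = pd.cut(y, bins=[0, 1, 2, 3, 4, 5], labels=False)
print(class_label.tolist())   # ordinal classes 0..4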
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
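To replicate a sample from the index files, one can select the listed data items by position; a sketch (the extracted-data file name and the header-less index layout are assumptions):

import pandas as pd

data = pd.read_csv("extracted_data.csv")              # output of extract-oq.jl (name assumed)
indices = pd.read_csv("app_val_indices.csv", header=None)

# Each row of the index file defines one evaluation sample.
sample_0 = data.iloc[indices.iloc[0].dropna().astype(int).to_numpy()]
print(sample_0["class_label"].value_counts())         # label distribution of the sample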
Usage
You can extract the four CSV files with the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml files specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
https://dataverse.no/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.18710/HSMJLL
The dataset comprises the pretraining and testing data for our work: Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations. The pretraining data consist of images corresponding to the Digital Surface Models (DSM) and Digital Terrain Models (DTM) obtained from Norway, with a ground resolution of 1 meter, using the UTM 33N projection. The primary data source for this dataset is the Norwegian Mapping Authority (Kartverket), which has made the data freely available on its website under the CC BY 4.0 license (source: https://hoydedata.no/, license terms: https://creativecommons.org/licenses/by/4.0/). The DSM and DTM models are generated from 3D LiDAR point clouds collected through periodic aerial campaigns. During these campaigns, the LiDAR sensors capture data with a maximum offset of 20 degrees from nadir. Additionally, a subset of the data also includes building footprints/labels created using the OpenStreetMap (OSM) database. Specifically, building footprints extracted from the OSM database were rasterized to match the grid of the DTM and DSM models. These rasterized labels are made available under the Open Database License (ODbL) in compliance with the OSM license requirements. We hope this dataset facilitates various applications in geographic analysis, remote sensing, and machine learning research.
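As a rough illustration of the rasterization step described above (not the dataset authors' code), a geopandas/rasterio sketch might look like this; the file names are hypothetical, and the EPSG code for UTM 33N is an assumption that depends on the datum used.

import geopandas as gpd
import rasterio
from rasterio import features

# Rasterize OSM building footprints onto the grid of a DTM tile.
footprints = gpd.read_file("osm_buildings.gpkg").to_crs(epsg=25833)  # UTM 33N (datum assumed)

with rasterio.open("dtm_tile.tif") as dtm:
    label_raster = features.rasterize(
        ((geom, 1) for geom in footprints.geometry),  # 1 = building
        out_shape=(dtm.height, dtm.width),
        transform=dtm.transform,
        fill=0,                                       # 0 = background
        dtype="uint8",
    )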
https://www.marketresearchforecast.com/privacy-policy
The open-source data labeling tool market is experiencing robust growth, driven by the increasing demand for high-quality training data in the burgeoning artificial intelligence (AI) and machine learning (ML) sectors. The market's expansion is fueled by several key factors. Firstly, the rising adoption of AI across various industries, including healthcare, automotive, and finance, necessitates large volumes of accurately labeled data. Secondly, open-source tools offer a cost-effective alternative to proprietary solutions, making them attractive to startups and smaller companies with limited budgets. Thirdly, the collaborative nature of open-source development fosters continuous improvement and innovation, leading to more sophisticated and user-friendly tools. While the cloud-based segment currently dominates due to scalability and accessibility, on-premise solutions maintain a significant share, especially among organizations with stringent data security and privacy requirements. The geographical distribution reveals strong growth in North America and Europe, driven by established tech ecosystems and early adoption of AI technologies. However, the Asia-Pacific region is expected to witness significant growth in the coming years, fueled by increasing digitalization and government initiatives promoting AI development. The market faces some challenges, including the need for skilled data labelers and the potential for inconsistencies in data quality across different open-source tools. Nevertheless, ongoing developments in automation and standardization are expected to mitigate these concerns. The forecast period of 2025-2033 suggests a continued upward trajectory for the open-source data labeling tool market. Assuming a conservative CAGR of 15% (a reasonable estimate given the rapid advancements in AI and the increasing need for labeled data), and a 2025 market size of $500 million (a plausible figure considering the significant investments in the broader AI market), the market is projected to reach approximately $1.8 billion by 2033. This growth will be further shaped by the ongoing development of new features, improved user interfaces, and the integration of advanced techniques such as active learning and semi-supervised learning within open-source tools. The competitive landscape is dynamic, with both established players and emerging startups contributing to the innovation and expansion of this crucial segment of the AI ecosystem. Companies are focusing on improving the accuracy, efficiency, and accessibility of their tools to cater to a growing and diverse user base.
https://www.archivemarketresearch.com/privacy-policy
The unsupervised learning market is experiencing robust growth, driven by the increasing volume of unstructured data and the need for businesses to extract valuable insights without pre-defined labels. This market is projected to reach $XX billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of XX% during the forecast period of 2025-2033. This substantial growth is fueled by several key trends, including the rising adoption of cloud-based solutions for enhanced scalability and cost-effectiveness, the proliferation of big data analytics applications across various industries, and the increasing demand for advanced anomaly detection and pattern recognition capabilities. The market segmentation reveals a significant contribution from large enterprises due to their higher budgets and complex data management needs, while the cloud-based segment dominates owing to its flexibility and accessibility. Key players like Microsoft, IBM, and Google are heavily investing in R&D and strategic partnerships to consolidate their market share and capitalize on emerging opportunities in areas such as fraud detection, customer segmentation, and predictive maintenance. The market faces challenges such as the complexity of implementing unsupervised learning algorithms and the need for skilled data scientists; however, ongoing technological advancements and the growing availability of user-friendly tools are mitigating these restraints. The continued growth trajectory is anticipated to be further propelled by advancements in deep learning techniques, particularly in areas like generative adversarial networks (GANs) and autoencoders, which are enhancing the accuracy and efficiency of unsupervised learning models. The geographical distribution of the market shows strong performance in North America and Europe, due to early adoption and well-established technological infrastructure. However, the Asia-Pacific region presents a significant growth opportunity, driven by rapid digitalization and increasing investments in data analytics capabilities within emerging economies like India and China. The competitive landscape is characterized by both established technology giants and specialized AI startups, leading to continuous innovation and a wide range of solutions tailored to specific industry needs. The overall outlook for the unsupervised learning market remains highly promising, with significant potential for expansion across various sectors.
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. These data were used to train, test, and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.
Filenames are structured as follows:
Data - [Modality][species Organ][resolution].tif
Labels - [Modality][species Organ][resolution]labels.tif
Sub-volumes of larger dataset - [Modality][species Organ]_subvolume[dimensions in pixels].tif
Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).
Training data:
opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high-resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging with a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.
CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.
RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data have undergone filtering to reduce the background (Brown et al., 2019).
OCTA_humanRetina_24x24um.tif: Retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).
Test data:
MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house.
MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: A subcutaneous colorectal tumour mouse model imaged in house using multi-fluorescence HREM, with DyLight 647-conjugated lectin staining the vasculature (Walsh et al., 2021). The image data have been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).
MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: An MF-HREM image of the cortex of a mouse brain, stained with DyLight 647-conjugated lectin, acquired in house (Walsh et al., 2021). The image data have been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).
2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: Two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).
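Given the naming convention above, data volumes and their manual labels can be paired programmatically; a small Python sketch (run in the directory holding the .tif files):

import glob
import os

# Pair each image volume with its label volume: labels share the
# data file's stem and end in "labels.tif".
volumes = {}
for path in glob.glob("*.tif"):
    stem = os.path.basename(path)[:-len(".tif")]
    if stem.endswith("labels"):
        volumes.setdefault(stem[:-len("labels")], {})["labels"] = path
    else:
        volumes.setdefault(stem, {})["data"] = path

for stem, pair in sorted(volumes.items()):
    print(stem, "->", pair)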
References:
Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1-16. https://doi.org/10.1038/s41467-022-30199-6
Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636
Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110
Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235-249. https://doi.org/10.1007/978-3-030-52791-4_19
TagX data annotation services are a set of tools and processes used to accurately label and classify large amounts of data for use in machine learning and artificial intelligence applications. The services are designed to be highly accurate, efficient, and customizable, allowing for a wide range of data types and use cases.
The process typically begins with a team of trained annotators reviewing and categorizing the data, using a variety of annotation tools and techniques, such as text classification, image annotation, and video annotation. The annotators may also use natural language processing and other advanced techniques to extract relevant information and context from the data.
Once the data has been annotated, it is then validated and checked for accuracy by a team of quality assurance specialists. Any errors or inconsistencies are corrected, and the data is then prepared for use in machine learning and AI models.
TagX annotation services can be applied to a wide range of data types, including text, images, videos, and audio. The services can be customized to meet the specific needs of each client, including the type of data, the level of annotation required, and the desired level of accuracy.
TagX data annotation services provide a powerful and efficient way to prepare large amounts of data for use in machine learning and AI applications, allowing organizations to extract valuable insights and improve their decision-making processes.
https://www.archivemarketresearch.com/privacy-policy
The AI Data Labeling Solutions market is experiencing robust growth, driven by the increasing demand for high-quality data to train and improve the accuracy of AI and machine learning models. The market size in 2025 is estimated at $2.5 billion, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This substantial growth is fueled by several key factors. The proliferation of AI applications across diverse sectors like healthcare, automotive, and finance necessitates extensive data labeling. The rise of sophisticated AI algorithms that require larger and more complex datasets is another major driver. Cloud-based solutions are gaining significant traction due to their scalability, cost-effectiveness, and ease of access, contributing significantly to market expansion. However, challenges remain, including data privacy concerns, the need for skilled data labelers, and the potential for bias in labeled data. These restraints need to be addressed to ensure the sustainable and responsible growth of the market. The segmentation of the market reveals a diverse landscape. Cloud-based solutions currently dominate, reflecting the industry shift toward flexible and scalable data processing. Application-wise, the IT sector is currently the largest consumer, followed by automotive and healthcare. However, growth in financial services and other sectors indicates the broadening application of AI data labeling solutions. Key players in the market are constantly innovating to improve accuracy, efficiency, and cost-effectiveness, leading to a competitive and rapidly evolving market. The regional distribution shows strong market presence in North America and Europe, driven by early adoption of AI technologies and a well-established technological infrastructure. Asia-Pacific is also demonstrating significant growth potential due to increasing technological advancements and investments in AI research and development. The forecast period of 2025-2033 presents substantial opportunities for market expansion, contingent upon addressing the challenges and leveraging emerging technologies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292
Abstract
The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.
After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into (i) one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, or neutral; (ii) hate or not hate; and (iii) stress/anxiety detected or not detected.
These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.
The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.
The following table describes the attributes of this dataset:
Attribute Name | Attribute Description
Post ID | Unique ID of each Instagram post
Post Description | Complete description of each post in the language in which it was originally published
Date | Date of publication in MM/DD/YYYY format
Language | Language of the post as detected using the Google Translate API
Translated Post Description | Translated version of the post description. All posts that were not in English were translated into English using the Google Translate API; no translation was performed for English posts.
Sentiment | Result of sentiment analysis (using the translated post description), where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, or neutral
Hate | Result of hate speech detection (using the translated post description), where each post was classified as hate or not hate
Anxiety or Stress | Result of anxiety or stress detection (using the translated post description), where each post was classified as stress/anxiety detected or not detected
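A short pandas sketch for summarizing these label attributes (the CSV file name is hypothetical; the attribute names follow the table above):

import pandas as pd

posts = pd.read_csv("mpox_instagram_posts.csv")
print(posts["Language"].value_counts().head())    # most common languages
print(posts["Sentiment"].value_counts())          # sentiment class counts
print(posts["Hate"].value_counts())               # hate vs. not hate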
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes the audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing because of low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
There are over 7,000 unique rare diseases, some of which affect 3,500 or fewer patients in the US. Due to clinicians' limited experience with such diseases and the considerable heterogeneity of their clinical presentations, many patients with rare genetic diseases remain undiagnosed. While artificial intelligence has demonstrated success in assisting diagnosis, its success is usually contingent on the availability of large annotated datasets. Here, we present SHEPHERD, a deep learning approach for multi-faceted rare disease diagnosis. To overcome the limitations of supervised learning, SHEPHERD performs label-efficient training by (1) training exclusively on simulated rare disease patients without the use of any real labeled data and (2) incorporating external knowledge of known phenotype, gene, and disease associations via knowledge-guided deep learning. This repository houses (1) the preprocessed rare disease knowledge graph, (2) the simulated patients used for training SHEPHERD, and (3) the myGene2 rare disease patients used for evaluation. The accompanying GitHub repository can be found at: https://github.com/mims-harvard/SHEPHERD.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning can be used to predict fault properties such as shear stress, friction, and time to failure using continuous records of fault zone acoustic emissions. The files are extracted features and labels from lab data (experiment p4679). The features are extracted with a non-overlapping window from the original acoustic data. The first column is the time of the window. The second and third columns are the mean and the variance of the acoustic data in this window, respectively. The 4th through 11th columns are the power spectral density, ranging from low to high frequency. The last column is the corresponding label (shear stress level). The name of each file indicates which driving velocity the sequence was generated from. Data were generated from laboratory friction experiments conducted with a biaxial shear apparatus. Experiments were conducted in the double direct shear configuration, in which two fault zones are sheared between three rigid forcing blocks. Our samples consisted of two 5-mm-thick layers of simulated fault gouge with a nominal contact area of 10 by 10 cm^2. Gouge material consisted of soda-lime glass beads with initial particle size between 105 and 149 micrometers. Prior to shearing, we impose a constant fault normal stress of 2 MPa using a servo-controlled load-feedback mechanism and allow the sample to compact. Once the sample has reached a constant layer thickness, the central block is driven down at a constant rate of 10 micrometers per second. In tandem, we collect an AE signal continuously at 4 MHz from a piezoceramic sensor embedded in a steel forcing block about 22 mm from the gouge layer. The data from this experiment can be used to train deep learning algorithms for future fault property prediction.
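A minimal NumPy sketch for loading one of these feature files according to the column layout above (the file name is hypothetical):

import numpy as np

data = np.loadtxt("features_v10.txt")    # name indicates driving velocity
t = data[:, 0]                           # window time
mean, var = data[:, 1], data[:, 2]       # mean and variance of AE signal
psd = data[:, 3:11]                      # 4th-11th cols: PSD, low to high freq
shear_stress = data[:, 11]               # last column: label (shear stress)
print(t.shape, psd.shape, shear_stress.shape)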
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traffic camera images from the New York State Department of Transportation (511ny.org) are used to create a hand-labeled dataset of images classified into one of six road surface conditions: 1) severe snow, 2) snow, 3) wet, 4) dry, 5) poor visibility, or 6) obstructed. Six labelers (authors Sutter, Wirz, Przybylo, Cains, Radford, and Evans) went through a series of four labeling trials in which reliability across all six labelers was assessed using the Krippendorff's alpha (KA) metric (Krippendorff, 2007). The online tool by Dr. Freelon (Freelon, 2013; Freelon, 2010) was used to calculate reliability metrics after each trial, and the group achieved inter-coder reliability with a KA of 0.888 on the 4th trial. This process is known as quantitative content analysis, and three pieces of data used in this process are shared: 1) a PDF of the codebook, which serves as the set of rules for labeling images, 2) images from each of the four labeling trials, including the use of New York State Mesonet weather observation data (Brotzge et al., 2020), and 3) an Excel spreadsheet including the calculated inter-coder reliability metrics and other summaries used to assess reliability after each trial.
The broader purpose of this work is that the six human labelers, having achieved inter-coder reliability, can then label large sets of images independently, each contributing to the creation of a larger labeled dataset used for training supervised machine learning models to predict road surface conditions from camera images. The xCITE lab (xCITE, 2023) is used to store camera images from 511ny.org, and the lab provides computing resources for training machine learning models.
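For readers unfamiliar with the reliability metric, a minimal sketch of computing Krippendorff's alpha in Python follows, assuming the third-party krippendorff package; the toy ratings are illustrative, not the study's data.

import numpy as np
import krippendorff

# Rows are labelers, columns are images; np.nan would mark a missing
# label. Nominal level suits the six unordered surface-condition classes.
ratings = np.array([
    [1, 2, 3, 3, 2, 4],
    [1, 2, 3, 3, 2, 4],
    [1, 2, 3, 1, 2, 4],
], dtype=float)
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")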
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
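As a flavor of the simplest family of techniques discussed (EDA-style text transformations), here is a minimal Python sketch of one operation, random word swapping; it is an illustration under stated assumptions, not the study's implementation.

import random

def random_swap(text, n_swaps=1, seed=0):
    # Swap the positions of randomly chosen word pairs to create
    # an augmented copy of a training sentence.
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# Example on a Spanish sentence, matching the study's language setting.
print(random_swap("la letra de esta cancion es muy triste", n_swaps=2))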