https://www.nist.gov/open/license
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrating Materials and Manufacturing Innovation. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a log-likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase-mapping algorithms. We propose a materials data challenge centered on the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
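For readers who want to experiment with the evaluation idea before downloading the full code, here is a minimal, hypothetical Python sketch of scoring model predictions against a Gaussian summary of the consensus expert labels and their variance; the function and the example numbers are illustrative assumptions, not the authors' implementation.

import numpy as np

def log_likelihood_score(predictions, expert_means, expert_vars):
    # Gaussian log-density of each prediction under the consensus
    # label (mean) and the expert disagreement (variance); higher
    # totals indicate closer agreement with the consensus.
    predictions = np.asarray(predictions, dtype=float)
    expert_means = np.asarray(expert_means, dtype=float)
    expert_vars = np.asarray(expert_vars, dtype=float)
    ll = -0.5 * (np.log(2 * np.pi * expert_vars)
                 + (predictions - expert_means) ** 2 / expert_vars)
    return ll.sum()

# Rank two hypothetical models against the same consensus labels.
consensus_mean = [310.0, 452.5]   # e.g., transformation start/end points
consensus_var = [4.0, 25.0]       # variance of the expert labels
print(log_likelihood_score([311.0, 450.0], consensus_mean, consensus_var))
print(log_likelihood_score([320.0, 440.0], consensus_mean, consensus_var))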
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images collected by the Mars Science Laboratory (MSL) Curiosity Rover with three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera (Mastcam) Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1-948; validation set images were randomly sampled from sol range 949-1920; test set images were randomly sampled from sol range 1921-2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
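As a small illustration of the preprocessing described above, the following Python sketch resizes an image to 227 x 227 pixels without preserving the aspect ratio (the file name is hypothetical; Pillow is assumed).

from PIL import Image

# Resize to 227 x 227, ignoring the original aspect ratio,
# as described for this data set.
img = Image.open("example-mahli-image.jpg")
img.resize((227, 227)).save("example-mahli-image-227.jpg")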
Directory Contents
images - contains all 6,820 images
class_map.csv - string-integer class mappings
train-set-v2.1.txt - label file for the training set
val-set-v2.1.txt - label file for the validation set
test-set-v2.1.txt - label file for the test set
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see the Contributor list). The final labels were determined using the following process:
If all three labels agree with each other, then use the label as the final label.
If the three labels do not agree with each other, then we manually review the labels and decide the final label.
We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.
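The unanimity check in the first step is straightforward to automate; a minimal sketch (with the manual-review case of the second step represented by a None flag) might look like this.

def final_label(votes):
    # Unanimous vote: use the label; otherwise flag for manual review.
    return votes[0] if len(set(votes)) == 1 else None

assert final_label([3, 3, 3]) == 3      # all three labels agree
assert final_label([3, 4, 3]) is None   # disagreement: manual review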
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, string-integer mappings, and distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images: 3,920 of the 5,920 training images are augmented versions of the remaining 2,000 original images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using horizontal flipping only, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set-v2.1.txt file to obtain the set of non-augmented images (see the sketch after this list).
90 degrees clockwise rotation (file name ends with -r90.jpg)
180 degrees clockwise rotation (file name ends with -r180.jpg)
270 degrees clockwise rotation (file name ends with -r270.jpg)
Horizontal flip (file name ends with -fh.jpg)
Vertical flip (file name ends with -fv.jpg)
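Based on the suffixes above, a filtering sketch (assuming the label-file layout described earlier) could look like this.

# Drop augmented images by their file-name suffixes to recover
# the original (non-augmented) training images.
AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

with open("train-set-v2.1.txt") as f:
    originals = [line.split()[0] for line in f
                 if not line.split()[0].endswith(AUG_SUFFIXES)]

print(len(originals), "non-augmented training images")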
Acknowledgment
The authors would like to thank the volunteers (listed as Contributors) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHMs) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and AI-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters seldom can tag EH bugs, since doing so may require encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions.
Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields from bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness regarding the importance of EH bugs.
Method: We manually analyzed 4,516 bug reports from the four main components of Apache's Hadoop project, out of which we labeled ~20% (943) as EH bugs. We also labeled 2,584 non-EH bugs by analyzing their bug-fixing code, creating a dataset composed of 7,100 bug reports. Then, we used word embedding techniques (Bag-of-Words and TF-IDF) to summarize the textual fields of bug reports. Subsequently, we used these embeddings to fit five classes of ML methods and evaluate them on unseen data. We also evaluated a pre-trained transformer-based model using the complete textual fields, and we evaluated whether considering only EH keywords is enough to achieve high predictive performance.
Results: Our results show that a pre-trained DistilBERT with a linear layer trained on our proposed dataset can label EH bugs reasonably well, achieving ROC-AUC scores of up to 0.88. The combination of traditional NLP and ML techniques achieved ROC-AUC scores of up to 0.74 and recall of up to 0.56. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. Considering ROC-AUC as the primary concern, for the majority of ML methods tested, the analysis suggests that keywords alone are not sufficient to characterize reports of EH bugs, although this can change based on other metrics (such as recall and precision) or ML methods (e.g., Random Forest).
Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatically labeling EH bugs. Based on our results, we conclude that the use of ML techniques, especially transformer-based models, is promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmark, paving the way for follow-up works. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.
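To make the traditional NLP+ML pipeline concrete, here is a hedged scikit-learn sketch of the TF-IDF + classifier + ROC-AUC setup; the toy reports and the logistic-regression choice are illustrative assumptions, not the study's exact configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for bug-report text fields (1 = EH bug, 0 = non-EH).
texts = [
    "exception swallowed in empty catch block causes silent data loss",
    "button label is misaligned on the settings page",
    "finally block rethrows the wrong exception type",
    "typo in the install guide documentation",
] * 25
labels = [1, 0, 1, 0] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

scores = clf.predict_proba(vec.transform(X_test))[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, scores))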
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT. GIANT is a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the citation-style language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
https://www.archivemarketresearch.com/privacy-policy
The Data Labeling Solutions and Services market is experiencing robust growth, driven by the escalating demand for high-quality training data in the artificial intelligence (AI) and machine learning (ML) sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching approximately $75 billion by 2033. This expansion is fueled by several key factors. Firstly, the increasing adoption of AI across diverse industries, including automotive, healthcare, and finance, necessitates vast amounts of accurately labeled data for model training and improvement. Secondly, advancements in deep learning algorithms and the emergence of sophisticated data annotation tools are streamlining the labeling process, boosting efficiency and reducing costs. Finally, the growing availability of diverse data sources, coupled with the rise of specialized data labeling companies, is further contributing to market growth. Despite these positive trends, the market faces some challenges. The high cost associated with data annotation, particularly for complex datasets requiring specialized expertise, can be a barrier for smaller businesses. Ensuring data quality and consistency across large-scale projects remains a critical concern, necessitating robust quality control measures. Furthermore, addressing data privacy and security issues is essential to maintain ethical standards and build trust within the market. The market segmentation by type (text, image/video, audio) and application (automotive, government, healthcare, financial services, etc.) presents significant opportunities for specialized service providers catering to niche needs. Competition is expected to intensify as new players enter the market, focusing on innovative solutions and specialized services.
Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining, BioMed Central Ltd, London, UK, 7 (2022).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming for conducting systematic reviews (SRs). Supervised machine learning can help fully automate it, but a lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, an approach that is often too expensive for the scientific community to apply. Depending on the specific SR question, PICO criteria are extended to PICOC (C-Context), PICOT (T-Timeframe), and PIBOSO (B-Background, S-Study design, O-Other), meaning the static hand-labelled corpora need to undergo costly re-annotation to meet downstream requirements. We aim to test the feasibility of designing a weak supervision system to extract these entities without hand-labelled data.
Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly-supervised discriminative model, and we observe the resulting performance changes. We also explore mistakes in the currently available PICO corpus that could have led to inaccurate evaluation of several automation methods.
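The simple majority-voting aggregation mentioned above can be sketched as follows (a minimal illustration; the generative-modelling alternative is not shown, and -1 as an abstain marker is an assumption).

import numpy as np

def majority_vote(votes, n_classes, abstain=-1):
    # votes: (n_items, n_sources) integer array of noisy labels,
    # with `abstain` marking sources that did not fire on an item.
    out = np.full(votes.shape[0], abstain)
    for i, row in enumerate(votes):
        cast = row[row != abstain]
        if cast.size:
            out[i] = np.bincount(cast, minlength=n_classes).argmax()
    return out

# Three labeling sources vote on four items.
votes = np.array([[0, 0, 1],
                  [1, -1, 1],
                  [2, 2, 2],
                  [-1, -1, -1]])
print(majority_vote(votes, n_classes=3))   # [0 1 2 -1]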
Results: We present Weak-PICO, a weakly-supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries and expert-generated rules. Our approach does not use hand-labelled data.
Conclusion: Weak supervision using Weak-PICO for PICO entity recognition shows encouraging results, and the approach can potentially extend readily to more clinical entities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
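A minimal pandas sketch of such a binning step (the bin edges are illustrative assumptions, not the ones used in the scripts):

import pandas as pd

# Bin a continuous regression target into ordinal class labels.
y = pd.Series([0.2, 1.4, 2.7, 3.9, 4.1, 0.8])
class_label = pd.cut(y, bins=[0, 1, 2, 3, 4, 5], labels=False)
print(class_label.tolist())   # ordinal classes 0..4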
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
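To replicate a sample from the index files, one can select the listed data items by position; a sketch (the extracted-data file name and the header-less index layout are assumptions):

import pandas as pd

data = pd.read_csv("extracted_data.csv")              # output of extract-oq.jl (name assumed)
indices = pd.read_csv("app_val_indices.csv", header=None)

# Each row of the index file defines one evaluation sample.
sample_0 = data.iloc[indices.iloc[0].dropna().astype(int).to_numpy()]
print(sample_0["class_label"].value_counts())         # label distribution of the sample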
Usage
You can extract the four CSV files with the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml files specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
https://dataverse.no/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.18710/HSMJLL
The dataset comprises the pretraining and testing data for our work: Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations. The pretraining data consist of images corresponding to the Digital Surface Models (DSM) and Digital Terrain Models (DTM) obtained from Norway, with a ground resolution of 1 meter, using the UTM 33N projection. The primary data source for this dataset is the Norwegian Mapping Authority (Kartverket), which has made the data freely available on its website under the CC BY 4.0 license (source: https://hoydedata.no/, license terms: https://creativecommons.org/licenses/by/4.0/). The DSM and DTM models are generated from 3D LiDAR point clouds collected through periodic aerial campaigns. During these campaigns, the LiDAR sensors capture data with a maximum offset of 20 degrees from nadir. Additionally, a subset of the data also includes building footprints/labels created using the OpenStreetMap (OSM) database. Specifically, building footprints extracted from the OSM database were rasterized to match the grid of the DTM and DSM models. These rasterized labels are made available under the Open Database License (ODbL) in compliance with the OSM license requirements. We hope this dataset facilitates various applications in geographic analysis, remote sensing, and machine learning research.
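As a rough illustration of the rasterization step described above (not the dataset authors' code), a geopandas/rasterio sketch might look like this; the file names are hypothetical, and the EPSG code for UTM 33N is an assumption that depends on the datum used.

import geopandas as gpd
import rasterio
from rasterio import features

# Rasterize OSM building footprints onto the grid of a DTM tile.
footprints = gpd.read_file("osm_buildings.gpkg").to_crs(epsg=25833)  # UTM 33N (datum assumed)

with rasterio.open("dtm_tile.tif") as dtm:
    label_raster = features.rasterize(
        ((geom, 1) for geom in footprints.geometry),  # 1 = building
        out_shape=(dtm.height, dtm.width),
        transform=dtm.transform,
        fill=0,                                       # 0 = background
        dtype="uint8",
    )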
https://www.marketresearchforecast.com/privacy-policy
The open-source data labeling tool market is experiencing robust growth, driven by the increasing demand for high-quality training data in the burgeoning artificial intelligence (AI) and machine learning (ML) sectors. The market's expansion is fueled by several key factors. Firstly, the rising adoption of AI across various industries, including healthcare, automotive, and finance, necessitates large volumes of accurately labeled data. Secondly, open-source tools offer a cost-effective alternative to proprietary solutions, making them attractive to startups and smaller companies with limited budgets. Thirdly, the collaborative nature of open-source development fosters continuous improvement and innovation, leading to more sophisticated and user-friendly tools. While the cloud-based segment currently dominates due to scalability and accessibility, on-premise solutions maintain a significant share, especially among organizations with stringent data security and privacy requirements. The geographical distribution reveals strong growth in North America and Europe, driven by established tech ecosystems and early adoption of AI technologies. However, the Asia-Pacific region is expected to witness significant growth in the coming years, fueled by increasing digitalization and government initiatives promoting AI development. The market faces some challenges, including the need for skilled data labelers and the potential for inconsistencies in data quality across different open-source tools. Nevertheless, ongoing developments in automation and standardization are expected to mitigate these concerns. The forecast period of 2025-2033 suggests a continued upward trajectory for the open-source data labeling tool market. Assuming a conservative CAGR of 15% (a reasonable estimate given the rapid advancements in AI and the increasing need for labeled data), and a 2025 market size of $500 million (a plausible figure considering the significant investments in the broader AI market), the market is projected to reach approximately $1.8 billion by 2033. This growth will be further shaped by the ongoing development of new features, improved user interfaces, and the integration of advanced techniques such as active learning and semi-supervised learning within open-source tools. The competitive landscape is dynamic, with both established players and emerging startups contributing to the innovation and expansion of this crucial segment of the AI ecosystem. Companies are focusing on improving the accuracy, efficiency, and accessibility of their tools to cater to a growing and diverse user base.
https://www.archivemarketresearch.com/privacy-policy
The unsupervised learning market is experiencing robust growth, driven by the increasing volume of unstructured data and the need for businesses to extract valuable insights without pre-defined labels. This market is projected to reach $XX billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of XX% during the forecast period of 2025-2033. This substantial growth is fueled by several key trends, including the rising adoption of cloud-based solutions for enhanced scalability and cost-effectiveness, the proliferation of big data analytics applications across various industries, and the increasing demand for advanced anomaly detection and pattern recognition capabilities. The market segmentation reveals a significant contribution from large enterprises due to their higher budgets and complex data management needs, while the cloud-based segment dominates owing to its flexibility and accessibility. Key players like Microsoft, IBM, and Google are heavily investing in R&D and strategic partnerships to consolidate their market share and capitalize on emerging opportunities in areas such as fraud detection, customer segmentation, and predictive maintenance. The market faces challenges such as the complexity of implementing unsupervised learning algorithms and the need for skilled data scientists; however, ongoing technological advancements and the growing availability of user-friendly tools are mitigating these restraints. The continued growth trajectory is anticipated to be further propelled by advancements in deep learning techniques, particularly in areas like generative adversarial networks (GANs) and autoencoders, which are enhancing the accuracy and efficiency of unsupervised learning models. The geographical distribution of the market shows strong performance in North America and Europe, due to early adoption and well-established technological infrastructure. However, the Asia-Pacific region presents a significant growth opportunity, driven by rapid digitalization and increasing investments in data analytics capabilities within emerging economies like India and China. The competitive landscape is characterized by both established technology giants and specialized AI startups, leading to continuous innovation and a wide range of solutions tailored to specific industry needs. The overall outlook for the unsupervised learning market remains highly promising, with significant potential for expansion across various sectors.
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. These data were used to train, test, and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.
Filenames are structured as follows:
Data - [Modality][species Organ][resolution].tif
Labels - [Modality][species Organ][resolution]labels.tif
Sub-volumes of larger dataset - [Modality][species Organ]_subvolume[dimensions in pixels].tif
Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).
Training data:
opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high-resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging with a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.
CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.
RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data have undergone filtering to reduce the background (Brown et al., 2019).
OCTA_humanRetina_24x24um.tif: Retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).
Test data:
MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house.
MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: A subcutaneous colorectal tumour mouse model imaged in house using multi-fluorescence HREM, with DyLight 647-conjugated lectin staining the vasculature (Walsh et al., 2021). The image data have been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).
MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: An MF-HREM image of the cortex of a mouse brain, stained with DyLight 647-conjugated lectin, acquired in house (Walsh et al., 2021). The image data have been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).
2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: Two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).
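Given the naming convention above, data volumes and their manual labels can be paired programmatically; a small Python sketch (run in the directory holding the .tif files):

import glob
import os

# Pair each image volume with its label volume: labels share the
# data file's stem and end in "labels.tif".
volumes = {}
for path in glob.glob("*.tif"):
    stem = os.path.basename(path)[:-len(".tif")]
    if stem.endswith("labels"):
        volumes.setdefault(stem[:-len("labels")], {})["labels"] = path
    else:
        volumes.setdefault(stem, {})["data"] = path

for stem, pair in sorted(volumes.items()):
    print(stem, "->", pair)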
References:
Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1-16. https://doi.org/10.1038/s41467-022-30199-6
Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636
Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110
Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235-249. https://doi.org/10.1007/978-3-030-52791-4_19
TagX data annotation services are a set of tools and processes used to accurately label and classify large amounts of data for use in machine learning and artificial intelligence applications. The services are designed to be highly accurate, efficient, and customizable, allowing for a wide range of data types and use cases.
The process typically begins with a team of trained annotators reviewing and categorizing the data, using a variety of annotation tools and techniques, such as text classification, image annotation, and video annotation. The annotators may also use natural language processing and other advanced techniques to extract relevant information and context from the data.
Once the data has been annotated, it is then validated and checked for accuracy by a team of quality assurance specialists. Any errors or inconsistencies are corrected, and the data is then prepared for use in machine learning and AI models.
TagX annotation services can be applied to a wide range of data types, including text, images, videos, and audio. The services can be customized to meet the specific needs of each client, including the type of data, the level of annotation required, and the desired level of accuracy.
TagX data annotation services provide a powerful and efficient way to prepare large amounts of data for use in machine learning and AI applications, allowing organizations to extract valuable insights and improve their decision-making processes.
https://www.archivemarketresearch.com/privacy-policy
The AI Data Labeling Solutions market is experiencing robust growth, driven by the increasing demand for high-quality data to train and improve the accuracy of AI and machine learning models. The market size in 2025 is estimated at $2.5 billion, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This substantial growth is fueled by several key factors. The proliferation of AI applications across diverse sectors like healthcare, automotive, and finance necessitates extensive data labeling. The rise of sophisticated AI algorithms that require larger and more complex datasets is another major driver. Cloud-based solutions are gaining significant traction due to their scalability, cost-effectiveness, and ease of access, contributing significantly to market expansion. However, challenges remain, including data privacy concerns, the need for skilled data labelers, and the potential for bias in labeled data. These restraints need to be addressed to ensure the sustainable and responsible growth of the market. The segmentation of the market reveals a diverse landscape. Cloud-based solutions currently dominate, reflecting the industry shift toward flexible and scalable data processing. Application-wise, the IT sector is currently the largest consumer, followed by automotive and healthcare. However, growth in financial services and other sectors indicates the broadening application of AI data labeling solutions. Key players in the market are constantly innovating to improve accuracy, efficiency, and cost-effectiveness, leading to a competitive and rapidly evolving market. The regional distribution shows strong market presence in North America and Europe, driven by early adoption of AI technologies and a well-established technological infrastructure. Asia-Pacific is also demonstrating significant growth potential due to increasing technological advancements and investments in AI research and development. The forecast period of 2025-2033 presents substantial opportunities for market expansion, contingent upon addressing the challenges and leveraging emerging technologies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292
Abstract
The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.
After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into (i) one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, or neutral; (ii) hate or not hate; and (iii) stress/anxiety detected or not detected.
These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.
The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.
The following table describes the attributes of this dataset:
Attribute Name | Attribute Description
Post ID | Unique ID of each Instagram post
Post Description | Complete description of each post in the language in which it was originally published
Date | Date of publication in MM/DD/YYYY format
Language | Language of the post as detected using the Google Translate API
Translated Post Description | Translated version of the post description. All posts that were not in English were translated into English using the Google Translate API; no translation was performed for English posts.
Sentiment | Result of sentiment analysis (using the translated post description), where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, or neutral
Hate | Result of hate speech detection (using the translated post description), where each post was classified as hate or not hate
Anxiety or Stress | Result of anxiety or stress detection (using the translated post description), where each post was classified as stress/anxiety detected or not detected
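A short pandas sketch for summarizing these label attributes (the CSV file name is hypothetical; the attribute names follow the table above):

import pandas as pd

posts = pd.read_csv("mpox_instagram_posts.csv")
print(posts["Language"].value_counts().head())    # most common languages
print(posts["Sentiment"].value_counts())          # sentiment class counts
print(posts["Hate"].value_counts())               # hate vs. not hate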
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes the audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing because of low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
There are over 7,000 unique rare diseases, some of which affect 3,500 or fewer patients in the US. Due to clinicians' limited experience with such diseases and the considerable heterogeneity of their clinical presentations, many patients with rare genetic diseases remain undiagnosed. While artificial intelligence has demonstrated success in assisting diagnosis, its success is usually contingent on the availability of large annotated datasets. Here, we present SHEPHERD, a deep learning approach for multi-faceted rare disease diagnosis. To overcome the limitations of supervised learning, SHEPHERD performs label-efficient training by (1) training exclusively on simulated rare disease patients without the use of any real labeled data and (2) incorporating external knowledge of known phenotype, gene, and disease associations via knowledge-guided deep learning. This repository houses (1) the preprocessed rare disease knowledge graph, (2) the simulated patients used for training SHEPHERD, and (3) the myGene2 rare disease patients used for evaluation. The accompanying GitHub repository can be found at: https://github.com/mims-harvard/SHEPHERD.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning can be used to predict fault properties such as shear stress, friction, and time to failure using continuous records of fault zone acoustic emissions. The files are extracted features and labels from lab data (experiment p4679). The features are extracted with a non-overlapping window from the original acoustic data. The first column is the time of the window. The second and third columns are the mean and the variance of the acoustic data in this window, respectively. The 4th through 11th columns are the power spectral density, ranging from low to high frequency. The last column is the corresponding label (shear stress level). The name of each file indicates which driving velocity the sequence was generated from. Data were generated from laboratory friction experiments conducted with a biaxial shear apparatus. Experiments were conducted in the double direct shear configuration, in which two fault zones are sheared between three rigid forcing blocks. Our samples consisted of two 5-mm-thick layers of simulated fault gouge with a nominal contact area of 10 by 10 cm^2. Gouge material consisted of soda-lime glass beads with initial particle size between 105 and 149 micrometers. Prior to shearing, we impose a constant fault normal stress of 2 MPa using a servo-controlled load-feedback mechanism and allow the sample to compact. Once the sample has reached a constant layer thickness, the central block is driven down at a constant rate of 10 micrometers per second. In tandem, we collect an AE signal continuously at 4 MHz from a piezoceramic sensor embedded in a steel forcing block about 22 mm from the gouge layer. The data from this experiment can be used to train deep learning algorithms for future fault property prediction.
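A minimal NumPy sketch for loading one of these feature files according to the column layout above (the file name is hypothetical):

import numpy as np

data = np.loadtxt("features_v10.txt")    # name indicates driving velocity
t = data[:, 0]                           # window time
mean, var = data[:, 1], data[:, 2]       # mean and variance of AE signal
psd = data[:, 3:11]                      # 4th-11th cols: PSD, low to high freq
shear_stress = data[:, 11]               # last column: label (shear stress)
print(t.shape, psd.shape, shear_stress.shape)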
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traffic camera images from the New York State Department of Transportation (511ny.org) are used to create a hand-labeled dataset of images classified into one of six road surface conditions: 1) severe snow, 2) snow, 3) wet, 4) dry, 5) poor visibility, or 6) obstructed. Six labelers (authors Sutter, Wirz, Przybylo, Cains, Radford, and Evans) went through a series of four labeling trials in which reliability across all six labelers was assessed using the Krippendorff's alpha (KA) metric (Krippendorff, 2007). The online tool by Dr. Freelon (Freelon, 2013; Freelon, 2010) was used to calculate reliability metrics after each trial, and the group achieved inter-coder reliability with a KA of 0.888 on the 4th trial. This process is known as quantitative content analysis, and three pieces of data used in this process are shared: 1) a PDF of the codebook, which serves as the set of rules for labeling images, 2) images from each of the four labeling trials, including the use of New York State Mesonet weather observation data (Brotzge et al., 2020), and 3) an Excel spreadsheet including the calculated inter-coder reliability metrics and other summaries used to assess reliability after each trial.
The broader purpose of this work is that the six human labelers, having achieved inter-coder reliability, can then label large sets of images independently, each contributing to the creation of a larger labeled dataset used for training supervised machine learning models to predict road surface conditions from camera images. The xCITE lab (xCITE, 2023) is used to store camera images from 511ny.org, and the lab provides computing resources for training machine learning models.
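For readers unfamiliar with the reliability metric, a minimal sketch of computing Krippendorff's alpha in Python follows, assuming the third-party krippendorff package; the toy ratings are illustrative, not the study's data.

import numpy as np
import krippendorff

# Rows are labelers, columns are images; np.nan would mark a missing
# label. Nominal level suits the six unordered surface-condition classes.
ratings = np.array([
    [1, 2, 3, 3, 2, 4],
    [1, 2, 3, 3, 2, 4],
    [1, 2, 3, 1, 2, 4],
], dtype=float)
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")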
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
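As a flavor of the simplest family of techniques discussed (EDA-style text transformations), here is a minimal Python sketch of one operation, random word swapping; it is an illustration under stated assumptions, not the study's implementation.

import random

def random_swap(text, n_swaps=1, seed=0):
    # Swap the positions of randomly chosen word pairs to create
    # an augmented copy of a training sentence.
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# Example on a Spanish sentence, matching the study's language setting.
print(random_swap("la letra de esta cancion es muy triste", n_swaps=2))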