100+ datasets found

e
SYNERGY - Open machine learning dataset on study selection in systematic...
b2find.eudat.eu
Updated Jul 21, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1bea4d3c-ceef-5f63-89ed-80aeab18f601
Explore at:
Dataset updated
Jul 21, 2024
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information. The recommended way to work with the SYNERGY dataset is via the Python package "synergy-dataset". This flexible package downloads and builds the dataset.
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
d
Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...
catalog.data.gov
s.cnmilf.com
+1more
Updated Sep 30, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2025). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. https://catalog.data.gov/dataset/dataset-an-open-combinatorial-diffraction-dataset-including-consensus-human-and-machine-le
Explore at:
Dataset updated
Sep 30, 2025
Dataset provided by
National Institute of Standards and Technology
Description
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations.Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
Data from: Open Images
kaggle.com
opendatalab.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/datasets/bigquery/open-images
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Labeled datasets are useful in machine learning research.

Content

This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

Tables: 1) annotations_bbox 2) dict 3) images 4) labels

Update Frequency: Quarterly

Querying BigQuery Tables

Fork this kernel to get started.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

https://cloud.google.com/bigquery/public-data/openimages

APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

Banner Photo by Mattias Diesel from Unsplash.

Inspiration

Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?
Data from: AppClassNet - A commercial-grade dataset for application...
figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
dario rossi (2023). AppClassNet - A commercial-grade dataset for application identification research [Dataset]. http://doi.org/10.6084/m9.figshare.20375580.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20375580.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
dario rossi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
AppClassNet is a commercial-grade dataset that represent a realistic benchmark for the use-case of traffic classification and management.

The AppClassNet dataset is complemented by companion artifacts containing baseline code to train and test state of the art baseline models for a quick boostrap.

A description of the dataset, the expected performance of the baseline models, the allowed and forbidden usages of the dataset, and more is available in a companion technical report [1]

[1] https://dl.acm.org/doi/10.1145/3561954.3561958
A Dataset for Machine Learning Algorithm Development
fisheries.noaa.gov
catalog.data.gov
Updated Jan 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alaska Fisheries Science Center (AFSC) (2021). A Dataset for Machine Learning Algorithm Development [Dataset]. https://www.fisheries.noaa.gov/inport/item/63322
Explore at:
Dataset updated
Jan 1, 2021
Dataset provided by
Alaska Fisheries Science Center
Authors
Alaska Fisheries Science Center (AFSC)
Area covered
Chukchi Sea, Alaska, Beaufort Sea, Kotzebue Sound
Description
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
O
BUTTER - Empirical Deep Learning Dataset
data.openei.org
datasets.ai
+2more
code, data, website
Updated May 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek (2022). BUTTER - Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/1872441
Explore at:
code, website, dataAvailable download formats
Unique identifier
https://doi.org/10.25984/1872441
Dataset updated
May 20, 2022
Dataset provided by
Open Energy Data Initiative (OEDI)
National Renewable Energy Laboratory
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
Authors
Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek; Charles Tripp; Jordan Perr-Sauer; Lucas Hayne; Monte Lunacek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BUTTER Empirical Deep Learning Dataset represents an empirical study of the deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels of L1 and L2 regularization each. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were preformed, and statistics including training and test loss (using a 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiments), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
DataCenter-Traces-Datasets
zenodo.org
bin, csv
Updated Dec 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alejandro Fernández-Montes; Alejandro Fernández-Montes; Damián Fernández Cerero; Damián Fernández Cerero (2024). DataCenter-Traces-Datasets [Dataset]. http://doi.org/10.5281/zenodo.14564935
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14564935
Dataset updated
Dec 28, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alejandro Fernández-Montes; Alejandro Fernández-Montes; Damián Fernández Cerero; Damián Fernández Cerero
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Nov 17, 2022
Description
Public datasets organized for machine learning or artificial intelligence usage. The following dasets can be used:

Alibaba 2018 machine usage

Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

This repository dataset of machine usage includes the following columns:

+--------------------------------------------------------------------------------------------+ | Field | Type | Label | Comment | +--------------------------------------------------------------------------------------------+ | cpu_util_percent | bigint | | [0, 100] | | mem_util_percent | bigint | | [0, 100] | | net_in | double | | normarlized in coming network traffic, [0, 100] | | net_out | double | | normarlized out going network traffic, [0, 100] | | disk_io_percent | double | | [0, 100], abnormal values are of -1 or 101 | +--------------------------------------------------------------------------------------------+

Three sampled datasets are found: average value of each column grouped every 10 seconds as original, and downsampled to 30 seconds and 300 seconds as well. Every column includes the average utilization of the whole data center.

Google 2019 instance usage

Processed from the original dataset and queried using Big Query. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

This repository dataset of instance usage includes the following columns:

+--------------------------------------------------------------------------------------------+ | Field | Type | Label | Comment | +--------------------------------------------------------------------------------------------+ | avg_cpu | double | | [0, 1] | | avg_mem | double | | [0, 1] | | avg_assigned_mem | double | | [0, 1] | | avg_cycles_per_instruction | double | | [0, _] | +--------------------------------------------------------------------------------------------+

One sampled dataset is found: average value of each column grouped every 300 seconds as original. Every column includes the average utilization of the whole data center.

Azure v2 virtual machine workload

Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

This repository dataset of instance usage includes the following columns:

+--------------------------------------------------------------------------------------------+ | Field | Type | Label | Comment | +--------------------------------------------------------------------------------------------+ | cpu_usage | double | | [0, _] | | assigned_mem | double | | [0, _] | +--------------------------------------------------------------------------------------------+

One sampled dataset is found: sum value of each column grouped every 300 seconds as original. For computing CPU_usage, we used core_count usage of each virtual machine. Every column includes the total consumption of the whole data center virtual machines. There is a version of each file including timestamp (from 0 to 2591700, in 300 seconds timestep), and other version without timestamp

Access Level
The dataset is freely accessible under an Open Access model. There are no restrictions for reuse, and it is licensed under [Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
h
ember
huggingface.co
Updated Jun 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SeokHee Chang (2025). ember [Dataset]. https://huggingface.co/datasets/cycloevan/ember
Explore at:
Dataset updated
Jun 19, 2025
Authors
SeokHee Chang
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
EMBER Dataset

EMBER (Elastic Malware Benchmark for Empowering Researchers) is an open dataset for training static PE malware machine learning models.

References

Paper: EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models GitHub: elastic/ember
i
Dataset of article: Synthetic Datasets Generator for Testing Information...
ieee-dataport.org
Updated Mar 13, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carlos Santos (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. https://ieee-dataport.org/open-access/dataset-article-synthetic-datasets-generator-testing-information-visualization-and
Explore at:
Dataset updated
Mar 13, 2020
Authors
Carlos Santos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.
LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical...
zenodo.org
bin, pdf +1
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthieu Gallet; Matthieu Gallet; Abdourrahmane Atto; Abdourrahmane Atto; Fatima Karbou; Fatima Karbou; Emmanuel Trouvé; Emmanuel Trouvé (2024). LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling [Dataset]. http://doi.org/10.5281/zenodo.10046730
Explore at:
text/x-python, bin, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10046730
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matthieu Gallet; Matthieu Gallet; Abdourrahmane Atto; Abdourrahmane Atto; Fatima Karbou; Fatima Karbou; Emmanuel Trouvé; Emmanuel Trouvé
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LSD4WSD V2.0
Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.
The aim of this dataset is to provide a basis for automatic learning to detect wet snow. It is based on Sentinel-1 SAR GRD satellite images acquired between August 2020 and August 2021 over the French Alps. The new version of this dataset is no longer simply restricted to a classification task, and provides a set of metadata for each sample.
Modification and improvements of the version 2.0.0 :
Number of massif: add 7 new massif to cover the all Sentinel-1 images (cf info.pdf).
Acquisition: add images of the descending pass in addition to those originally used in the ascending pass.
Sample: reduction in the size of the samples considered to 15 by 15 to facilitate evaluation at the central pixel.
Sample: increased density of extracted windows, with a distance of approximately 500 meters between the centers of the windows.
Sample: removal of the pre-processing involving the use of logarithms.
Sample: removal of the pre-processing involving the normalisation.
Labels: new structure for the labels part: dictionary with keys: topography, metadata and physics.
Labels: physics: addition of direct information from the CROCUS model for 3 simulations: Liquid Water Content, snow height and minimum snowpack temperature.
Labels: topography: information on the slope, altitude and average orientation of the sample.
Labels: metadata : information on the date of the sample, the mountain massif and the run (ascending or descending).
Dataset: removal of the train/test split*
We leave it up to the user to use the Group Kfold method to validate the models using the alpine massif information.
Finally, it consists of 2467516 samples of size 15 by 15 by 9. For each sample, the 9 metadata are provided, using in particular the Crocus physical model:
topography:
elevation (meters) (average),
orientation (degrees) (average),
slope (degrees) (average),
metadata:
name of the alpine massif,
date of acquisition,
type of acquisition (ascending/descending),
physics
Liquid Water Content (km/m2),
snow height (m),
minimum snowpack temperature (Celsius degree).
The 9 channels are in the following order:
Sentinel-1 polarimetric channels: VV, VH and the combination C: VV/VH in linear,
Topographical features: altitude, orientation, slope
Polarimetric ratio with a reference summer image: VV/VVref, VH/VHref, C/Cref
* The reference image selected is that of August 9th 2020, as a reference image without snow (cf. Nagler&al)
An overview of the distribution and a summary of the sample statistics can be found in the file info.pdf.
The data is stored in .hdf5 format with gzip compression. We provide a python script to read and request the data. The script is dataset_load.py. It is based on the h5py, numpy and pandas libraries. It allows to select a part or the whole dataset using requests on the metadata. The script is documented and can be used as described in the README.md file
The processing chain is available at the following Github address.
The authors would like to acknowledge the support from the National Centre for Space Studies (CNES) in providing computing facilities and access to SAR images via the PEPS platform.
The authors would like to deeply thank Mathieu Fructus for running the Crocus simulations.
Erratum :
In the dataloader file, the name of the "aquisition" column must be added twice, see the correction below.:
dtst_ld = Dataset_loader(path_dataset,shuffle=False,descrp=["date","massif","aquisition","aquisition","elevation","slope","orientation","tmin","hsnow","tel",],)
If you have any comments, questions or suggestions, please contact the authors:
matthieu.gallet@univ-smb.fr
fatima.karbou@meteo.fr
abdourrahmane.atto@univ-smb.fr
emmanuel.trouve@univ-smb.fr
Machine learning code and best models.
plos.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Apr 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qingxin Yang; Li Luo; Zhangpeng Lin; Wei Wen; Wenbo Zeng; Hong Deng (2024). Machine learning code and best models. [Dataset]. http://doi.org/10.1371/journal.pone.0300662.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0300662.s002
Dataset updated
Apr 17, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Qingxin Yang; Li Luo; Zhangpeng Lin; Wei Wen; Wenbo Zeng; Hong Deng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
They are available at https://github.com/nerdyqx/ML. (ZIP)
D
AI Training Dataset Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI Training Dataset Market Outlook

The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.

One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.

Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.

The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.

As the demand for AI applications continues to grow, the role of Ai Data Resource Service becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging Ai Data Resource Service, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.

Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.

Data Type Analysis

The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.

Image data is critical for computer vision application
m
KU-HAR: An Open Dataset for Human Activity Recognition
data.mendeley.com
Updated Feb 16, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdullah-Al Nahid (2021). KU-HAR: An Open Dataset for Human Activity Recognition [Dataset]. http://doi.org/10.17632/45f952y38r.5
Explore at:
Unique identifier
https://doi.org/10.17632/45f952y38r.5
Dataset updated
Feb 16, 2021
Authors
Abdullah-Al Nahid
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
(Always use the latest version of the dataset. )

Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them. The activities are:

Stand➞ Standing still (1 min) Sit➞ Sitting still (1 min) Talk-sit➞ Talking with hand movements while sitting (1 min) Talk-stand➞ Talking with hand movements while standing or walking(1 min) Stand-sit➞ Repeatedly standing up and sitting down (5 times) Lay➞ Laying still (1 min) Lay-stand➞ Repeatedly standing up and laying down (5 times) Pick➞ Picking up an object from the floor (10 times) Jump➞ Jumping repeatedly (10 times) Push-up➞ Performing full push-ups (5 times) Sit-up➞ Performing sit-ups (5 times) Walk➞ Walking 20 meters (≈12 s) Walk-backward➞ Walking backward for 20 meters (≈20 s) Walk-circle➞ Walking along a circular path (≈ 20 s) Run➞ Running 20 meters (≈7 s) Stair-up➞ Ascending on a set of stairs (≈1 min) Stair-down➞ Descending from a set of stairs (≈50 s) Table-tennis➞ Playing table tennis (1 min)

Contents of the attached .zip files are: 1.Raw_time_domian_data.zip➞ Originally collected 1945 time-domain samples in separate .csv files. The arrangement of information in each .csv file is: Column 1, 5➞ exact time (elapsed since the start) when the Accelerometer & Gyro output was recorded (in ms) Col. 2, 3, 4➞ Acceleration along X,Y,Z axes (in m/s^2) Col. 6, 7, 8➞ Rate of rotation around X,Y,Z axes (in rad/s)

2.Trimmed_interpolated_raw_data.zip➞ Unnecessary parts of the samples were trimmed (only from the beginning and the end). The samples were interpolated to keep a constant sampling rate of 100 Hz. The arrangement of information is the same as above.

3.Time_domain_subsamples.zip➞ 20750 subsamples extracted from the 1945 collected samples provided in a single .csv file. Each of them contains 3 seconds of non-overlapping data of the corresponding activity. Arrangement of information: Col. 1–300, 301–600, 601–900➞ Acc.meter X, Y, Z axes readings Col. 901–1200, 1201–1500, 1501–1800➞ Gyro X, Y, Z axes readings Col. 1801➞ Class ID (0 to 17, in the order mentioned above) Col. 1802➞ length of the each channel data in the subsample Col. 1803➞ serial no. of the subsample

Gravity acceleration was omitted from the Acc.meter data, and no filter was applied to remove noise. The dataset is free to download, modify, and use.

More information is provided in the data paper which is currently under review: N. Sikder, A.-A. Nahid, KU-HAR: An open dataset for heterogeneous human activity recognition, Pattern Recognit. Lett. (submitted).

A preprint will be available soon.

Backup: drive.google.com/drive/folders/1yrG8pwq3XMlyEGYMnM-8xnrd6js0oXA7
m
Data from: Potrika: Raw and Balanced Newspaper Datasets in the Bangla...
data.mendeley.com
Updated Nov 7, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Istiak Ahmad (2022). Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes [Dataset]. http://doi.org/10.17632/v362rp78dc.4
Explore at:
Unique identifier
https://doi.org/10.17632/v362rp78dc.4
Dataset updated
Nov 7, 2022
Authors
Istiak Ahmad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in the machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology) providing five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences contained in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. By far, to the best of our knowledge, Potrika is the largest and the most extensive dataset for news classification.

Further details of the dataset, its collection, and usage can be found in our article here: https://doi.org/10.48550/arXiv.2210.09389.
Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning
figshare.com
bin
Updated May 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28729127.v2
Dataset updated
May 25, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Rikuto Kotoge
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are the high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals including The Cancer Genome Atlas (TCGA) multi-omics initiative or open-bases such as the LinkedOmics, these databases are not off-the-shelf for existing machine learning models. we propose MLOmics, an open cancer multi-omics database aiming at serving better the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking are also included to support interdisciplinary analysis.
i
Malware Analysis Datasets: Top-1000 PE Imports
ieee-dataport.org
Updated Nov 8, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
Explore at:
Dataset updated
Nov 8, 2019
Authors
Angelo Oliveira
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
d
Machine learning model that estimates public-supply deliveries for domestic...
catalog.data.gov
data.usgs.gov
+2more
Updated Oct 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Machine learning model that estimates public-supply deliveries for domestic and other use types [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-public-supply-deliveries-for-domestic-and-other-use-
Explore at:
Dataset updated
Oct 1, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This child item describes a public-supply delivery machine learning model that was developed to estimate public-supply deliveries. Publicly supplied water may be delivered to domestic users or to commercial, industrial, institutional, and irrigation (CII) users. This model predicts total, domestic, and CII per capita rates for public-supply water service areas within the conterminous United States for 2009-2020. This child item contains model input datasets, code used to build the delivery machine learning model, and national predictions. This dataset is part of a larger data release using machine learning to predict public-supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: delivery_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the delivery water use machine learning model
Zenodo Open Metadata snapshot - Training dataset for records and communities...
zenodo.org
data.niaid.nih.gov
application/gzip, bin
Updated Dec 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo team; Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. http://doi.org/10.5281/zenodo.7438358
Explore at:
bin, application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7438358
Dataset updated
Dec 15, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Zenodo team; Zenodo team
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.

Records dataset

Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

In addition, some terms have been altered:

The term files contains a list of dictionaries containing filetype, size, and filename only.

The term license contains a short Zenodo ID of the license (e.g. "cc-by").

Communities dataset

Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

Each object contains the terms: id, title, description, curation_policy, page

which correspond to the fields with the same name available in Zenodo's community creation form.

Notes for all datasets

For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.

Some values for the top-level terms, which were missing in the metadata may contain a null value.

A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
i
Big Data Machine Learning Benchmark on Spark
ieee-dataport.org
Updated Jun 6, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jairson Rodrigues (2019). Big Data Machine Learning Benchmark on Spark [Dataset]. https://ieee-dataport.org/open-access/big-data-machine-learning-benchmark-spark
Explore at:
Dataset updated
Jun 6, 2019
Authors
Jairson Rodrigues
Description
net traffic

Facebook

Twitter

Click to copy link

Link copied

Cite

(2024). SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1bea4d3c-ceef-5f63-89ed-80aeab18f601

SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND

Explore at:

Dataset updated

Jul 21, 2024

Description

SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information. The recommended way to work with the SYNERGY dataset is via the Python package "synergy-dataset". This flexible package downloads and builds the dataset.

Clear search

Close search

Google apps

Main menu

SYNERGY - Open machine learning dataset on study selection in systematic...

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...

Data from: Open Images

Context

Content

Querying BigQuery Tables

Acknowledgements

Inspiration

Data from: AppClassNet - A commercial-grade dataset for application...

A Dataset for Machine Learning Algorithm Development

BUTTER - Empirical Deep Learning Dataset

DataCenter-Traces-Datasets

Alibaba 2018 machine usage

Google 2019 instance usage

Azure v2 virtual machine workload

ember

Dataset of article: Synthetic Datasets Generator for Testing Information...

LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical...

Machine learning code and best models.

AI Training Dataset Market Report | Global Forecast From 2025 To 2033

AI Training Dataset Market Outlook

Data Type Analysis

KU-HAR: An Open Dataset for Human Activity Recognition

Data from: Potrika: Raw and Balanced Newspaper Datasets in the Bangla...

Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning

Malware Analysis Datasets: Top-1000 PE Imports

Machine learning model that estimates public-supply deliveries for domestic...

Zenodo Open Metadata snapshot - Training dataset for records and communities...

Big Data Machine Learning Benchmark on Spark

SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FINDSee More Versions

SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND