For newcomers to the opendata.maryland.gov site, gopi.data.socrata.com, and performance.maryland.gov, this page provides guidance and training on navigating these portals and on effectively using Data & Insights, the data management tool behind these sites.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Objective(s): Momentum for open access to research is growing. Funding agencies and publishers increasingly require that researchers make their data and research outputs open and publicly available. However, clinical researchers struggle to find real-world examples of Open Data sharing. The aim of this one-hour virtual workshop is to provide real-world examples of Open Data sharing for both qualitative and quantitative data. Specifically, participants will learn: 1. Primary challenges and successes when sharing quantitative and qualitative clinical research data. 2. Platforms available for open data sharing. 3. Ways to troubleshoot data sharing and publish from open data.
Workshop Agenda: 1. “Data sharing during the COVID-19 pandemic” - Speaker: Srinivas Murthy, Clinical Associate Professor, Department of Pediatrics, Faculty of Medicine, University of British Columbia; Investigator, BC Children's Hospital. 2. “Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project” - Speaker: Maggie Woo Kinshella, Global Health Research Coordinator, Department of Obstetrics and Gynaecology, BC Children’s and Women’s Hospital and University of British Columbia. This workshop draws on work supported by the Digital Research Alliance of Canada.
Data Description: Presentation slides, workshop video, and workshop communication. Srinivas Murthy: “Data sharing during the COVID-19 pandemic” presentation and accompanying PowerPoint slides. Maggie Woo Kinshella: “Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project” presentation and accompanying PowerPoint slides. This workshop was developed as part of Dr. Ansermino's Data Champions Pilot Project supported by the Digital Research Alliance of Canada.
NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days.
Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."
TRAINING DATASET: Hands-On Uploading Data (Download This File)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A recording of a one-hour training session conducted for the State of Hawaii by Socrata. It covers what Open Data is all about and how others have used Open Data to improve the relationship between government and citizens. It also explores the many features and uses of Open Data.
https://dataintelo.com/privacy-and-policy
The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
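As a quick sanity check on the headline figures, the implied compound annual growth rate can be recomputed from the 2023 and 2032 endpoints (a minimal sketch; nine growth years assumed):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# USD 1.2 billion in 2023 growing to USD 6.5 billion by 2032 (9 growth years)
rate = cagr(1.2, 6.5, 2032 - 2023)
print(f"{rate:.1%}")  # roughly 20.6%, consistent with the cited ~20.5% CAGR
```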
One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.
Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.
The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.
As the demand for AI applications continues to grow, the role of AI Data Resource Services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging AI Data Resource Services, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.
Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.
The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.
Image data is critical for computer vision applications.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages
The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes log-and-response pairs from 3 models and is designed for language-model instruction fine-tuning to achieve improved performance in various NLP tasks.
Models used for text generation:
GPT-3.5, GPT-4, and an uncensored GPT version (not included in the sample).
Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Objective(s): Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw datasets, in part due to concerns regarding research participant confidentiality and privacy. We provide an instructional video describing a standardized de-identification framework that can be adapted and refined based on specific context and risks. Data Description: Training video, presentation slides. Related Resources: The data de-identification algorithm, dataset, and data dictionary that correspond with this training video are available through the Smart Triage sub-Dataverse. NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."
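As a simple illustration of one common de-identification step (not the framework presented in the video), direct identifiers can be replaced with salted one-way hashes so records remain linkable within a study without exposing the original values. The field names and salt below are hypothetical:

```python
import hashlib

def deidentify(record, direct_ids=("name", "health_id"), salt="study-salt"):
    """Replace direct identifiers with truncated salted SHA-256 hashes.

    The same input always maps to the same pseudonym, so records stay
    linkable across files, while the original value is not recoverable
    without brute force. Indirect identifiers (dates, ages, locations)
    need separate treatment, which this sketch does not cover.
    """
    out = dict(record)
    for field in direct_ids:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest[:12]
    return out

anon = deidentify({"name": "Jane Doe", "health_id": "H123", "age_months": 14})
print(anon)  # name and health_id replaced; age_months untouched
```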
This is a PDF document created by the Department of Information Technology (DoIT) and the Governor's Office of Performance Improvement to assist in training Maryland state employees on the use of the Open Data Portal, https://opendata.maryland.gov. This document covers direct data entry, uploading Excel spreadsheets, connecting source databases, and transposing data. Please note that this tutorial is intended for use by state employees, as non-state users cannot upload datasets to the Open Data Portal.
For this new study, researchers looked for charginos decaying in three ways – via two W bosons (WW), a W boson and a Z boson (WZ), or a W boson and a Higgs boson (WH). These decay channels can all result in similar experimental signatures with one lepton. Researchers looked for unique collision-event signatures with isolated leptons, missing momentum, and large-radius jets (or b-jets in the WH case). They applied improved cut-and-count strategies in the WW/WZ cases, and revised the previous cut-and-count WH analysis with new machine-learning techniques. Using Boosted Decision Trees (BDTs), researchers were able to enhance signal identification in scenarios where the chargino and next-to-lightest neutralino decays were mediated by a Higgs boson, or when their mass difference closely aligns with the mass of the Higgs boson itself.
Researchers utilised this open dataset for training the analysis BDTs, making it readily available for subsequent advanced theoretical or machine learning investigations. The dataset is organised into 16 folders, each containing root files derived from Monte Carlo (MC) simulations. These files encompass both object-level and event-level variables, incorporating their associated systematic uncertainties.
Within these folders, 14 pertain to Standard Model background samples, with three major contributors being Single Top, ttbar, and W jets. The remaining two folders house signal samples and theory uncertainties for all MC-generated events. Each file is enriched with additional variables representing BDT scores for both Signal and Backgrounds.
Adopting a one-vs-all strategy, separate BDTs undergo individual training, reweighting, and optimisation tailored to specific classifications. The resultant scores conform to a comprehensive classification framework, providing sample-targeted independent probabilities spanning from 0 to 1 for all noteworthy Background and Signal categories. These scores serve as benchmarks for evaluating other cutting-edge models, such as Graph Neural Networks (GNNs), in the ongoing exploration of competitive state-of-the-art methodologies.
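The one-vs-all scoring scheme can be sketched with off-the-shelf gradient-boosted trees. This is a minimal illustration on synthetic data (the feature count, class count, and hyperparameters are placeholders, not the analysis configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for event-level features; the four classes mimic
# "Signal" plus a few background categories (e.g. ttbar, W+jets).
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=6, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# One BDT per class: each is trained to separate its class from all the
# rest, yielding an independent score per category for every event.
clf = OneVsRestClassifier(GradientBoostingClassifier(n_estimators=50))
clf.fit(X, y)

scores = clf.predict_proba(X[:5])  # shape (5, 4): one column per category
print(scores.shape)
```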
The dataset contains a total of 12,380,322 events, of which more than 6 million are ttbar events, 463,056 events are in MC-generated Signal samples, and 23,251,217 events are in theory samples.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This training dataset includes optical network topologies generated via the SNR-BA method [1], with nodes scattered uniformly at random over a grid the size of the North American continent, subject to a minimum separation of 100 km between nodes. Network sizes range from 55 to 100 nodes. Routings are computed under uniform bandwidth conditions with the first-fit k-shortest-path (FF-kSP) algorithm and sequential loading (SL) until the maximum state of the network is reached at zero blocking. The Gaussian noise (GN) model is used to calculate the signal-to-noise ratio of paths and the total throughput of the network; this throughput is given as the training label. [1] R. Matzner, D. Semrau, R. Luo, G. Zervas, and P. Bayvel, ‘Making intelligent topology design choices: understanding structural and physical property performance implications in optical networks [Invited]’, J. Opt. Commun. Netw., JOCN, vol. 13, no. 8, pp. D53–D67, Aug. 2021, doi: 10.1364/JOCN.423490.
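The FF-kSP assignment step can be sketched as follows. This is a simplified illustration (toy topology, invented channel bookkeeping, and no GN-model SNR check), not the code used to build the dataset:

```python
import itertools
import networkx as nx

def ff_ksp_assign(graph, src, dst, k, n_channels, used):
    """First-fit k-shortest-path: try the k shortest paths in order and
    take the lowest-index channel that is free on every link of a path.
    `used` maps each undirected link to the set of occupied channels."""
    paths = itertools.islice(nx.shortest_simple_paths(graph, src, dst), k)
    for path in paths:
        links = [frozenset(edge) for edge in zip(path, path[1:])]
        for ch in range(n_channels):  # first fit: lowest channel index
            if all(ch not in used.setdefault(link, set()) for link in links):
                for link in links:
                    used[link].add(ch)
                return path, ch
    return None  # demand is blocked on all k candidate paths

# Toy 4-node ring; route one demand from node 0 to node 2.
G = nx.cycle_graph(4)
used = {}
res = ff_ksp_assign(G, 0, 2, k=2, n_channels=4, used=used)
print(res)
```

Sequential loading would repeat this call over a stream of demands until the first block occurs; in the dataset above, the GN model then converts the loaded state into an SNR-based throughput label.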
https://www.archivemarketresearch.com/privacy-policy
The U.S. AI Training Dataset Market size was valued at USD 590.4 million in 2023 and is projected to reach USD 1,880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market deals with the generation, selection, and organization of datasets used to train artificial intelligence. These datasets contain the information that machine learning algorithms need to learn and infer from. Activities include the advancement and improvement of AI solutions across business fields such as transport, medical analysis, language processing, and financial metrics. Applications include training models for tasks such as image classification, predictive modeling, and natural-language interfaces. Emerging trends include a shift toward higher-quality, more diverse, and better-annotated data to improve model efficiency, synthetic data generation to address data shortages, and growing attention to data confidentiality and ethical issues in dataset management. Furthermore, as AI and machine-learning technologies advance, there is noticeable growth in building and using such datasets. Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that gives Google real-time access to Reddit's data and applies Google AI to enhance Reddit's search capabilities. In February 2024, Microsoft announced an investment of around USD 2.1 billion in Mistral AI to expedite the growth and deployment of large language models; Microsoft is expected to underpin Mistral AI with Azure AI supercomputing infrastructure to provide top-notch scale and performance for AI training and inference workloads.
Bigcode PII Training Dataset
Dataset Description
This is the dataset used for training the bigcode-pii-model (after training on pseudo-labeled data). It is a concatenation of an early version of bigcode-pii-dataset, which had fewer samples, and pii-for-code (a dataset with 400 files we annotated in a previous iteration: MORE INFO TO BE ADDED). Files with AMBIGUOUS and ID annotations were excluded. Each PII subtype was remapped to its supertype.
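The subtype-to-supertype remapping can be sketched as a lookup table. The subtype names below are hypothetical stand-ins, not the actual bigcode label taxonomy, and for illustration the AMBIGUOUS/ID exclusion is applied per entity rather than per file:

```python
# Hypothetical subtype -> supertype map; the real label set may differ.
SUPERTYPE = {
    "EMAIL_PERSONAL": "EMAIL",
    "EMAIL_WORK": "EMAIL",
    "KEY_API": "KEY",
    "KEY_SSH": "KEY",
}

def remap(entities):
    """Collapse each entity's subtype tag to its supertype and drop the
    excluded categories (AMBIGUOUS, ID). Unknown tags pass through."""
    kept = []
    for ent in entities:
        tag = ent["tag"]
        if tag in ("AMBIGUOUS", "ID"):
            continue
        kept.append({**ent, "tag": SUPERTYPE.get(tag, tag)})
    return kept

sample = [{"text": "a@b.co", "tag": "EMAIL_WORK"}, {"text": "x", "tag": "ID"}]
print(remap(sample))  # [{'text': 'a@b.co', 'tag': 'EMAIL'}]
```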
Statistics
The dataset… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training.
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
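A log-likelihood score against a panel of expert labels can be sketched as below. This is an illustrative reconstruction under assumed conventions (one vote per expert per sample, model probabilities per class), not the paper's exact method:

```python
import math

def log_likelihood(pred_probs, expert_labels):
    """Sum the log-probability the model assigns to every expert's vote.

    A model that matches the majority label still pays a penalty for each
    dissenting expert, so the score reflects both consensus and variance
    rather than treating a single label as absolute ground truth."""
    total = 0.0
    for labels, probs in zip(expert_labels, pred_probs):
        for lab in labels:  # one vote per expert for this sample
            total += math.log(max(probs[lab], 1e-12))  # clamp avoids log(0)
    return total

# Two samples, three expert votes each; model outputs P(class) per sample.
experts = [[0, 0, 1], [1, 1, 1]]
preds = [{0: 0.7, 1: 0.3}, {0: 0.1, 1: 0.9}]
ll = log_likelihood(preds, experts)
print(ll)  # higher (less negative) is better
```

Ranking competing phase-mapping models then reduces to comparing their totals on the same expert-label set.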
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/ZDNOGF
The “Training and development dataset for information extraction in plant epidemiomonitoring” is the annotation set of the “Corpus for the epidemiomonitoring of plant”. The annotations include seven entity types (e.g. species, locations, diseases), their normalisation via the NCBI taxonomy and GeoNames, and binary (seven types) and ternary relationships. The annotations refer to character positions within the documents of the corpus. The annotation guidelines give their definitions and representative examples. Both datasets are intended for the training and validation of information extraction methods.
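Character-position annotations of this kind can be represented minimally as offsets into the raw text. The sentence, entity types, and offsets below are illustrative, not taken from the corpus:

```python
# Each annotation points into the raw document text by character span,
# so the entity surface form is recovered by slicing, never stored twice.
doc = "Xylella fastidiosa was detected in Corsica."
annotations = [
    {"type": "species", "start": 0, "end": 18},
    {"type": "location", "start": 35, "end": 42},
]

extracted = [doc[a["start"]:a["end"]] for a in annotations]
for ann, surface in zip(annotations, extracted):
    print(ann["type"], "->", surface)
```

Keeping only offsets (rather than copied strings) is what lets the annotation set ship separately from the corpus, as the two datasets here do.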
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
https://www.archivemarketresearch.com/privacy-policy
The AI Training Dataset Market size was valued at USD 2,124.0 million in 2023 and is projected to reach USD 8,593.38 million by 2032, exhibiting a CAGR of 22.1% during the forecast period. An AI training dataset is a collection of data used to train machine learning models. It typically includes labeled examples, where each data point has an associated output label or target value. The quality and quantity of this data are crucial for the model's performance. A well-curated dataset ensures the model learns relevant features and patterns, enabling it to generalize effectively to new, unseen data. Training datasets can encompass various data types, including text, images, audio, and structured data. The driving forces behind this growth include:
TRAINING DATASET: Hands-On Formatting Data Part 1 (Download This File)
This dataset, released by DoD, contains geographic information for major installations, ranges, and training areas in the United States and its territories. This release integrates site information about DoD installations, training ranges, and land assets in a format which can be immediately put to work in commercial geospatial information systems. Homeland Security/Homeland Defense, law enforcement, and readiness planners will benefit from immediate access to DoD site location data during emergencies. Land use planning and renewable energy planning will also benefit from use of this data. Users are advised that the point and boundary location datasets are intended for planning purposes only, and do not represent the legal or surveyed land parcel boundaries.