https://choosealicense.com/licenses/other/
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated with GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
The training samples of the entire year (from yr-2 of the simulation) are compressed in SPCAM_ML_Han_et_al_0.tar.gz, and testing samples of the entire year (from yr-3 of the simulation) are compressed in SPCAM_ML_Han_et_al_1.tar.gz. Each dataset contains a data documentation file and 365 netCDF data files (one file for each day), each marked by its date. The variable fields contain temperature and moisture tendencies and cloud water and cloud ice from the CRM, and vertical profiles of temperature and moisture and large-scale temperature and moisture tendencies from the dynamic core of SPCAM's host model CAM5 and from PBL diffusion. In addition, we include surface sensible and latent heat fluxes. For more details, please read the data documentation inside the tar.gz files.
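For users new to these archives, a minimal Python sketch for inspecting one daily netCDF file is shown below. The file name and the commented variable name are placeholders; the actual names are given in the data documentation file inside each tar.gz archive.

```python
# Minimal sketch for inspecting one daily netCDF file from the archive.
# The file name and variable names below are placeholders; the actual names
# are documented in the data documentation file inside each tar.gz archive.
import xarray as xr

ds = xr.open_dataset("SPCAM_ML_Han_et_al_0/20090101.nc")  # hypothetical file name

# List every variable and its dimensions to locate the tendency and flux fields.
for name, var in ds.data_vars.items():
    print(name, var.dims, var.shape)

# Example: pull a profile variable once its actual name is known.
# temperature_tendency = ds["T_TEND_CRM"]  # placeholder variable name
```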
Extracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious process involving manual human effort. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training and validation data from the competition are provided here, as well as competition details and baseline solutions. The data are derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.
https://www.archivemarketresearch.com/privacy-policy
The global Data Labeling Solution and Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of $70 billion by 2033. This significant expansion is fueled by the burgeoning need for high-quality training data to enhance the accuracy and performance of AI models. Key growth drivers include the expanding application of AI in various industries like automotive (autonomous vehicles), healthcare (medical image analysis), and financial services (fraud detection). The increasing availability of diverse data types (text, image/video, audio) further contributes to market growth. However, challenges such as the high cost of data labeling, data privacy concerns, and the need for skilled professionals to manage and execute labeling projects pose certain restraints on market expansion. Segmentation by application (automotive, government, healthcare, financial services, others) and data type (text, image/video, audio) reveals distinct growth trajectories within the market. The automotive and healthcare sectors currently dominate, but the government and financial services segments are showing promising growth potential.
The competitive landscape is marked by a mix of established players and emerging startups. Companies like Amazon Mechanical Turk, Appen, and Labelbox are leading the market, leveraging their expertise in crowdsourcing, automation, and specialized data labeling solutions. However, the market shows strong potential for innovation, particularly in the development of automated data labeling tools and the expansion of services into niche areas.
Regional analysis indicates strong market penetration in North America and Europe, driven by early adoption of AI technologies and robust research and development efforts. However, Asia-Pacific is expected to witness significant growth in the coming years, fueled by rapid technological advancements and a rising demand for AI solutions. Further investment in R&D focused on automation, improved data security, and the development of more effective data labeling methodologies will be crucial for unlocking the full potential of this rapidly expanding market.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". Source data contains CSV files with dataset results summaries, false positives lists, the evaluated sentences, and their keystroke timings. Results data contains training and evaluation ARFF files for each user and sentence with the calculated Manhattan and Euclidean distance, R metric, and the directionality index for each challenge instance. The source data comes from three free-text keystroke dynamics datasets used in previous studies by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY. The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y. We proposed a method to determine, using only flight times (keydown/keydown intervals), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased two- to three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive with current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by falling back on general population data.
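As a minimal illustration of the distance features listed above, the sketch below computes the Manhattan and Euclidean distances between a challenge sample of flight times and a reference profile. The numeric values are illustrative and not taken from the dataset, and the article's R metric and directionality index are not reproduced here.

```python
# Minimal sketch of two of the distance features mentioned above, computed
# between a challenge sample of flight times (keydown/keydown intervals) and
# a reference profile. Array contents are illustrative, not from the dataset.
import numpy as np

reference = np.array([0.21, 0.35, 0.18, 0.27])   # mean flight times per digraph (s)
challenge = np.array([0.25, 0.31, 0.22, 0.30])   # flight times of the text to verify

manhattan = np.sum(np.abs(challenge - reference))
euclidean = np.sqrt(np.sum((challenge - reference) ** 2))
print(f"Manhattan={manhattan:.3f}  Euclidean={euclidean:.3f}")
```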
Remotely sensed imagery is increasingly used by emergency managers to monitor and map the impact of flood events to support preparedness, response, and critical decision making throughout the flood event lifecycle. To reduce latency in delivery of imagery-derived information, ensure consistent and reliably derived map products, and facilitate processing of an increasing volume of remote sensing data-streams, automated flood mapping workflows are needed. The U.S. Geological Survey is facilitating the development and integration of machine-learning algorithms in collaboration with NASA, the National Geospatial-Intelligence Agency (NGA), the University of Alabama, and the University of Illinois to create a workflow for rapidly generating improved flood-map products. A major bottleneck to the training of robust, generalizable machine learning algorithms for pattern recognition is a lack of training data that is representative across the landscape. To overcome this limitation for the training of algorithms capable of detecting surface inundation in diverse contexts, this publication includes data developed from MAXAR WorldView sensors that serve as training data for machine learning. This data release consists of 100 thematic rasters, in GeoTiff format, with image labels representing five discrete categories: water, not water, maybe water, clouds, and background/no data. Specifically, these training data were created by labeling 8-band, multispectral scenes from the MAXAR-DigitalGlobe WorldView-2 and WorldView-3 satellite-based sensors. Scenes were selected to be spatially and spectrally diverse and geographically representative of different water features within the continental U.S. The labeling procedures used a hybrid approach of unsupervised classification for the initial spectral clustering, followed by expert-level manual interpretation and QA/QC peer review to finalize each labeled image. Updated versions of the data may be issued along with version update documentation. The 100 raster files that make up the training data are available to download here (https://doi.org/10.5066/P9C7HYRV).
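A minimal sketch for tallying the five thematic classes in one of the training rasters is shown below; the file name and the integer code assigned to each class are assumptions, so consult the release metadata for the actual values.

```python
# Minimal sketch for tallying the five thematic classes in one training raster.
# The file name and the integer codes assigned to each class are assumptions;
# consult the data release metadata for the actual values.
import numpy as np
import rasterio

with rasterio.open("training_raster_001.tif") as src:   # hypothetical file name
    labels = src.read(1)                                 # single thematic band

values, counts = np.unique(labels, return_counts=True)
for value, count in zip(values, counts):
    print(f"class value {value}: {count} pixels")
# Expected classes: water, not water, maybe water, clouds, background/no data.
```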
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).
This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data were randomly picked from the whole dataset (2000 to 2020), with 70% of the data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data, while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
This sample dataset also includes files for metadata, static data, normalization, and plotting.
To use the data, clone the corresponding repository and unzip this zip file in the data folder.
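A minimal sketch of that setup step is shown below, assuming the archive has been downloaded next to a local clone of the repository; the archive and repository names are placeholders.

```python
# Minimal sketch of the suggested setup step. The archive name and the
# repository directory are placeholders; use the actual names from the
# download and the corresponding repository.
import zipfile

with zipfile.ZipFile("sample_dataset.zip") as archive:   # hypothetical archive name
    archive.extractall("downscaling-repo/data")          # hypothetical repo data folder
```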
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains a snapshot of the learning resource metadata from ESIP's Data Management Training Clearinghouse (DMTC) associated with the closeout (March 30, 2023) of the Institute of Museum and Library Services-funded (Award Number: LG-70-18-0092-18) Development of an Enhanced and Expanded Data Management Training Clearinghouse project. The shared metadata are a snapshot associated with the final reporting date for the project, and the associated data report is also based upon the same data snapshot on the same date.
The materials included in the collection consist of the following:
esip-dev-02.edacnm.org.json.zip - a zip archive containing the metadata for 587 published learning resources as of March 30, 2023. These metadata include all publicly available metadata elements for the published learning resources with the exception of the metadata elements containing individual email addresses (submitter and contact) to reduce the exposure of these data.
statistics.pdf - an automatically generated report summarizing information about the collection of materials in the DMTC Clearinghouse, including both published and unpublished learning resources. This report includes the numbers of published and unpublished resources through time; the number of learning resources within subject categories and detailed subject categories, the dates items assigned to each category were first added to the Clearinghouse, and the most recent date that items were added to each category; the distribution of learning resources across target audiences; and the frequency of keywords within the learning resource collection. This report is based on the metadata for published resources included in this collection, and preliminary metadata for unpublished learning resources that are not included in the shared dataset.
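A minimal sketch for reading the learning-resource metadata out of the shared zip archive is shown below. Whether the archive holds a single JSON document or one JSON file per resource is an assumption, so adjust after listing the archive contents.

```python
# Minimal sketch for reading the learning-resource metadata out of the shared
# zip archive. Whether the archive holds one JSON document or one file per
# resource is an assumption; adjust after listing the archive contents.
import json
import zipfile

with zipfile.ZipFile("esip-dev-02.edacnm.org.json.zip") as archive:
    records = []
    for member in archive.namelist():
        if member.endswith(".json"):
            with archive.open(member) as fh:
                loaded = json.load(fh)
                # A member may hold a single record or a list of records.
                records.extend(loaded if isinstance(loaded, list) else [loaded])

print(f"{len(records)} learning-resource records")
# Inspect a few of the fields described below, e.g. title, lr_type, subject.
for record in records[:5]:
    print(record.get("title"), "|", record.get("lr_type"), "|", record.get("subject"))
```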
The metadata fields consist of the following:
Fieldname
Description
abstract_data
A brief synopsis or abstract about the learning resource
abstract_format
Declaration for how the abstract description will be represented.
access_conditions
Conditions upon which the resource can be accessed beyond cost, e.g., login required.
access_cost
Yes or No choice stating whether there is a fee for access to or use of the resource.
accessibililty_features_name
Content features of the resource, such as accessible media, alternatives and supported enhancements for accessibility.
accessibililty_summary
A human-readable summary of specific accessibility features or deficiencies.
author_names
List of authors for a resource derived from the given/first and family/last names of the personal author fields by the system
author_org
- name: Name of the organization authoring the learning resource.
- name_identifier: The unique identifier for the organization authoring the resource.
- name_identifier_type: The identifier scheme associated with the unique identifier for the organization authoring the resource.
authors
- givenName: Given or first name of the person(s) authoring the resource.
- familyName: Last or family name of the person(s) authoring the resource.
- name_identifier: The unique identifier for the person(s) authoring the resource.
- name_identifier_type: The identifier scheme associated with the unique identifier for the person(s) authoring the resource, e.g., ORCID.
citation
Preferred Form of Citation.
completion_time
Intended Time to Complete
contact
- name: Name of the person(s) asserted as the contact(s) for the resource in case of questions or follow-up by resource users.
- org: Name of the organization asserted as the contact for the resource in case of questions or follow-up by resource users.
- email: (excluded) Contact email address.
contributor_orgs
- name: Name of an organization that is a secondary contributor to the learning resource. A contributor can also be an individual person.
- name_identifier: The unique identifier for the organization contributing to the resource.
- name_identifier_type: The identifier scheme associated with the unique identifier for the organization contributing to the resource.
- type: Type of contribution to the resource made by an organization.
contributors
- familyName
- givenName
- name_identifier
- name_identifier_type
contributors.type
Type of contribution to the resource made by a person.
created
The date on which the metadata record was first saved as part of the input workflow.
creator
The name of the person creating the MD record for a resource.
credential_status
Declaration of whether a credential is offered for completion of the resource.
ed_frameworks
- name: The name of the educational framework to which the resource is aligned, if any. An educational framework is a structured description of educational concepts such as a shared curriculum, syllabus or set of learning objectives, or a vocabulary for describing some other aspect of education such as educational levels or reading ability.
- description: A description of one or more subcategories of an educational framework to which a resource is associated.
- nodes.name: The name of a subcategory of an educational framework to which a resource is associated.
expertise_level
The skill level targeted for the topic being taught.
id
Unique identifier for the MD record generated by the system in UUID format.
keywords
Important phrases or words used to describe the resource.
language_primary
Original language in which the learning resource being described is published or made available.
languages_secondary
Additional languages in which the resource is translated or made available, if any.
license
A license for use that applies to the resource, typically indicated by URL.
locator_data
The identifier for the learning resource used as part of a citation, if available.
locator_type
Designation of citation locator type, e.g., DOI, ARK, Handle.
lr_outcomes
Descriptions of what knowledge, skills or abilities students should learn from the resource.
lr_type
A characteristic that describes the predominant type or kind of learning resource.
media_type
Media type of resource.
modification_date
System generated date and time when MD record is modified.
notes
MD Record Input Notes
pub_status
Status of metadata record within the system, i.e., in-process, in-review, pre-pub-review, deprecate-request, deprecated or published.
published
Date of first broadcast / publication.
publisher
The organization credited with publishing or broadcasting the resource.
purpose
The purpose of the resource in the context of education; e.g., instruction, professional education, assessment.
rating
The aggregation of input from all user assessments evaluating users' reaction to the learning resource following Kirkpatrick's model of training evaluation.
ratings
Inputs from users assessing each user's reaction to the learning resource following Kirkpatrick's model of training evaluation.
resource_modification_date
Date on which the resource was last modified from the original published or broadcast version.
status
System generated publication status of the resource w/in the registry as a yes for published or no for not published.
subject
Subject domain(s) toward which the resource is targeted. There may be more than one value for this field.
submitter_email
(excluded) Email address of person who submitted the resource.
submitter_name
Submission Contact Person
target_audience
Audience(s) for which the resource is intended.
title
The name of the resource.
url
URL that resolves to a downloadable version of the learning resource or to a landing page for the resource that contains important contextual information including the direct resolvable link to the resource, if applicable.
usage_info
Descriptive information about using the resource, not addressed by the License information field.
version
The specific version of the resource, if declared.
This table provides information on, among other things, participation in and expenditure on courses, the training policies of companies, and the quality assurance of the business training provided. The table also sets out the reasons why some companies did not offer company training.
The data come from the Business Training Survey and relate to the reporting years 2010 and 2015. The survey was conducted among companies with 10 or more employees in the private sector. The sectors Public administration and compulsory social insurance, Education, and Health and Welfare were excluded, and companies in the Agriculture, Forestry and Fisheries sector were also not included in the survey. The figures can be broken down by activity of the companies (SBI 2008) and by size class of the company.
From 2015 onwards, a number of characteristics of company training are no longer surveyed: whether the company has its own training centre, whether the training needs of individual employees are identified, and the different ways in which the quality of company training is ensured.
A number of features are not fully comparable between 2010 and 2015. These are the key future skills, the skills covered within current courses, and the implementing agencies of external courses. This is because, for these questionnaire items (where several answers are possible), it is no longer possible from 2015 onwards to indicate a single most important choice. As a result, the percentages may sum to more than 100 percent in 2015.
Data available from: 2010 to 2015
Status of the figures: The figures in this table are final.
Changes as of 8 December 2017: The figures for the sectors C Industry, B-E Industry (no construction) and energy, and B-F Industry and energy for the reporting year 2015 have been adjusted. Erroneously, these aggregates did not include the figures of the underlying industry 31-33 Other industry and repair. This has been corrected in this version.
Changes as of 6 December 2017: Figures for 2015 have been added. The subjects on the amounts spent per male and per female worker, respectively, have been removed from the table; the data were insufficiently reliable.
When are new figures coming? The new figures for 2020 are expected in 2022.
https://www.archivemarketresearch.com/privacy-policy
The AI data labeling solutions market is experiencing robust growth, driven by the increasing demand for high-quality data to train and improve the accuracy of artificial intelligence algorithms. The market size in 2025 is estimated at $5 billion, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of AI applications across diverse sectors, including automotive, healthcare, and finance, necessitates vast amounts of labeled data. Cloud-based solutions are gaining prominence due to their scalability, cost-effectiveness, and accessibility. Furthermore, advancements in data annotation techniques and the emergence of specialized AI data labeling platforms are contributing to market expansion. However, challenges such as data privacy concerns, the need for highly skilled professionals, and the complexities of handling diverse data formats continue to restrain market growth to some extent.
The market segmentation reveals that the cloud-based solutions segment is expected to dominate due to its inherent advantages over on-premise solutions. In terms of application, the automotive sector is projected to exhibit the fastest growth, driven by the increasing adoption of autonomous driving technology and advanced driver-assistance systems (ADAS). The healthcare industry is also a major contributor, with the rise of AI-powered diagnostic tools and personalized medicine driving demand for accurate medical image and data labeling.
Geographically, North America currently holds a significant market share, but the Asia-Pacific region is poised for rapid growth owing to increasing investments in AI and technological advancements. The competitive landscape is marked by a diverse range of established players and emerging startups, fostering innovation and competition within the market. The continued evolution of AI and its integration across various industries ensures the continued expansion of the AI data labeling solution market in the coming years.
DPH note about change from 7-day to 14-day metrics: As of 10/15/2020, this dataset is no longer being updated. Starting on 10/15/2020, the school learning model indicator metrics will be calculated using a 14-day average rather than a 7-day average. The new school learning model indicators dataset using 14-day averages can be accessed here: https://data.ct.gov/Health-and-Human-Services/CT-School-Learning-Model-Indicators-by-County-14-d/e4bh-ax24
As you know, we are learning more about COVID-19 all the time, including the best ways to measure COVID-19 activity in our communities. CT DPH has decided to shift to 14-day rates because these are more stable, particularly at the town level, as compared to 7-day rates. In addition, since the school indicators were initially published by DPH last summer, CDC has recommended 14-day rates, and other states (e.g., Massachusetts) have started to implement 14-day metrics for monitoring COVID transmission as well. With respect to geography, we have also learned that many people are looking at the town-level data to inform decision making, despite the emphasis on the county-level metrics in the published addenda. This is understandable, as there has been variation within counties in COVID-19 activity (for example, rates that are higher in one town than in most other towns in the county).
This dataset includes the leading and secondary metrics identified by the Connecticut Department of Public Health (DPH) and the Connecticut State Department of Education (CSDE) to support local district decision-making on the level of in-person, hybrid (blended), and remote learning model for PreK-12 education. Data represent daily averages for each week by date of specimen collection (cases and positivity), date of hospital admission, or date of ED visit. Hospitalization data come from the Connecticut Hospital Association and are based on hospital location, not county of patient residence. COVID-19-like illness includes fever and cough or shortness of breath or difficulty breathing or the presence of a coronavirus diagnosis code, and excludes patients with influenza-like illness. All data are preliminary.
These data are updated weekly; the previous week period for each dataset is the previous Sunday-Saturday, known as an MMWR week (https://wwwn.cdc.gov/nndss/document/MMWR_week_overview.pdf). The date listed is the date the dataset was last updated and corresponds to a reporting period of the previous MMWR week. For instance, the data for 8/20/2020 correspond to a reporting period of 8/9/2020-8/15/2020.
These metrics were adapted from recommendations by the Harvard Global Health Institute and supplemented by existing DPH measures. For national data on COVID-19, see COVID View, the national weekly surveillance summary of U.S. COVID-19 activity, at https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html
Notes: 9/25/2020: Data for Mansfield and Middletown for the week of Sept 13-19 were unavailable at the time of reporting due to delays in lab reporting.
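As a minimal illustration of the change described in the DPH note, the sketch below computes a 14-day average daily case rate per 100,000 from daily counts; the counts and population figure are illustrative only and are not drawn from this dataset.

```python
# Minimal sketch of the 14-day averaging described above: a 14-day average of
# daily case counts, expressed as a rate per 100,000 residents. The counts and
# population figure are illustrative, not taken from the dataset.
import pandas as pd

daily_cases = pd.Series(
    [12, 9, 15, 11, 8, 14, 10, 13, 9, 16, 12, 11, 10, 15],
    index=pd.date_range("2020-10-01", periods=14, freq="D"),
)
population = 125_000  # town population (illustrative)

avg_daily_cases_14d = daily_cases.rolling(window=14).mean()
rate_per_100k = avg_daily_cases_14d / population * 100_000
print(rate_per_100k.dropna())
```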
Road Surfaces derived from imagery captured across Western Australia using a deep learning algorithm. The algorithm was created in-house at Landgate to demonstrate the capabilities of deep learning (DL) and artificial intelligence (AI) algorithms to approved State and Local Government entities.
This dataset features over 750,000 high-quality images of furniture sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of furniture imagery.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.
Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on flower photography ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements such as particular flower species or geographic regions to be met efficiently.
Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of flower species, colors, and environmental settings. The images feature varied contexts, including natural habitats, gardens, bouquets, and urban landscapes, providing an unparalleled level of diversity.
High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.
Popularity Scores Each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.
AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.
Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.
Use Cases 1. Training AI systems for plant recognition and classification. 2. Enhancing agricultural AI models for plant health assessment and species identification. 3. Building datasets for educational tools and augmented reality applications. 4. Supporting biodiversity and conservation research through AI-powered analysis.
This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
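As a minimal illustration of the remedy implied above, the sketch below drops duplicate SMILES strings before generating k-fold cross-validation splits, so that the same molecule cannot appear in both a training and a testing fold. The file and column names are assumptions about how a KEGG-SMILES table might be organized.

```python
# Minimal sketch: de-duplicate SMILES strings before k-fold cross-validation,
# so identical molecules cannot land in both a training and a testing fold.
# The file name and the "smiles" column name are assumptions.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("kegg_smiles_pathways.csv")          # hypothetical file name
before = len(df)
df = df.drop_duplicates(subset="smiles")              # hypothetical column name
print(f"removed {before - len(df)} duplicate entries")

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(df)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test molecules")
```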
This dataset features over 15,000,000 high-quality images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of imagery.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.
Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on flower photography ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements such as particular flower species or geographic regions to be met efficiently.
Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of flower species, colors, and environmental settings. The images feature varied contexts, including natural habitats, gardens, bouquets, and urban landscapes, providing an unparalleled level of diversity.
High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.
Popularity Scores Each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.
AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.
Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.
Use Cases 1. Training AI systems for plant recognition and classification. 2. Enhancing agricultural AI models for plant health assessment and species identification. 3. Building datasets for educational tools and augmented reality applications. 4. Supporting biodiversity and conservation research through AI-powered analysis.
This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!
https://spdx.org/licenses/CC0-1.0.html
For the purposes of training AI-based models to identify (map) road features in rural/remote tropical regions on the basis of true-colour satellite imagery, and subsequently testing the accuracy of these AI-derived road maps, we produced a dataset of 8904 satellite image ‘tiles’ and their corresponding known road features across Equatorial Asia (Indonesia, Malaysia, Papua New Guinea).
Methods
The main dataset shared here was derived from a set of 200 input satellite images, also provided here. These 200 images are effectively ‘screenshots’ (i.e., reduced-resolution copies) of high-resolution true-colour satellite imagery (~0.5-1m pixel resolution) observed using the Elvis Elevation and Depth spatial data portal (https://elevation.fsdf.org.au/), which here is functionally equivalent to the more familiar Google Earth. Each of these original images was initially acquired at a resolution of 1920x886 pixels. Actual image resolution was coarser than the native high-resolution imagery. Visual inspection of these 200 images suggests a pixel resolution of ~5 meters, given the number of pixels required to span features of familiar scale, such as roads and roofs, as well as the ready discrimination of specific land uses, vegetation types, etc. These 200 images generally spanned either forest-agricultural mosaics or intact forest landscapes with limited human intervention. Sloan et al. (2023) present a map indicating the various areas of Equatorial Asia from which these images were sourced.
IMAGE NAMING CONVENTION
A common naming convention applies to satellite images’ file names:
XX##.png
where:
XX – denotes the geographical region / major island of Equatorial Asia of the image, as follows: ‘bo’ (Borneo), ‘su’ (Sumatra), ‘sl’ (Sulawesi), ‘pn’ (Papua New Guinea), ‘jv’ (Java), ‘ng’ (New Guinea [i.e., Papua and West Papua provinces of Indonesia])
INTERPRETING ROAD FEATURES IN THE IMAGES For each of the 200 input satellite images, its road was visually interpreted and manually digitized to create a reference image dataset by which to train, validate, and test AI road-mapping models, as detailed in Sloan et al. (2023). The reference dataset of road features was digitized using the ‘pen tool’ in Adobe Photoshop. The pen’s ‘width’ was held constant over varying scales of observation (i.e., image ‘zoom’) during digitization. Consequently, at relatively small scales at least, digitized road features likely incorporate vegetation immediately bordering roads. The resultant binary (Road / Not Road) reference images were saved as PNG images with the same image dimensions as the original 200 images.
IMAGE TILES AND REFERENCE DATA FOR MODEL DEVELOPMENT
The 200 satellite images and the corresponding 200 road-reference images were both subdivided (aka ‘sliced’) into thousands of smaller image ‘tiles’ of 256x256 pixels each. Subsequent to image subdivision, subdivided images were also rotated by 90, 180, or 270 degrees to create additional, complementary image tiles for model development. In total, 8904 image tiles resulted from image subdivision and rotation. These 8904 image tiles are the main data of interest disseminated here. Each image tile entails the true-colour satellite image (256x256 pixels) and a corresponding binary road reference image (Road / Not Road).
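A minimal sketch of the subdivision and rotation step is shown below, slicing one 1920x886-pixel input image into 256x256-pixel tiles and generating rotated copies. Only full tiles are kept, and all three rotations are produced for every tile purely for illustration, which need not match the exact tiling and augmentation choices of Sloan et al. (2023).

```python
# Minimal sketch of the subdivision and rotation described above: slicing a
# 1920x886 input image into 256x256 tiles and generating rotated copies.
# Only full tiles are kept here; edge handling in the original workflow is
# not stated, so that detail is an assumption.
from PIL import Image

TILE = 256
image = Image.open("bo12.png")           # one of the 200 input satellite images
width, height = image.size

tiles = []
for top in range(0, height - TILE + 1, TILE):
    for left in range(0, width - TILE + 1, TILE):
        tile = image.crop((left, top, left + TILE, top + TILE))
        tiles.append(((left, top, 0), tile))
        for angle in (90, 180, 270):     # complementary rotated copies
            tiles.append(((left, top, angle), tile.rotate(angle)))

print(f"{len(tiles)} tiles (including rotations) from one input image")
```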
Of these 8904 image tiles, Sloan et al. (2023) randomly selected 80% for model training (during which a model ‘learns’ to recognize road features in the input imagery), 10% for model validation (during which model parameters are iteratively refined), and 10% for final model testing (during which the final accuracy of the output road map is assessed). Here we present these data in two folders accordingly:
'Training’ – contains 7124 image tiles used for model training in Sloan et al. (2023), i.e., 80% of the original pool of 8904 image tiles. ‘Testing’– contains 1780 image tiles used for model validation and model testing in Sloan et al. (2023), i.e., 20% of the original pool of 8904 image tiles, being the combined set of image tiles for model validation and testing in Sloan et al. (2023).
IMAGE TILE NAMING CONVENTION A common naming convention applies to image tiles’ directories and file names, in both the ‘training’ and ‘testing’ folders: XX##_A_B_C_DrotDDD where
XX – denotes the geographical region / major island of Equatorial Asia of the original input 1920x886 pixel image, as follows: ‘bo’ (Borneo), ‘su’ (Sumatra), ‘sl’ (Sulawesi), ‘pn’ (Papua New Guinea), ‘jv’ (Java), ‘ng’ (New Guinea [i.e., Papua and West Papua provinces of Indonesia])
A, B, C and D – can all be ignored. These values, which are one of 0, 256, 512, 768, 1024, 1280, 1536, and 1792, are effectively ‘pixel coordinates’ in the corresponding original 1920x886-pixel input image. They were recorded within the names of image tiles’ sub-directories and file names merely to ensure that file and directory names were unique.
rot – implies an image rotation. Not all image tiles are rotated, so ‘rot’ will appear only occasionally.
DDD – denotes the degree of image-tile rotation, e.g., 90, 180, 270. Not all image tiles are rotated, so ‘DDD’ will appear only occasionally.
Note that the designator ‘XX##’ is directly equivalent to the filenames of the corresponding 1920x886-pixel input satellite images, detailed above. Therefore, each image tile can be ‘matched’ with its parent full-scale satellite image. For example, in the ‘training’ folder, the subdirectory ‘Bo12_0_0_256_256’ indicates that its image tile therein (also named ‘Bo12_0_0_256_256’) would have been sourced from the full-scale image ‘Bo12.png’.
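A minimal sketch for parsing an image-tile name according to this convention, and recovering the parent image's file name, is shown below.

```python
# Minimal sketch for parsing an image-tile name into its parts, following the
# naming convention described above, and recovering the parent image name.
import re

TILE_NAME = re.compile(
    r"^(?P<region>[A-Za-z]{2})(?P<number>\d+)"       # XX## -> parent image XX##.png
    r"_(?P<a>\d+)_(?P<b>\d+)_(?P<c>\d+)_(?P<d>\d+)"  # pixel coordinates (can be ignored)
    r"(?:rot(?P<rotation>\d+))?$"                    # optional rotation, e.g. rot90
)

match = TILE_NAME.match("Bo12_0_0_256_256")
if match:
    parent_image = f"{match['region']}{match['number']}.png"
    rotation = match["rotation"] or "none"
    print(parent_image, "rotation:", rotation)       # -> Bo12.png rotation: none
```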
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The ParlaSpeech-HR dataset is built from parliamentary proceedings available in the Croatian part of the ParlaMint corpus and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. There are two transcripts available: the original one, and one normalised via a simple rule-based normaliser. Each of the transcripts contains word-level alignments to the recordings. Each segment has a reference to the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) via utterance IDs. If a segment is based on a single utterance, speaker information for that segment is available as well. There is speaker information available for 381,849 segments, i.e., 95% of all segments. Speaker information consists of all the speaker information available from the ParlaMint 2.1 corpus (name, party, gender, age, status, role). There are altogether 309 speakers in the dataset.
The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, with the goal of not losing speaker variety on dev data. Test data consist of 513 segments that come from 3 male (258 segments) and 3 female speakers (255 segments). There are no segments coming from the 6 test speakers in the two remaining subsets. The 22,076 instances not having speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set.
This dataset features over 20,000 high-quality images of parking lots sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of parking lot imagery.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.
Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on flower photography ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements such as particular flower species or geographic regions to be met efficiently.
Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of flower species, colors, and environmental settings. The images feature varied contexts, including natural habitats, gardens, bouquets, and urban landscapes, providing an unparalleled level of diversity.
High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.
Popularity Scores Each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.
AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.
Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.
Use Cases 1. Training AI systems for plant recognition and classification. 2. Enhancing agricultural AI models for plant health assessment and species identification. 3. Building datasets for educational tools and augmented reality applications. 4. Supporting biodiversity and conservation research through AI-powered analysis.
This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
This comes directly from the README:
The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:
CustomerID,Rating,Date
Movie information in "movie_titles.txt" is in the following format:
MovieID,YearOfRelease,Title
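A minimal sketch of a parser for one per-movie file in the training set, following the format described above, is shown below; the example file name is an assumption about how the files inside "training_set.tar" are named.

```python
# Minimal sketch of a parser for one per-movie training file in the format
# described above (first line "MovieID:", then "CustomerID,Rating,Date" rows).
import csv
from datetime import date

def parse_training_file(path):
    """Yield (movie_id, customer_id, rating, rating_date) tuples from one movie file."""
    with open(path, newline="") as fh:
        movie_id = int(fh.readline().rstrip().rstrip(":"))   # first line is "MovieID:"
        for row in csv.reader(fh):
            if not row:
                continue
            customer_id, rating, rating_date = row
            yield movie_id, int(customer_id), int(rating), date.fromisoformat(rating_date)

# Example; the per-movie file name is an assumption about the archive layout.
for record in parse_training_file("training_set/mv_0000001.txt"):
    print(record)
    break
```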
The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.
MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22
For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.
The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.
For example, if the qualifying dataset looked like:
111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07
then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0
which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.
You must make predictions for all customers for all movies in the qualifying dataset.
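A minimal sketch of a baseline submission writer is shown below: it reads "qualifying.txt" and emits a prediction file in the required movie/customer order, using an arbitrary constant rating as the placeholder prediction.

```python
# Minimal sketch of a baseline submission writer: read qualifying.txt and emit
# a prediction file in the required order, using a constant rating as the
# placeholder prediction (3.7 is arbitrary).
def write_predictions(qualifying_path, output_path,
                      predict=lambda movie_id, customer_id: 3.7):
    with open(qualifying_path) as src, open(output_path, "w") as dst:
        movie_id = None
        for line in src:
            line = line.rstrip()
            if line.endswith(":"):                     # movie header line, e.g. "111:"
                movie_id = int(line[:-1])
                dst.write(f"{movie_id}:\n")
            elif line:                                 # "CustomerID,Date" line
                customer_id = int(line.split(",")[0])
                dst.write(f"{predict(movie_id, customer_id):.1f}\n")

write_predictions("qualifying.txt", "predictions.txt")
```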
To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.
MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22
Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.
If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
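A minimal sketch of that RMSE calculation is shown below; the rating values are illustrative, standing in for the probe ratings looked up from the training data and the corresponding predictions.

```python
# Minimal sketch of the RMSE calculation mentioned above, comparing predicted
# ratings against true ratings for the probe (movie, customer) pairs.
# The inputs here are illustrative.
import math

true_ratings = [4, 3, 5, 2, 4]
predicted    = [3.8, 3.1, 4.2, 2.5, 3.9]

rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(true_ratings, predicted)) / len(true_ratings))
print(f"RMSE = {rmse:.4f}")
```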
The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt
The contest was originally hosted at http://netflixprize.com/index.html
The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar
This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ECMWF is now running its own Artificial Intelligence Forecasting System (AIFS). The AIFS consists of a deterministic model and an ensemble model. The deterministic model has been running operationally since 25 February 2025; further details can be found on the dedicated Implementation of AIFS Single v1 page.