100+ datasets found
  1. ChatQA-Training-Data

    • huggingface.co
    Updated Jun 30, 2023
    Cite
    NVIDIA (2023). ChatQA-Training-Data [Dataset]. https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Nvidia (http://nvidia.com/)
    Authors
    NVIDIA
    License

    https://choosealicense.com/licenses/other/

    Description

    Data Description

    We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated with GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website! See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
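
    The dataset is hosted on the Hugging Face Hub, so it can be pulled with the datasets library. The following is a minimal sketch, not a confirmed recipe: the data_dir value "sft" is an assumed example of one of the per-source subsets, so check the dataset page for the actual folder names.

        from datasets import load_dataset

        # Load one subset of the ChatQA training data from the Hub.
        # "sft" is an assumed subset name; see the dataset page for real ones.
        ds = load_dataset("nvidia/ChatQA-Training-Data", data_dir="sft", split="train")
        print(ds[0])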

  2. Training data from SPCAM for machine learning in moist physics

    • datadryad.org
    • zenodo.org
    zip
    Updated Aug 7, 2020
    + more versions
    Cite
    Guang Zhang; Yilun Han; Xiaomeng Huang; Yong Wang (2020). Training data from SPCAM for machine learning in moist physics [Dataset]. http://doi.org/10.6075/J0CZ35PP
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 7, 2020
    Dataset provided by
    Dryad
    Authors
    Guang Zhang; Yilun Han; Xiaomeng Huang; Yong Wang
    Time period covered
    2020
    Description

    The training samples for an entire year (from yr-2 of the simulation) are compressed in SPCAM_ML_Han_et_al_0.tar.gz, and the testing samples for an entire year (from yr-3 of the simulation) are compressed in SPCAM_ML_Han_et_al_1.tar.gz. Each dataset contains a data documentation file and 365 netCDF data files (one file for each day), marked by their dates. The variable fields contain temperature and moisture tendencies, cloud water, and cloud ice from the CRM, as well as vertical profiles of temperature and moisture and large-scale temperature and moisture tendencies from the dynamic core of SPCAM's host model CAM5 and from PBL diffusion. In addition, we include surface sensible and latent heat fluxes. For more details, please read the data documentation inside the tar.gz files.
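
    Once an archive is unpacked, the 365 daily files can be opened together. A minimal sketch with xarray follows; the glob pattern is an assumption about the extracted layout, and variable names should be taken from the bundled data documentation.

        import xarray as xr

        # Combine the daily netCDF files into one dataset along shared coordinates.
        ds = xr.open_mfdataset("SPCAM_ML_Han_et_al_0/*.nc", combine="by_coords")
        print(ds.data_vars)  # tendencies, profiles, surface fluxes, ...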

  3. Training and validation data from the AI for Critical Mineral Assessment Competition

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training and validation data from the AI for Critical Mineral Assessment Competition [Dataset]. https://catalog.data.gov/dataset/training-and-validation-data-from-the-ai-for-critical-mineral-assessment-competition
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    Extracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious process involving manual human effort. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training and validation data from the competition are provided here, as well as competition details and baseline solutions. The data are derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.

  4. Data Labeling Solution and Services Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 7, 2025
    Cite
    AMA Research & Media LLP (2025). Data Labeling Solution and Services Report [Dataset]. https://www.archivemarketresearch.com/reports/data-labeling-solution-and-services-52815
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    AMA Research & Media LLP
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Labeling Solution and Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of $70 billion by 2033. This significant expansion is fueled by the burgeoning need for high-quality training data to enhance the accuracy and performance of AI models. Key growth drivers include the expanding application of AI in various industries like automotive (autonomous vehicles), healthcare (medical image analysis), and financial services (fraud detection). The increasing availability of diverse data types (text, image/video, audio) further contributes to market growth. However, challenges such as the high cost of data labeling, data privacy concerns, and the need for skilled professionals to manage and execute labeling projects pose certain restraints on market expansion.

    Segmentation by application (automotive, government, healthcare, financial services, others) and data type (text, image/video, audio) reveals distinct growth trajectories within the market. The automotive and healthcare sectors currently dominate, but the government and financial services segments are showing promising growth potential.

    The competitive landscape is marked by a mix of established players and emerging startups. Companies like Amazon Mechanical Turk, Appen, and Labelbox are leading the market, leveraging their expertise in crowdsourcing, automation, and specialized data labeling solutions. However, the market shows strong potential for innovation, particularly in the development of automated data labeling tools and the expansion of services into niche areas. Regional analysis indicates strong market penetration in North America and Europe, driven by early adoption of AI technologies and robust research and development efforts. However, Asia-Pacific is expected to witness significant growth in the coming years, fueled by rapid technological advancements and a rising demand for AI solutions. Further investment in R&D focused on automation, improved data security, and the development of more effective data labeling methodologies will be crucial for unlocking the full potential of this rapidly expanding market.

  5. Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings

    • ieee-dataport.org
    • data.mendeley.com
    • +1more
    Updated Apr 22, 2021
    Cite
    Nahuel González (2021). Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings [Dataset]. http://doi.org/10.21227/7616-7964
    Explore at:
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    IEEE Dataport
    Authors
    Nahuel González
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". Source data contains CSV files with dataset results summaries, false positives lists, the evaluated sentences, and their keystroke timings. Results data contains training and evaluation ARFF files for each user and sentence, with the calculated Manhattan and Euclidean distances, R metric, and directionality index for each challenge instance.

    The source data comes from three free-text keystroke dynamics datasets used in previous studies by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY. The original KM dataset was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original PROSODY dataset was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.

    We proposed a method to determine, using only flight times (keydown/keydown), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate-text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased two- to three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive against current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.
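
    For intuition, the Manhattan and Euclidean distances named above compare a challenge sample against a reference sample of keydown/keydown flight times. A small sketch with hypothetical timings in milliseconds:

        import numpy as np

        # Flight times (keydown/keydown intervals) for the same sentence,
        # hypothetical values in milliseconds.
        reference = np.array([112.0, 95.0, 140.0, 88.0, 102.0])
        challenge = np.array([108.0, 99.0, 151.0, 84.0, 100.0])

        manhattan = np.sum(np.abs(reference - challenge))
        euclidean = np.linalg.norm(reference - challenge)
        print(f"Manhattan: {manhattan:.1f} ms, Euclidean: {euclidean:.1f} ms")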

  6. Satellite-Derived Training Data for Automated Flood Detection in the Continental U.S.

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Satellite-Derived Training Data for Automated Flood Detection in the Continental U.S. [Dataset]. https://catalog.data.gov/dataset/satellite-derived-training-data-for-automated-flood-detection-in-the-continental-u-s
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    Remotely sensed imagery is increasingly used by emergency managers to monitor and map the impact of flood events to support preparedness, response, and critical decision making throughout the flood event lifecycle. To reduce latency in delivery of imagery-derived information, ensure consistent and reliably derived map products, and facilitate processing of an increasing volume of remote sensing data-streams, automated flood mapping workflows are needed. The U.S. Geological Survey is facilitating the development and integration of machine-learning algorithms in collaboration with NASA, the National Geospatial-Intelligence Agency (NGA), the University of Alabama, and the University of Illinois to create a workflow for rapidly generating improved flood-map products.

    A major bottleneck to the training of robust, generalizable machine learning algorithms for pattern recognition is a lack of training data that is representative across the landscape. To overcome this limitation for the training of algorithms capable of detecting surface inundation in diverse contexts, this publication includes data developed from MAXAR Worldview sensors that serve as training data for machine learning.

    This data release consists of 100 thematic rasters, in GeoTiff format, with image labels representing five discrete categories: water, not water, maybe water, clouds, and background/no data. Specifically, these training data were created by labeling 8-band, multispectral scenes from the MAXAR-Digital Globe, Worldview-2 and 3 satellite-based sensors. Scenes were selected to be spatially and spectrally diverse and geographically representative of different water features within the continental U.S. The labeling procedures used a hybrid approach of unsupervised classification for the initial spectral clustering, followed by expert-level manual interpretation and QA/QC peer review to finalize each labeled image. Updated versions of the data may be issued along with version update documentation. The 100 raster files that make up the training data are available to download here (https://doi.org/10.5066/P9C7HYRV).
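
    A minimal sketch for tallying label frequencies in one of the thematic rasters with rasterio; the file name and the mapping of pixel values to the five classes are assumptions, so consult the release metadata.

        import numpy as np
        import rasterio

        # Read the single band of one training label raster.
        with rasterio.open("training_label_001.tif") as src:  # placeholder name
            labels = src.read(1)

        # Count pixels per class value (water, not water, maybe water, clouds, no data).
        values, counts = np.unique(labels, return_counts=True)
        for v, c in zip(values, counts):
            print(f"class value {v}: {c} pixels")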

  7. Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy'

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files for metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.
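
    The unzip step can be scripted; a one-line sketch in Python, with the archive name assumed to match the downloaded file:

        import zipfile

        # Extract the sample archive into the repository's data folder.
        zipfile.ZipFile("sample_dataset.zip").extractall("data/")  # placeholder name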

  8. Data Management Training Clearinghouse Metadata and Collection Statistics Report

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Hoebelheinrich, Nancy (2024). Data Management Training Clearinghouse Metadata and Collection Statistics Report [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7786963
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Benedict, Karl
    Hoebelheinrich, Nancy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains a snapshot of the learning resource metadata from ESIP's Data Management Training Clearinghouse (DMTC) associated with the closeout (March 30, 2023) of the Institute of Museum and Library Services-funded (Award Number: LG-70-18-0092-18) project "Development of an Enhanced and Expanded Data Management Training Clearinghouse". The shared metadata are a snapshot associated with the final reporting date for the project, and the associated data report is likewise based upon the same data snapshot from the same date.

    The materials included in the collection consist of the following:

    esip-dev-02.edacnm.org.json.zip - a zip archive containing the metadata for 587 published learning resources as of March 30, 2023. These metadata include all publicly available metadata elements for the published learning resources with the exception of the metadata elements containing individual email addresses (submitter and contact) to reduce the exposure of these data.

    statistics.pdf - an automatically generated report summarizing information about the collection of materials in the DMTC Clearinghouse, including both published and unpublished learning resources. This report includes the numbers of published and unpublished resources through time; the number of learning resources within subject categories and detailed subject categories, the dates items assigned to each category were first added to the Clearinghouse, and the most recent dates that items were added to each category; the distribution of learning resources across target audiences; and the frequency of keywords within the learning resource collection. This report is based on the metadata for published resources included in this collection, and on preliminary metadata for unpublished learning resources that are not included in the shared dataset.

    The metadata fields consist of the following:

        abstract_data - A brief synopsis or abstract about the learning resource.
        abstract_format - Declaration of how the abstract description will be represented.
        access_conditions - Conditions upon which the resource can be accessed beyond cost, e.g., login required.
        access_cost - Yes or No choice stating whether there is a fee for access to or use of the resource.
        accessibililty_features_name - Content features of the resource, such as accessible media, alternatives, and supported enhancements for accessibility.
        accessibililty_summary - A human-readable summary of specific accessibility features or deficiencies.
        author_names - List of authors for a resource, derived by the system from the given/first and family/last names of the personal author fields.
        author_org (name; name_identifier; name_identifier_type) - Name of the organization authoring the learning resource; the unique identifier for that organization; and the identifier scheme associated with that identifier.
        authors (givenName; familyName; name_identifier; name_identifier_type) - Given/first and last/family names of the person(s) authoring the resource; the unique identifier for the person(s); and the identifier scheme associated with that identifier, e.g., ORCID.
        citation - Preferred form of citation.
        completion_time - Intended time to complete.
        contact (name; org; email) - Name of the person(s) and organization asserted as the contact(s) for the resource in case of questions or follow-up by resource users; the contact email address is excluded.
        contributor_orgs (name; name_identifier; name_identifier_type; type) - Name of an organization that is a secondary contributor to the learning resource (a contributor can also be an individual person); the unique identifier for that organization; the identifier scheme associated with that identifier; and the type of contribution made by the organization.
        contributors (familyName; givenName; name_identifier; name_identifier_type) - Last/family and given/first names of the person(s) contributing to the resource; the unique identifier for the person(s); and the identifier scheme associated with that identifier, e.g., ORCID.
        contributors.type - Type of contribution to the resource made by a person.
        created - The date on which the metadata record was first saved as part of the input workflow.
        creator - The name of the person creating the MD record for a resource.
        credential_status - Declaration of whether a credential is offered for completion of the resource.
        ed_frameworks (name; description; nodes.name) - The name of the educational framework to which the resource is aligned, if any (an educational framework is a structured description of educational concepts such as a shared curriculum, syllabus, or set of learning objectives, or a vocabulary for describing some other aspect of education such as educational levels or reading ability); a description of one or more subcategories of the framework to which the resource is associated; and the name of such a subcategory.
        expertise_level - The skill level targeted for the topic being taught.
        id - Unique identifier for the MD record, generated by the system in UUID format.
        keywords - Important phrases or words used to describe the resource.
        language_primary - Original language in which the learning resource being described is published or made available.
        languages_secondary - Additional languages in which the resource is translated or made available, if any.
        license - A license for use that applies to the resource, typically indicated by URL.
        locator_data - The identifier for the learning resource used as part of a citation, if available.
        locator_type - Designation of the citation locator type, e.g., DOI, ARK, Handle.
        lr_outcomes - Descriptions of what knowledge, skills, or abilities students should learn from the resource.
        lr_type - A characteristic that describes the predominant type or kind of learning resource.
        media_type - Media type of the resource.
        modification_date - System-generated date and time when the MD record is modified.
        notes - MD record input notes.
        pub_status - Status of the metadata record within the system, i.e., in-process, in-review, pre-pub-review, deprecate-request, deprecated, or published.
        published - Date of first broadcast/publication.
        publisher - The organization credited with publishing or broadcasting the resource.
        purpose - The purpose of the resource in the context of education, e.g., instruction, professional education, assessment.
        rating - The aggregation of input from all user assessments evaluating users' reaction to the learning resource, following Kirkpatrick's model of training evaluation.
        ratings - Inputs from users assessing each user's reaction to the learning resource, following Kirkpatrick's model of training evaluation.
        resource_modification_date - Date on which the resource was last modified from the original published or broadcast version.
        status - System-generated publication status of the resource within the registry: yes for published, no for not published.
        subject - Subject domain(s) toward which the resource is targeted; there may be more than one value.
        submitter_email - (excluded) Email address of the person who submitted the resource.
        submitter_name - Submission contact person.
        target_audience - Audience(s) for which the resource is intended.
        title - The name of the resource.
        url - URL that resolves to a downloadable version of the learning resource or to a landing page containing important contextual information, including the direct resolvable link to the resource, if applicable.
        usage_info - Descriptive information about using the resource not addressed by the license information field.
        version - The specific version of the resource, if declared.
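
    A minimal sketch for inspecting the published-resource metadata, assuming the zip archive holds a single JSON document whose records carry the fields listed above:

        import json
        import zipfile

        with zipfile.ZipFile("esip-dev-02.edacnm.org.json.zip") as zf:
            name = zf.namelist()[0]          # the single JSON file (assumed)
            records = json.loads(zf.read(name))

        # Print a few fields from the field list above.
        for rec in records[:5]:
            print(rec.get("title"), "|", rec.get("subject"), "|", rec.get("lr_type"))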
    
  9. Business training; participation, costs and policy

    • data.subak.org
    • data.europa.eu
    atom feed, json
    Updated Feb 15, 2023
    Cite
    Ministerie van Binnenlandse Zaken en Koninkrijksrelaties (2023). Business training; participation, costs and policy [Dataset]. https://data.subak.org/dataset/business-training-participation-costs-and-policy
    Explore at:
    Available download formats: atom feed, json
    Dataset updated
    Feb 15, 2023
    Dataset provided by
    Ministry of the Interior and Kingdom Relations of the Netherlands (https://www.government.nl/ministries/ministry-of-the-interior-and-kingdom-relations)
    Description

    This table provides information on, among other things, participation and expenditure in courses, the training policies of companies and the quality assurance of business training provided. The table also sets out the reasons why some companies did not offer company training.

    The data come from the Business Training Survey and relate to the reporting years 2010 and 2015. The survey was conducted among private-sector companies with 10 or more employees. The sectors Public administration and compulsory social insurance, Education, and Health and welfare were excluded, as were companies in the Agriculture, forestry and fisheries sector. The figures can be broken down by activity of the companies (SBI 2008) and by company size class.

    From 2015 onwards, a number of characteristics of company training are no longer surveyed: whether the company has its own training centre, whether the training needs of individual employees are identified, and the different ways in which the quality of company training is ensured.

    A number of features are not directly comparable between 2010 and 2015. These are the key future skills, the skills covered within current courses, and the implementing agencies of external courses. For these questionnaire items (where several answers are possible), it is no longer possible from 2015 to indicate a single most important choice. As a result, the percentages can total more than 100 percent in 2015.

    Data available from: 2010 to 2015

    Status of the figures: The figures in this table are final.

    Changes as of 8 December 2017: The figures of the sectors C Industry, B-E Industry (no construction) and energy, and B-F Industry and energy for the reporting year 2015 have been adjusted. Erroneously, these aggregates did not include the figures of the underlying industry 31-33 Other industry and repair. This has been corrected in this version.

    Changes as of 6 December 2017: Figures for 2015 have been added. The subjects covering expenditure per male and per female worker have been removed from the table, as the data were insufficiently reliable.

    When are new figures coming? The new figures for 2020 are expected in 2022.

  10. AI Data Labeling Solution Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    Cite
    AMA Research & Media LLP (2025). AI Data Labeling Solution Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-data-labeling-solution-56186
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    AMA Research & Media LLP
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI data labeling solutions market is experiencing robust growth, driven by the increasing demand for high-quality data to train and improve the accuracy of artificial intelligence algorithms. The market size in 2025 is estimated at $5 billion, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of AI applications across diverse sectors, including automotive, healthcare, and finance, necessitates vast amounts of labeled data. Cloud-based solutions are gaining prominence due to their scalability, cost-effectiveness, and accessibility. Furthermore, advancements in data annotation techniques and the emergence of specialized AI data labeling platforms are contributing to market expansion. However, challenges such as data privacy concerns, the need for highly skilled professionals, and the complexities of handling diverse data formats continue to restrain market growth to some extent.

    The market segmentation reveals that the cloud-based solutions segment is expected to dominate due to its inherent advantages over on-premise solutions. In terms of application, the automotive sector is projected to exhibit the fastest growth, driven by the increasing adoption of autonomous driving technology and advanced driver-assistance systems (ADAS). The healthcare industry is also a major contributor, with the rise of AI-powered diagnostic tools and personalized medicine driving demand for accurate medical image and data labeling.

    Geographically, North America currently holds a significant market share, but the Asia-Pacific region is poised for rapid growth owing to increasing investments in AI and technological advancements. The competitive landscape is marked by a diverse range of established players and emerging startups, fostering innovation and competition within the market. The continued evolution of AI and its integration across various industries ensures the continued expansion of the AI data labeling solution market in the coming years.
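
    For reference, the compound-growth arithmetic behind projections like these can be sketched in a few lines, taking the stated base size and CAGR as inputs (the printed years are illustrative):

        # Project a market size forward at a constant compound annual growth rate.
        base_size_usd_bn = 5.0  # estimated 2025 market size, USD billions
        cagr = 0.25             # 25% compound annual growth rate

        for year in (2026, 2029, 2033):
            projected = base_size_usd_bn * (1 + cagr) ** (year - 2025)
            print(f"{year}: ${projected:.1f}B")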

  11. CT School Learning Model Indicators by County (7-day metrics) - ARCHIVE

    • catalog.data.gov
    • data.ct.gov
    • +1more
    Updated Aug 12, 2023
    Cite
    data.ct.gov (2023). CT School Learning Model Indicators by County (7-day metrics) - ARCHIVE [Dataset]. https://catalog.data.gov/dataset/ct-school-learning-model-indicators-by-county-7-day-metrics-archive
    Explore at:
    Dataset updated
    Aug 12, 2023
    Dataset provided by
    data.ct.gov
    Area covered
    Connecticut
    Description

    DPH note about change from 7-day to 14-day metrics: As of 10/15/2020, this dataset is no longer being updated. Starting on 10/15/2020, the school learning model indicator metrics will be calculated using a 14-day average rather than a 7-day average. The new school learning model indicators dataset using 14-day averages can be accessed here: https://data.ct.gov/Health-and-Human-Services/CT-School-Learning-Model-Indicators-by-County-14-d/e4bh-ax24

    We are learning more about COVID-19 all the time, including the best ways to measure COVID-19 activity in our communities. CT DPH has decided to shift to 14-day rates because these are more stable, particularly at the town level, as compared to 7-day rates. In addition, since the school indicators were initially published by DPH last summer, CDC has recommended 14-day rates, and other states (e.g., Massachusetts) have started to implement 14-day metrics for monitoring COVID transmission as well. With respect to geography, we have also learned that many people are looking at the town-level data to inform decision making, despite the emphasis on county-level metrics in the published addenda. This is understandable, as there has been variation in COVID-19 activity within counties (for example, rates that are higher in one town than in most other towns in the county).

    This dataset includes the leading and secondary metrics identified by the Connecticut Department of Public Health (DPH) and the Department of Education (CSDE) to support local district decision-making on the level of in-person, hybrid (blended), and remote learning for Pre K-12 education. Data represent daily averages for each week by date of specimen collection (cases and positivity), date of hospital admission, or date of ED visit. Hospitalization data come from the Connecticut Hospital Association and are based on hospital location, not county of patient residence. COVID-19-like illness includes fever and cough, or shortness of breath or difficulty breathing, or the presence of a coronavirus diagnosis code, and excludes patients with influenza-like illness. All data are preliminary.

    These data are updated weekly; the reporting period for each dataset is the previous Sunday-Saturday, known as an MMWR week (https://wwwn.cdc.gov/nndss/document/MMWR_week_overview.pdf). The date listed is the date the dataset was last updated and corresponds to a reporting period of the previous MMWR week. For instance, the data for 8/20/2020 correspond to a reporting period of 8/9/2020-8/15/2020.

    These metrics were adapted from recommendations by the Harvard Global Health Institute and supplemented by existing DPH measures. For national data on COVID-19, see COVIDView, the national weekly surveillance summary of U.S. COVID-19 activity, at https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html

    Notes: 9/25/2020: Data for Mansfield and Middletown for the week of Sept 13-19 were unavailable at the time of reporting due to delays in lab reporting.
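
    For readers reproducing the 7-day versus 14-day averaging discussed in the DPH note, a minimal pandas sketch with hypothetical daily case counts:

        import pandas as pd

        # Hypothetical daily case counts over two MMWR weeks.
        daily = pd.Series(
            [12, 15, 9, 20, 18, 14, 11, 16, 22, 19, 13, 17, 21, 15],
            index=pd.date_range("2020-08-09", periods=14),
        )

        print(daily.rolling(7).mean().iloc[-1])   # 7-day average (more volatile)
        print(daily.rolling(14).mean().iloc[-1])  # 14-day average (more stable)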

  12. Deep Learning Road Surfaces 2023 (LGATE-482) - Datasets - data.wa.gov.au

    • catalogue.data.wa.gov.au
    Updated Dec 31, 2023
    + more versions
    Cite
    (2023). Deep Learning Road Surfaces 2023 (LGATE-482) - Datasets - data.wa.gov.au [Dataset]. https://catalogue.data.wa.gov.au/dataset/deep-learning-road-surfaces-2023
    Explore at:
    Dataset updated
    Dec 31, 2023
    Area covered
    Western Australia
    Description

    Road Surfaces derived from imagery captured across Western Australia using a deep learning algorithm. The algorithm was created in-house within Landgate to demonstrate to approved State and Local Government (Government) entities the capabilities of Deep Learning [DL] and Artificial Intelligence [AI] algorithms.

  13. 750K+ Furniture Images | AI Training Data | Object Detection Data | Annotated imagery data | Global Coverage

    • datarade.ai
    Updated Dec 22, 2007
    Cite
    Image Datasets (2007). 750K+ Furniture Images | AI Training Data | Object Detection Data | Annotated imagery data | Global Coverage [Dataset]. https://datarade.ai/data-products/500k-furniture-images-object-detection-data-full-exif-da-image-datasets
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 22, 2007
    Dataset authored and provided by
    Image Datasets
    Area covered
    Slovakia, Suriname, Fiji, Liberia, Burkina Faso, Indonesia, Saint Kitts and Nevis, Tunisia, Uruguay, Haiti
    Description

    This dataset features over 750,000 high-quality images of furniture sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of furniture imagery.

    Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on furniture photography ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements such as particular furniture types or geographic regions to be met efficiently.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of furniture styles, colors, and settings. The images feature varied indoor and outdoor contexts, providing an unparalleled level of diversity.

    4. High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.

    6. AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.

    7. Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.

    Use Cases: 1. Training AI systems for furniture recognition and classification. 2. Enhancing object detection models for indoor scene understanding. 3. Building datasets for educational tools and augmented reality applications. 4. Supporting design and e-commerce research through AI-powered analysis.

    This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!
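
    A minimal sketch for reading the EXIF capture settings described under Key Features, using Pillow; the file name is a placeholder, and tag coverage varies by image:

        from PIL import Image
        from PIL.ExifTags import TAGS

        # Read EXIF tags (aperture, ISO, shutter speed, focal length, ...).
        with Image.open("furniture_000001.jpg") as im:  # placeholder file name
            exif = im.getexif()

        for tag_id, value in exif.items():
            print(TAGS.get(tag_id, tag_id), value)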

  14. Model performance per dataset.

    • plos.figshare.com
    xls
    Updated May 2, 2024
    + more versions
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Model performance per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0299583.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
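
    The corrective step the authors argue for, de-duplicating before generating k-fold splits so no molecule appears in both a training and a testing fold, can be sketched as follows (file and column names are hypothetical):

        import pandas as pd
        from sklearn.model_selection import KFold

        df = pd.read_csv("kegg_smiles.csv")  # hypothetical file
        df = df.drop_duplicates(subset="smiles").reset_index(drop=True)

        for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
            ...  # train on df.iloc[train_idx], evaluate on df.iloc[test_idx]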

  15. 15M+ Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage

    • datarade.ai
    Cite
    Image Datasets, 15M+ Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/2m-images-annotated-imagery-data-full-exif-data-object-image-datasets
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Image Datasets
    Area covered
    Albania, Malta, Brunei Darussalam, Chad, United States Minor Outlying Islands, New Zealand, Mexico, Georgia, Qatar, Anguilla
    Description

    This dataset features over 15,000,000 high-quality images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of imagery.

    Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Themed competitions ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements such as particular subjects or geographic regions to be met efficiently.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of subjects, styles, and environmental settings. The images feature varied contexts, including natural habitats and urban landscapes, providing an unparalleled level of diversity.

    4. High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.

    6. AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.

    7. Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.

    Use Cases: 1. Training AI systems for image recognition and classification. 2. Enhancing object and scene detection models across diverse domains. 3. Building datasets for educational tools and augmented reality applications. 4. Supporting research applications through AI-powered image analysis.

    This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!

  16. Satellite images and road-reference data for AI-based road mapping in Equatorial Asia

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Apr 4, 2024
    Cite
    Sean Sloan; Raiyan Talkhani; Tao Huang; Jayden Engert; William Laurance (2024). Satellite images and road-reference data for AI-based road mapping in Equatorial Asia [Dataset]. http://doi.org/10.5061/dryad.bvq83bkg7
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Vancouver Island University
    James Cook University
    Authors
    Sean Sloan; Raiyan Talkhani; Tao Huang; Jayden Engert; William Laurance
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Area covered
    Asia
    Description

    For the purposes of training AI-based models to identify (map) road features in rural/remote tropical regions on the basis of true-colour satellite imagery, and subsequently testing the accuracy of these AI-derived road maps, we produced a dataset of 8904 satellite image ‘tiles’ and their corresponding known road features across Equatorial Asia (Indonesia, Malaysia, Papua New Guinea).

    Methods

    1. INPUT 200 SATELLITE IMAGES

    The main dataset shared here was derived from a set of 200 input satellite images, also provided here. These 200 images are effectively ‘screenshots’ (i.e., reduced-resolution copies) of high-resolution true-colour satellite imagery (~0.5-1m pixel resolution) observed using the Elvis Elevation and Depth spatial data portal (https://elevation.fsdf.org.au/), which here is functionally equivalent to the more familiar Google Earth. Each of these original images was initially acquired at a resolution of 1920x886 pixels. Actual image resolution was coarser than the native high-resolution imagery. Visual inspection of these 200 images suggests a pixel resolution of ~5 meters, given the number of pixels required to span features of familiar scale, such as roads and roofs, as well as the ready discrimination of specific land uses, vegetation types, etc. These 200 images generally spanned either forest-agricultural mosaics or intact forest landscapes with limited human intervention. Sloan et al. (2023) present a map indicating the various areas of Equatorial Asia from which these images were sourced.
    IMAGE NAMING CONVENTION

    A common naming convention applies to satellite images’ file names: XX##.png, where:

    XX – denotes the geographical region / major island of Equatorial Asia of the image, as follows: ‘bo’ (Borneo), ‘su’ (Sumatra), ‘sl’ (Sulawesi), ‘pn’ (Papua New Guinea), ‘jv’ (Java), ‘ng’ (New Guinea [i.e., Papua and West Papua provinces of Indonesia])

    ## – denotes the ith image for a given geographical region / major island amongst the original 200 images, e.g., bo1, bo2, bo3…

    2. INTERPRETING ROAD FEATURES IN THE IMAGES

    For each of the 200 input satellite images, road features were visually interpreted and manually digitized to create a reference image dataset by which to train, validate, and test AI road-mapping models, as detailed in Sloan et al. (2023). The reference dataset of road features was digitized using the ‘pen tool’ in Adobe Photoshop. The pen’s ‘width’ was held constant over varying scales of observation (i.e., image ‘zoom’) during digitization. Consequently, at relatively small scales at least, digitized road features likely incorporate vegetation immediately bordering roads. The resultant binary (Road / Not Road) reference images were saved as PNG images with the same image dimensions as the original 200 images.

    3. IMAGE TILES AND REFERENCE DATA FOR MODEL DEVELOPMENT

    The 200 satellite images and the corresponding 200 road-reference images were both subdivided (aka ‘sliced’) into thousands of smaller image ‘tiles’ of 256x256 pixels each. Subsequent to image subdivision, subdivided images were also rotated by 90, 180, or 270 degrees to create additional, complementary image tiles for model development. In total, 8904 image tiles resulted from image subdivision and rotation. These 8904 image tiles are the main data of interest disseminated here. Each image tile entails the true-colour satellite image (256x256 pixels) and a corresponding binary road reference image (Road / Not Road).
    Of these 8904 image tiles, Sloan et al. (2023) randomly selected 80% for model training (during which a model ‘learns’ to recognize road features in the input imagery), 10% for model validation (during which model parameters are iteratively refined), and 10% for final model testing (during which the final accuracy of the output road map is assessed). Here we present these data in two folders accordingly:

    ‘Training’ – contains 7124 image tiles used for model training in Sloan et al. (2023), i.e., 80% of the original pool of 8904 image tiles.

    ‘Testing’ – contains 1780 image tiles used for model validation and model testing in Sloan et al. (2023), i.e., 20% of the original pool of 8904 image tiles, being the combined set of image tiles for model validation and testing in Sloan et al. (2023).

    IMAGE TILE NAMING CONVENTION

    A common naming convention applies to image tiles’ directories and file names, in both the ‘training’ and ‘testing’ folders: XX##_A_B_C_DrotDDD, where

    XX – denotes the geographical region / major island of Equatorial Asia of the original input 1920x886 pixel image, as follows: ‘bo’ (Borneo), ‘su’ (Sumatra), ‘sl’ (Sulawesi), ‘pn’ (Papua New Guinea), ‘jv’ (Java), ‘ng’ (New Guinea [i.e., Papua and West Papua provinces of Indonesia])

    ## – denotes the ith image for a given geographical region / major island amongst the original 200 images, e.g., bo1, bo2, bo3…

    A, B, C and D – can all be ignored. These values, which are one of 0, 256, 512, 768, 1024, 1280, 1536, and 1792, are effectively ‘pixel coordinates’ in the corresponding original 1920x886-pixel input image. They were recorded within the names of image tiles’ sub-directories and file names merely to ensure that those names were unique.

    rot – indicates an image rotation. Not all image tiles are rotated, so ‘rot’ will appear only occasionally.

    DDD – denotes the degree of image-tile rotation, e.g., 90, 180, 270. Not all image tiles are rotated, so ‘DDD’ will appear only occasionally.

    Note that the designator ‘XX##’ is directly equivalent to the filenames of the corresponding 1920x886-pixel input satellite images, detailed above. Therefore, each image tile can be ‘matched’ with its parent full-scale satellite image. For example, in the ‘training’ folder, the subdirectory ‘Bo12_0_0_256_256’ indicates that its image tile therein (also named ‘Bo12_0_0_256_256’) was sourced from the full-scale image ‘Bo12.png’.
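
    A small sketch that unpacks this naming convention (region code, source-image number, the four pixel coordinates, and the optional rotation suffix):

        import re

        TILE = re.compile(r"^([a-z]{2})(\d+)_(\d+)_(\d+)_(\d+)_(\d+)(?:rot(\d+))?$", re.I)

        def parse_tile(name):
            """Split a tile name such as 'Bo12_0_0_256_256' into its parts."""
            region, img, a, b, c, d, rot = TILE.match(name).groups()
            return {
                "region": region,                      # e.g., 'bo' (Borneo)
                "image": int(img),                     # parent image number
                "coords": (int(a), int(b), int(c), int(d)),
                "rotation_deg": int(rot) if rot else 0,
            }

        print(parse_tile("Bo12_0_0_256_256"))  # parent image: Bo12.png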

  17. Data from: ASR training dataset for Croatian ParlaSpeech-HR v1.0

    • live.european-language-grid.eu
    binary format
    Updated Apr 3, 2022
    Cite
    (2022). ASR training dataset for Croatian ParlaSpeech-HR v1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20176
    Explore at:
    Available download formats: binary format
    Dataset updated
    Apr 3, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The ParlaSpeech-HR dataset is built from parliamentary proceedings available in the Croatian part of the ParlaMint corpus and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. There are two transcripts available: the original one, and one normalised via a simple rule-based normaliser. Each of the transcripts contains word-level alignments to the recordings. Each segment has a reference to the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) via utterance IDs. If a segment is based on a single utterance, speaker information for that segment is available as well. There is speaker information available for 381,849 segments, i.e., 95% of all segments. Speaker information consists of all the speaker information available from the ParlaMint 2.1 corpus (name, party, gender, age, status, role). There are altogether 309 speakers in the dataset.

    The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, chosen so as not to lose speaker variety in the dev data. Test data consist of 513 segments coming from 3 male speakers (258 segments) and 3 female speakers (255 segments). No segments from the 6 test speakers appear in the two remaining subsets. The 22,076 instances without speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set.

  18. 20K+ Parking Lots Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage

    • datarade.ai
    Updated Nov 29, 2024
    Cite
    Image Datasets (2024). 20K+ Parking Lots Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/20k-parking-lots-images-machine-learning-ml-data-full-image-datasets
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Nov 29, 2024
    Dataset authored and provided by
    Image Datasets
    Area covered
    Malta, Bolivia (Plurinational State of), Slovakia, Liechtenstein, Saudi Arabia, El Salvador, Saint Pierre and Miquelon, Botswana, Saint Helena, Pitcairn
    Description

    This dataset features over 20,000 high-quality images of parking lots sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a diverse and richly annotated collection of parking-lot imagery.

    Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on parking lot photography ensure fresh, relevant, and high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing specific requirements, such as particular lot types or geographic regions, to be met efficiently.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, ensuring a vast array of parking lot types, layouts, and environmental settings. The images feature varied contexts and surroundings, providing an unparalleled level of diversity.

    4. High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a mix of artistic and practical perspectives suitable for a variety of applications.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on user preferences or engagement trends.

    6. AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in tasks such as image recognition, classification, and segmentation. It is compatible with a wide range of machine learning frameworks and workflows, ensuring seamless integration into your projects.

    7. Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.
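    Since every image ships with full EXIF metadata, a quick way to inspect it is with Pillow (pip install Pillow). The following is a minimal sketch, not the vendor's tooling; the filename is hypothetical and the tags present depend on what each camera recorded.

    from PIL import Image, ExifTags

    img = Image.open("parking_lot_0001.jpg")   # hypothetical filename
    exif = img.getexif()
    for tag_id, value in exif.items():         # base IFD: Make, Model, DateTime, ...
        print(f"{ExifTags.TAGS.get(tag_id, tag_id)}: {value}")
    # Exposure details (aperture, ISO, shutter speed) live in the Exif sub-IFD:
    for tag_id, value in exif.get_ifd(0x8769).items():
        print(f"{ExifTags.TAGS.get(tag_id, tag_id)}: {value}")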

    Use Cases: 1. Training AI systems for parking lot recognition and occupancy analysis. 2. Enhancing smart-city and traffic-management models with real-world parking imagery. 3. Building datasets for navigation, mapping, and autonomous-driving tools. 4. Supporting urban-planning research through AI-powered analysis.

    This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models, tailored to deliver exceptional performance for your projects. Customizations are available to suit specific project needs. Contact us to learn more!

  19. Netflix Prize data

    • kaggle.com
    zip
    Updated Jul 19, 2017
    Cite
    Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Explore at:
    zip (0 bytes)
    Dataset updated
    Jul 19, 2017
    Dataset authored and provided by
    Netflixhttp://netflix.com/
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
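    Given this layout, a per-movie file can be parsed with the standard library alone. This is a minimal sketch under the format described above; the filename is a placeholder.

    import csv
    import datetime

    def parse_movie_file(path):
        ratings = []
        with open(path) as f:
            # First line is the movie id followed by a colon, e.g. "1:".
            movie_id = int(f.readline().rstrip().rstrip(":"))
            # Remaining lines are CustomerID,Rating,Date.
            for customer_id, rating, date in csv.reader(f):
                ratings.append((movie_id, int(customer_id), int(rating),
                                datetime.date.fromisoformat(date)))
        return ratings

    rows = parse_movie_file("training_set/mv_0000001.txt")  # hypothetical path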

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.
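    One pitfall when parsing "movie_titles.txt" is that titles can themselves contain commas, so each line should be split on the first two commas only. A minimal sketch (the latin-1 encoding is an assumption for any non-ASCII characters in titles):

    def parse_movie_titles(path):
        movies = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                # maxsplit=2 keeps commas inside the title intact.
                movie_id, year, title = line.rstrip("\n").split(",", 2)
                movies[int(movie_id)] = (year, title)
        return movies

    movies = parse_movie_titles("movie_titles.txt")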

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:
    CustomerID11,Date11
    CustomerID12,Date12
    ...
    MovieID2:
    CustomerID21,Date21
    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:
    3245,2005-12-19
    5666,2005-12-23
    6789,2005-03-14
    225:
    1234,2005-05-26
    3456,2005-11-07

    then a prediction file should look something like:

    111:
    3.0
    3.4
    4.0
    225:
    1.0
    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.
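    Putting the two format rules together, a submission file can be generated by echoing each movie header and replacing each CustomerID,Date line with a predicted rating. The sketch below uses a constant placeholder rating; a real model would go in the predict callback.

    def write_predictions(qualifying_path, out_path,
                          predict=lambda movie, customer, date: 3.6):
        with open(qualifying_path) as qf, open(out_path, "w") as out:
            movie_id = None
            for line in qf:
                line = line.rstrip("\n")
                if line.endswith(":"):            # movie header, e.g. "111:"
                    movie_id = int(line[:-1])
                    out.write(line + "\n")
                else:                             # "CustomerID,Date" line
                    customer_id, date = line.split(",")
                    out.write(f"{predict(movie_id, customer_id, date):.1f}\n")

    write_predictions("qualifying.txt", "predictions.txt")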

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:
    CustomerID11
    CustomerID12
    ...
    MovieID2:
    CustomerID21
    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
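    RMSE here is the root mean squared error between predicted and actual ratings. A minimal sketch, assuming the probe ratings have been looked up in the training set and gathered into two parallel lists:

    import math

    def rmse(predicted, actual):
        assert len(predicted) == len(actual)
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                         / len(actual))

    print(rmse([3.0, 3.4, 4.0], [4, 3, 4]))  # toy example, not probe data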

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  20. AIFS Machine Learning data

    • ecmwf.int
    application/x-grib +1
    Updated Jan 1, 2023
    Cite
    European Centre for Medium-Range Weather Forecasts (2023). AIFS Machine Learning data [Dataset]. https://www.ecmwf.int/en/forecasts/dataset/aifs-machine-learning-data
    Explore at:
    application/x-grib (1 dataset), nc (1 dataset)
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    European Centre for Medium-Range Weather Forecastshttp://ecmwf.int/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ECMWF is now running its own Artificial Intelligence Forecasting System (AIFS). The AIFS consists of a deterministic model and an ensemble model. The deterministic model has been running operationally since 25 February 2025; further details can be found on the dedicated Implementation of AIFS Single v1 page.
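    The data are distributed as GRIB (and netCDF), so a downloaded file can be inspected with xarray's cfgrib engine (pip install xarray cfgrib). A minimal sketch, assuming a locally downloaded file; the filename and variable name are assumptions, not guaranteed by the dataset:

    import xarray as xr

    ds = xr.open_dataset("aifs_single_forecast.grib", engine="cfgrib")
    print(ds)             # inspect available variables and coordinates
    t2m = ds.get("t2m")   # 2 m temperature, if present under this name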
