100+ datasets found

d
my.Harvard Operational Data Store
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
My.Harvard Support (2023). my.Harvard Operational Data Store [Dataset]. http://doi.org/10.7910/DVN/VX5Y9G
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/VX5Y9G
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
My.Harvard Support
Description
This entry provides access to the data elements available in the Operational Data Store (ODS) for my.Harvard Student Information System. These data are available through a request process. What are the goals of the Operational Data Store? Provide data in a more real-time environment than the Warehouse (refresh 1x a day) while not putting additional load on the transactional my.harvard system. Provide a single (university-wide) standard set of exports and then web-services for retrieving key Student data. Provide the ability to incrementally load the SIS Data warehouse star schemas, making it possible to refresh certain stars more than once a day. Provide Institutional Research and Registrar power-users the ability to investigate the Student data via direct SQL access. What is the SIS Operational Data Store (SIS ODS)? A database schema on the SIS Datawarehouse that will contain replicated core tables of the my.harvard transactional system along with standardized, simplified and performant views for extracting that data. We intend to make most data available through web services before the end of academic year 2015-2016. However, our first iteration will to be make data available via db views. The refresh schedule for the SIS ODS tables for this first release will be: Academic Class Data - 1x a day between 5:30am and 6:00am. What data will be available in the SIS ODS? ODS - Academic Class v SISODS_1.0.6.xlsx follow link to get to older versions ODS - Bio Demo v SISODS_1.0.5.xlsx follow link to get to older versions ODS - Class Enrollment.xlsx ODS - Student Career Program Plan v SISODS_1.0.6.xlsx ODS - Admissions v. SISODS_1.0.7 Document coming Snapshots - non-FAS. For FAS Snapshots, please contact Harvard College Institutional Research. How can I request access to the SIS ODS? Send an email to myharvard_support@harvard.edu to request access Please indicate what data you want to access through the ODS: School & Component Available components: Academic Class (course descriptors). Biographic - Demographic Class Enrollment Student Career Program Plan Please indicate whether the request is for a personal account or for an application integration account. For personal accounts, please provide the HUIDs of the individuals to be set up. How do I connect to the SIS ODS? SIS ODS connections are currently limited to ODBC/JDBC connections to a database. The attached instructions explain how to install SQL Developer and configure a connection.
T
mnist
tensorflow.org
universe.roboflow.com
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
Explore at:
Dataset updated
Jun 1, 2024
Description
The MNIST database of handwritten digits.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('mnist', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png" alt="Visualization" width="500px">
Level Crossing Warning Bell (LCWB) Dataset
zenodo.org
explore.openaire.eu
+1more
Updated May 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorenzo De Donato; Lorenzo De Donato; Valeria Vittorini; Valeria Vittorini; Francesco Flammini; Francesco Flammini; Stefano Marrone; Stefano Marrone (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7945412
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7945412
Dataset updated
May 20, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lorenzo De Donato; Lorenzo De Donato; Valeria Vittorini; Valeria Vittorini; Francesco Flammini; Francesco Flammini; Stefano Marrone; Stefano Marrone
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Acknowledgement
These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

Disclaimers
The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

General Info
The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

When using any of these data, please mention:

De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

Content of the folder
This folder contains the following subfolders and files.

"Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;

NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;

GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

"LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;

The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;

DatasetDistribution.json simply incorporates the content of the aforementioned JSON files in a unique file that can be exploited to obtain exactly the same dataset we adopted in our analyses.

"Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.

Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

CSV Files Structure
Each "XX_labels.csv" file contains, for each entry, the following information:

The identifier ("index") of the sub-class (which is not relevant in our case);

The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;

The extended name of the class ("display_name").

Worth mentioning, sub-classes do not have a specific purpose in our task. They have been kept to maintain as much as possible the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structures of "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

Indeed, each "XX_data.csv" file contains, for each entry, the following information:

ID: the identifier of the entry;

YTID: the YouTube identifier of the video;

start_seconds and end_seconds: which delimit the portion of audio (extracted from YTID) which is of interest for this task;

positive_labels: the label(s) associated with the audio.

Credits
The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Particularly, from AudioSet, we retrieved:

The structure of the CSV files as discussed above.

Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. Then, the related "XX_labels.csv" files have been created consequently.

More about downloading the AudioSet dataset can be found here.
T
plant_village
tensorflow.org
opendatalab.com
+1more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). plant_village [Dataset]. http://identifiers.org/arxiv:1511.08060
Explore at:
Unique identifier
https://identifiers.org/arxiv:1511.08060
Dataset updated
Jun 1, 2024
Description
The PlantVillage dataset consists of 54303 healthy and unhealthy leaf images divided into 38 categories by species and disease.

NOTE: The original dataset is not available from the original source (plantvillage.org), therefore we get the unaugmented dataset from a paper that used that dataset and republished it. Moreover, we dropped images with Background_without_leaves label, because these were not present in the original dataset.

Original paper URL: https://arxiv.org/abs/1511.08060 Dataset URL: https://data.mendeley.com/datasets/tywbtsjrjv/1

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('plant_village', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/plant_village-1.0.2.png" alt="Visualization" width="500px">
c
Techcrunch news dataset
crawlfeeds.com
csv, zip
Updated May 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Techcrunch news dataset [Dataset]. https://crawlfeeds.com/datasets/techcrunch-news-dataset
Explore at:
csv, zipAvailable download formats
Dataset updated
May 16, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Get access to a structured dataset of articles from TechCrunch, a top source for startup, technology, and business news. This dataset includes thousands of articles covering topics like venture funding, product launches, AI, crypto, and more.

Perfect for use in:

News aggregation and monitoring

Sentiment or trend analysis

NLP model training

Startup or tech sector research

The data is available in CSV and JSON formats and can be customized by date or topic on request.

👉 Contact us for full access or a filtered sample.
d
Number of Classes and Class Size by Level
data.gov.sg
Updated Feb 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ministry of Education (2025). Number of Classes and Class Size by Level [Dataset]. https://data.gov.sg/dataset/number-of-classes-and-class-size-by-level
Explore at:
Dataset updated
Feb 19, 2025
Dataset authored and provided by
Ministry of Education
License
https://data.gov.sg/open-data-licencehttps://data.gov.sg/open-data-licence
Time period covered
Jan 1982 - Jan 2024
Description
Dataset from Ministry of Education. For more information, visit https://data.gov.sg/datasets/d_bb5f828263a942a9af869eccf9b0068d/view
R
Coco Limited (person Only) Dataset
universe.roboflow.com
zip
Updated May 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
shreks swamp (2022). Coco Limited (person Only) Dataset [Dataset]. https://universe.roboflow.com/shreks-swamp/coco-dataset-limited--person-only
Explore at:
zipAvailable download formats
Dataset updated
May 31, 2022
Dataset authored and provided by
shreks swamp
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
People Bounding Boxes
Description
COCO Dataset Limited (Person Only)

## Overview COCO Dataset Limited (Person Only) is a dataset for object detection tasks - it contains People annotations for 5,438 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Fruits-360 dataset
kaggle.com
data.mendeley.com
Updated Jun 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mihai Oltean
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

Version: 2025.06.07.0

Content

The following fruits, vegetables and nuts and are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

Branches

The dataset has 5 major branches:

-The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.

-The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.

-The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.

-The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.

-The _3_body_problem_ branch where the Training and Test folders contain different (varieties of) the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.

How to cite

Mihai Oltean, Fruits-360 dataset, 2017-

Dataset properties

For the 100x100 branch

Total number of images: 138704.

Training set size: 103993 images.

Test set size: 34711 images.

Number of classes: 206 (fruits, vegetables, nuts and seeds).

Image size: 100x100 pixels.

For the original-size branch

Total number of images: 58363.

Training set size: 29222 images.

Validation set size: 14614 images

Test set size: 14527 images.

Number of classes: 90 (fruits, vegetables, nuts and seeds).

Image size: various (original, captured, size) pixels.

For the 3-body-problem branch

Total number of images: 47033.

Training set size: 34800 images.

Test set size: 12233 images.

Number of classes: 3 (Apples, Cherries, Tomatoes).

Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

Image size: 100x100 pixels.

For the meta branch

Number of classes: 26 (fruits, vegetables, nuts and seeds).

For the multi branch

Number of images: 150.

Filename format:

For the 100x100 branch

image_index_100.jpg (e.g. 31_100.jpg) or

r_image_index_100.jpg (e.g. r_31_100.jpg) or

r?_image_index_100.jpg (e.g. r2_31_100.jpg)

where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

For the original-size branch

r?_image_index.jpg (e.g. r2_31.jpg)

where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

For the multi branch

The file's name is the concatenation of the names of the fruits inside that picture.

Alternate download

The Fruits-360 dataset can be downloaded from:

Kaggle https://www.kaggle.com/moltean/fruits

GitHub https://github.com/fruits-360

How fruits were filmed

Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

Behind the fruits, we placed a white sheet of paper as a background.

Here i...
CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)
zenodo.org
data.niaid.nih.gov
+2more
bin, zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Farah Zaib Khan; Farah Zaib Khan; Stian Soiland-Reyes; Stian Soiland-Reyes (2020). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42.1
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.17632/xnwncxpw42.1
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Farah Zaib Khan; Farah Zaib Khan; Stian Soiland-Reyes; Stian Soiland-Reyes
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This workflow adapts the approach and parameter settings of Trans-Omics for precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. There are in total five steps in the workflow starting from:

Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.

The Genome BAM file is processed using Picard MarkDuplicates. producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).

SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.

The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.

In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.

For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access data. Chromosome 12 was extracted from the Homo Sapien Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well-documented with a detailed set of instructions of the steps performed to down-sample the data are also provided for transparency. The availability of example input data, use of containerization for underlying software and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation.

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

Steps to reproduce

To build the research object again, use Python 3 on macOS. Built with:

Processor 2.8GHz Intel Core i7

Memory: 16GB

OS: macOS High Sierra, Version 10.13.3

Storage: 250GB

Install cwltool

pip3 install cwltool==1.0.20180912090223

Install git lfs
The data download with the git repository requires the installation of Git lfs:
https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

Get the data and make the analysis environment ready:

git clone https://github.com/FarahZKhan/cwl_workflows.git cd cwl_workflows/ git checkout CWLProvTesting ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh

Run the following commands to create the CWLProv Research Object:

cwltool --provenance rnaseqwf_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256

The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120
h
DBPedia_Classes
huggingface.co
Updated Jun 8, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Willem (2016). DBPedia_Classes [Dataset]. https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 8, 2016
Authors
Willem
License
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
Description
About Dataset DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of the data (after cleaning, kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 wikipedia articles. There are 3 levels, with 9, 70 and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher… See the full description on the dataset page: https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes.
Artificial Intelligence (AI) Training Dataset Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
Explore at:
pptx, csv, pdfAvailable download formats
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Artificial Intelligence (AI) Training Dataset Market Outlook

According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.

One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.

Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.

The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.

From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological innovation, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, fueled by the digital transformation of emerging economies and the proliferation of AI applications across various industry sectors.

Data Type Analysis

The AI training dataset market is segmented by data type into Text, Image/Video, Audio, and Others, each playing a crucial role in powering different AI applications. Text da
d
Manual snow course observations, raw met data, raw snow depth observations,...
catalog.data.gov
Updated Jun 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Climate Adaptation Science Centers (2024). Manual snow course observations, raw met data, raw snow depth observations, locations, and associated metadata for Oregon sites [Dataset]. https://catalog.data.gov/dataset/manual-snow-course-observations-raw-met-data-raw-snow-depth-observations-locations-and-ass
Explore at:
Dataset updated
Jun 15, 2024
Dataset provided by
Climate Adaptation Science Centers
Area covered
Oregon
Description
OSU_SnowCourse Summary: Manual snow course observations were collected over WY 2012-2014 from four paired forest-open sites chosen to span a broad elevation range. Study sites were located in the upper McKenzie (McK) River watershed, approximately 100 km east of Corvallis, Oregon, on the western slope of the Cascade Range and in the Middle Fork Willamette (MFW) watershed, located to the south of the McKenzie. The sites were designated based on elevation, with a range of 1110-1480 m. Distributed snow depth and snow water equivalent (SWE) observations were collected via monthly manual snow courses from 1 November through 1 April and bi-weekly thereafter. Snow courses spanned 500 m of forested terrain and 500 m of adjacent open terrain. Snow depth observations were collected approximately every 10 m and SWE was measured every 100 m along the snow courses with a federal snow sampler. These data are raw observations and have not been quality controlled in any way. Distance along the transect was estimated in the field. OSU_SnowDepth Summary: 10-minute snow depth observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meterological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meterological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meterological stations were located in the approximate center of each forest or open snow course transect. These data have undergone basic quality control. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN flags for missing data to NA, and added site attributes such as site name and cover. We replaced positive values with NA, since snow depth values in raw data are negative (i.e., flipped, with some correction to use the height of the sensor as zero). Thus, positive snow depth values in the raw data equal negative snow depth values. Second, the sign of the data was switched to make them positive. Then, the smooth.m (MATLAB) function was used to roughly smooth the data, with a moving window of 50 points. Third, outliers were removed. All values higher than the smoothed values +10, were replaced with NA. In some cases, further single point outliers were removed. OSU_Met Summary: Raw, 10-minute meteorological observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meterological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meterological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These stations were deployed to collect numerous meteorological variables, of which snow depth and wind speed are included here. These data are raw datalogger output and have not been quality controlled in any way. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN and 7999 flags for missing data to NA, and added site attributes such as site name and cover. OSU_Location Summary: Location Metadata for manual snow course observations and meteorological sensors. These data are compiled from GPS data for which the horizontal accuracy is unknown, and from processed hemispherical photographs. They have not been quality controlled in any way.
p
DCAT-AP API endpoints for data.public.lu
data.public.lu
html, rdf, xlsx
Updated May 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Open Data Lëtzebuerg (2024). DCAT-AP API endpoints for data.public.lu [Dataset]. https://data.public.lu/en/datasets/dcat-ap-api-endpoints-for-data-public-lu/
Explore at:
rdf, html, xlsx(16280)Available download formats
Dataset updated
May 27, 2024
Dataset authored and provided by
Open Data Lëtzebuerg
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data.public.lu provides all its metadata in the DCAT and DCAT-AP formats, i.e. all data about the data stored or referenced on data.public.lu. DCAT (Data Catalog Vocabulary) is a specification designed to facilitate interoperability between data catalogs published on the Web. This specification has been extended via the DCAT-AP (DCAT Application Profile for data portals in Europe) standard, specifically for data portals in Europe. The serialisation of those vocabularies is mainly done in RDF (Resource Description Framework). The implementation of data.public.lu is based on the one of the open source udata platform. This API enables the federation of multiple Data portals together, for example, all the datasets published on data.public.lu are also published on data.europa.eu. The DCAT API from data.public.lu is used by the european data portal to federate its metadata. The DCAT standard is thus very important to guarantee the interoperability between all data portals in Europe. Usage Full catalog You can find here a few examples using the curl command line tool: To get all the metadata from the whole catalog hosted on data.public.lu curl https://data.public.lu/catalog.rdf Metadata for an organization To get the metadata of a specific organization, you need first to find its ID. The ID of an organization is the last part of its URL. For the organization "Open data Lëtzebuerg" its URL is https://data.public.lu/fr/organizations/open-data-letzebuerg/ and its ID is open-data-letzebuerg. To get all the metadata for a given organization, we need to call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/organizations/{id}/catalog.rdf Example: curl https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf Metadata for a dataset To get the metadata of a specific dataset, you need first to find its ID. The ID of dataset is the last part of its URL. For the dataset "Digital accessibility monitoring report - 2020-2021" its URL is https://data.public.lu/fr/datasets/digital-accessibility-monitoring-report-2020-2021/ and its ID is digital-accessibility-monitoring-report-2020-2021. To get all the metadata for a given dataset, we need to call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/datasets/{id}/rdf Example: curl https://data.public.lu/api/1/datasets/digital-accessibility-monitoring-report-2020-2021/rdf Compatibility with DCAT-AP 2.1.1 The DCAT-AP standard is in constant evolution, so the compatibility of the implementation should be regularly compared with the standard and adapted accordingly. In May 2023, we have done this comparison, and the result is available in the resources below (see document named 'udata 6 dcat-ap implementation status"). In the DCAT-AP model, classes and properties have a priority level which should be respected in every implementation: mandatory, recommended and optional. Our goal is to implement all mandatory classes and properties, and if possible implement all recommended classes and properties which make sense in the context of our open data portal.
The Quick, Draw! Dataset
github.com
carrfratagen43.blogspot.com
Updated Mar 1, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google (2017). The Quick, Draw! Dataset [Dataset]. https://github.com/googlecreativelab/quickdraw-dataset
Explore at:
Dataset updated
Mar 1, 2017
Dataset provided by
Googlehttp://google.com/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg" alt="preview">
w
Immigration system statistics data tables
gov.uk
Updated May 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Home Office (2025). Immigration system statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/immigration-system-statistics-data-tables
Explore at:
Dataset updated
May 22, 2025
Dataset provided by
GOV.UK
Authors
Home Office
Description
List of the data tables as part of the Immigration System Statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.

If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.

Accessible file formats

The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.

Related content

Immigration system statistics, year ending March 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives

Passenger arrivals

https://assets.publishing.service.gov.uk/media/68258d71aa3556876875ec80/passenger-arrivals-summary-mar-2025-tables.xlsx">Passenger arrivals summary tables, year ending March 2025 (MS Excel Spreadsheet, 66.5 KB)

‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.

Electronic travel authorisation

https://assets.publishing.service.gov.uk/media/681e406753add7d476d8187f/electronic-travel-authorisation-datasets-mar-2025.xlsx">Electronic travel authorisation detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 56.7 KB)
ETA_D01: Applications for electronic travel authorisations, by nationality ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality

Entry clearance visas granted outside the UK

https://assets.publishing.service.gov.uk/media/68247953b296b83ad5262ed7/visas-summary-mar-2025-tables.xlsx">Entry clearance visas summary tables, year ending March 2025 (MS Excel Spreadsheet, 113 KB)

https://assets.publishing.service.gov.uk/media/682c4241010c5c28d1c7e820/entry-clearance-visa-outcomes-datasets-mar-2025.xlsx">Entry clearance visa applications and outcomes detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 29.1 MB)
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome

Additional d
d
Replication Data for: \"A Topic-based Segmentation Model for Identifying...
search.dataone.org
dataverse.harvard.edu
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets. First, we provide a R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provided a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file. [A guide on how to use the code to reproduce each study in the paper] 1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes 3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study. 3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours. 4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study. 4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. you will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours. [Guidelines for running Benchmark models in Table 6] Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics(e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels'package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/). Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction. Aggregate regression: 'lm' default function in R. Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment. Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo(2012)'s package. Run the Kim et al's model (2012) with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home 5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours. [A list of the versions of R, packages, and computer...
Classification results for: Hellinger Distance Trees for Imbalanced Streams
figshare.com
application/gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robert Lyon (2023). Classification results for: Hellinger Distance Trees for Imbalanced Streams [Dataset]. http://doi.org/10.6084/m9.figshare.1534549.v1
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1534549.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Robert Lyon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J.M. Brooke, J.D. Knowles, B.W Stappers, 22nd International Conference on Pattern Recognition (ICPR), p.1969 - 1974, 2014. DOI: 10.1109/ICPR.2014.344 Contained in this distribution are results of stream classifier perfromance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper, Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008. The data sets used for these experiments include, MAGIC Gamma Telescope Data Set : https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+TelescopeMiniBooNE particle identification Data Set : https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identificationSkin Segmentation Data Set : https://archive.ics.uci.edu/ml/datasets/Skin+SegmentationLetter Recognition Data Set : https://archive.ics.uci.edu/ml/datasets/Letter+RecognitionPen-Based Recognition of Handwritten Digits Data Set : https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+DigitsStatlog (Landsat Satellite) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)Statlog (Image Segmentation) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation) A further data set used is not publicly available at present. However we are in the process of releasing it for public use. Please get in touch if you'd like to use it.

A readme file accompanies the data describing it in more detail.
BTS: Building Timeseries Dataset: Raw
figshare.com
csv
Updated Apr 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arian Prabowo; Xiachong Lin; Imran Razzak; Hao Xue; Emily Wern Jien Yap; Matthew Amos; Flora D. Salim (2025). BTS: Building Timeseries Dataset: Raw [Dataset]. http://doi.org/10.6084/m9.figshare.28705559.v3
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28705559.v3
Dataset updated
Apr 3, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Arian Prabowo; Xiachong Lin; Imran Razzak; Hao Xue; Emily Wern Jien Yap; Matthew Amos; Flora D. Salim
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique Brick classes. Moreover, the metadata is standardized using the Brick schema.To get started, download the data and run the DIEF_inspect_raw.ipynb file.For more info, including data cards: https://github.com/cruiseresearchgroup/DIEF_BTS
Vehicle licensing statistics data files
gov.uk
s3.amazonaws.com
Updated Jun 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department for Transport (2025). Vehicle licensing statistics data files [Dataset]. https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-files
Explore at:
Dataset updated
Jun 11, 2025
Dataset provided by
GOV.UKhttp://gov.uk/
Authors
Department for Transport
Description
Data tables containing aggregated information about vehicles in the UK are also available.

Recent changes

A number of changes were introduced to these data files in the 2022 release to help meet the needs of our users and to provide more detail.

Fuel type has been added to:

df_VEH0120_GB

df_VEH0120_UK

df_VEH0160_GB

df_VEH0160_UK

Historic UK data has been added to:

df_VEH0124 (now split into 2 files)

df_VEH0220

df_VEH0270

A new datafile has been added df_VEH0520.

We welcome any feedback on the structure of our data files, their usability, or any suggestions for improvements; please contact vehicles statistics.

How to use CSV files

CSV files can be used either as a spreadsheet (using Microsoft Excel or similar spreadsheet packages) or digitally using software packages and languages (for example, R or Python).

When using as a spreadsheet, there will be no formatting, but the file can still be explored like our publication tables. Due to their size, older software might not be able to open the entire file.

Download data files

Make and model by quarter

df_VEH0120_GB: https://assets.publishing.service.gov.uk/media/68494aca74fe8fe0cbb4676c/df_VEH0120_GB.csv">Vehicles at the end of the quarter by licence status, body type, make, generic model and model: Great Britain (CSV, 58.1 MB)

Scope: All registered vehicles in Great Britain; from 1994 Quarter 4 (end December)

Schema: BodyType, Make, GenModel, Model, Fuel, LicenceStatus, [number of vehicles; 1 column per quarter]

df_VEH0120_UK: https://assets.publishing.service.gov.uk/media/68494acb782e42a839d3a3ac/df_VEH0120_UK.csv">Vehicles at the end of the quarter by licence status, body type, make, generic model and model: United Kingdom (CSV, 34.1 MB)

Scope: All registered vehicles in the United Kingdom; from 2014 Quarter 3 (end September)

Schema: BodyType, Make, GenModel, Model, Fuel, LicenceStatus, [number of vehicles; 1 column per quarter]

df_VEH0160_GB: https://assets.publishing.service.gov.uk/media/68494ad774fe8fe0cbb4676d/df_VEH0160_GB.csv">Vehicles registered for the first time by body type, make, generic model and model: Great Britain (CSV, 24.8 MB)

Scope: All vehicles registered for the first time in Great Britain; from 2001 Quarter 1 (January to March)

Schema: BodyType, Make, GenModel, Model, Fuel, [number of vehicles; 1 column per quarter]

df_VEH0160_UK: https://assets.publishing.service.gov.uk/media/68494ad7aae47e0d6c06e078/df_VEH0160_UK.csv">Vehicles registered for the first time by body type, make, generic model and model: United Kingdom (CSV, 8.26 MB)

Scope: All vehicles registered for the first time in the United Kingdom; from 2014 Quarter 3 (July to September)

Schema: BodyType, Make, GenModel, Model, Fuel, [number of vehicles; 1 column per quarter]

Make and model by age

In order to keep the datafile df_VEH0124 to a reasonable size, it has been split into 2 halves; 1 covering makes starting with A to M, and the other covering makes starting with N to Z.

df_VEH0124_AM: <a class="govuk-link" href="https://assets.
Z
Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search...
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Haque, Mohammad Nazmul (2020). Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search using Genetic Algorithm for Class Imbalanced Data Classification" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_33539
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Haque, Mohammad Nazmul
Noman, Nasimul
Berratta, Regina
Moscato, Pablo
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Faces Dataset: PubFig05

This is a subset of the ''PubFig83'' dataset [1] which provides 100 images each of 5 most difficult celebrities to recognise (referred as class in the classification problem). For each celebrity persons, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:

Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba

Feature Extraction

To extract features from images, we have applied the HT-L3-model as described in [2] and obtained 25600 features.

Feature Selection

Details about feature selection followed in brief as follows:

Entropy Filtering: First we apply an implementation of Fayyad and Irani's [3] entropy base heuristic to discretise the dataset and discarded features using the minimum description length (MDL) principle and only 4878 passed this entropy based filtering method.

Class-Distribution Balancing: Next, we have converted the dataset to binary-class problem by separating into 5 binary-class datasets using one-vs-all setup. Hence, these datasets became imbalanced at a ratio of 1:4. Then we converted them into balanced binary-class datasets using random sub-sampled method. Further processing of the dataset has been described in the paper.

(alpha,beta)-k Feature selection: To get a good feature set for training the classifier, we select the features using the approach based on the (alpha,beta)-k feature selection [4] problem. It selects a minimum subset of features that maximise both within class similarity and dissimilarity in different classes. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into binary class dataset.

UAB: We applied (alpha,beta)-k feature set method on each of the balanced binary-class datasets and we took the union of selected features for each binary-class datasets. Finally, we applied the (alpha,beta)-k feature set selection method on each of the binary-class datasets and get a set of features.

IAB: We applied (alpha,beta)-k feature set method on each of the balanced binary-class datasets and we took the intersection of selected features for each binary-class datasets. Finally, we applied the (alpha,beta)-k feature set selection method on each of the binary-class datasets and get a set of features.

UEAB: We applied (alpha,beta)-k feature set method on each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature set selection method on each of the balanced binary-class datasets. Finally, we took the union of selected features for each balanced binary-class datasets and get a set of features.

All of these datasets are inside the compressed folder. It also contains the document describing the process detail.

References

[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).

[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).

[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).

[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.

Facebook

Twitter

Click to copy link

Link copied

Cite

My.Harvard Support (2023). my.Harvard Operational Data Store [Dataset]. http://doi.org/10.7910/DVN/VX5Y9G

my.Harvard Operational Data Store

Explore at:

Unique identifier

https://doi.org/10.7910/DVN/VX5Y9G

Dataset updated

Nov 21, 2023

Dataset provided by

Harvard Dataverse

Authors

My.Harvard Support

Description

This entry provides access to the data elements available in the Operational Data Store (ODS) for my.Harvard Student Information System. These data are available through a request process. What are the goals of the Operational Data Store? Provide data in a more real-time environment than the Warehouse (refresh 1x a day) while not putting additional load on the transactional my.harvard system. Provide a single (university-wide) standard set of exports and then web-services for retrieving key Student data. Provide the ability to incrementally load the SIS Data warehouse star schemas, making it possible to refresh certain stars more than once a day. Provide Institutional Research and Registrar power-users the ability to investigate the Student data via direct SQL access. What is the SIS Operational Data Store (SIS ODS)? A database schema on the SIS Datawarehouse that will contain replicated core tables of the my.harvard transactional system along with standardized, simplified and performant views for extracting that data. We intend to make most data available through web services before the end of academic year 2015-2016. However, our first iteration will to be make data available via db views. The refresh schedule for the SIS ODS tables for this first release will be: Academic Class Data - 1x a day between 5:30am and 6:00am. What data will be available in the SIS ODS? ODS - Academic Class v SISODS_1.0.6.xlsx follow link to get to older versions ODS - Bio Demo v SISODS_1.0.5.xlsx follow link to get to older versions ODS - Class Enrollment.xlsx ODS - Student Career Program Plan v SISODS_1.0.6.xlsx ODS - Admissions v. SISODS_1.0.7 Document coming Snapshots - non-FAS. For FAS Snapshots, please contact Harvard College Institutional Research. How can I request access to the SIS ODS? Send an email to myharvard_support@harvard.edu to request access Please indicate what data you want to access through the ODS: School & Component Available components: Academic Class (course descriptors). Biographic - Demographic Class Enrollment Student Career Program Plan Please indicate whether the request is for a personal account or for an application integration account. For personal accounts, please provide the HUIDs of the individuals to be set up. How do I connect to the SIS ODS? SIS ODS connections are currently limited to ODBC/JDBC connections to a database. The attached instructions explain how to install SQL Developer and configure a connection.

Clear search

Close search

Google apps

Main menu

my.Harvard Operational Data Store

mnist

Level Crossing Warning Bell (LCWB) Dataset

plant_village

Techcrunch news dataset

Number of Classes and Class Size by Level

Coco Limited (person Only) Dataset

COCO Dataset Limited (Person Only)

Fruits-360 dataset

Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

Version: 2025.06.07.0

Content

Branches

How to cite

Dataset properties

For the 100x100 branch

For the original-size branch

For the 3-body-problem branch

For the meta branch

For the multi branch

Filename format:

For the 100x100 branch

For the original-size branch

For the multi branch

Alternate download

How fruits were filmed

CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

DBPedia_Classes

Artificial Intelligence (AI) Training Dataset Market Research Report 2033

Artificial Intelligence (AI) Training Dataset Market Outlook

Data Type Analysis

Manual snow course observations, raw met data, raw snow depth observations,...

DCAT-AP API endpoints for data.public.lu

The Quick, Draw! Dataset

Immigration system statistics data tables

Accessible file formats

Related content

Passenger arrivals

Electronic travel authorisation

Entry clearance visas granted outside the UK

Replication Data for: \"A Topic-based Segmentation Model for Identifying...

Classification results for: Hellinger Distance Trees for Imbalanced Streams

BTS: Building Timeseries Dataset: Raw

Vehicle licensing statistics data files

Recent changes

How to use CSV files

Download data files

Make and model by quarter

Make and model by age

Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search...

my.Harvard Operational Data Store