100+ datasets found

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
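
As a rough illustration of how the label counts in a file like NICHE.csv could be tallied, here is a minimal sketch. The column names (`project`, `label`, `stars`) and the three sample rows are hypothetical placeholders, not the actual schema or contents of the file; only the totals 441 and 131 come from the description above.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text, label_column="label"):
    """Count how many rows carry each label value.

    `label_column` is an assumed column name; check the real header first.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column].strip() for row in reader)


# Made-up rows for illustration only (not taken from NICHE.csv).
sample = """project,label,stars
owner-a/repo-a,engineered,120
owner-b/repo-b,non-engineered,15
owner-c/repo-c,engineered,310
"""

counts = count_labels(sample)

# Share of engineered projects, using the totals stated in the description.
engineered_share = 441 / (441 + 131)
```

With the stated totals, roughly 77% of the labelled projects are engineered, which is worth keeping in mind when the dataset is used as a classifier benchmark.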
Public datasets for GBSR
ieee-dataport.org
Updated Jun 11, 2024
Cite
JianHua Peng (2024). Public datasets for GBSR [Dataset]. https://ieee-dataport.org/documents/public-datasets-gbsr
Dataset updated
Jun 11, 2024
Authors
JianHua Peng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Yale and ORL
Machine learning model that estimates total monthly and annual per capita...
gimi9.com
data.usgs.gov
+4 more
Updated Aug 29, 2024
Cite
(2024). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. https://gimi9.com/dataset/data-gov_machine-learning-model-that-estimates-total-monthly-and-annual-per-capita-public-supply-wi/
Dataset updated
Aug 29, 2024
Description
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public supply water use for the period 2000-2020. This data release contains model input feature datasets, Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public supply water use estimates and statistics for WSAs are available in public_water_use_model.zip.

This page includes the following files:
PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day
Note: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply total water use from 2000-2020
STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply groundwater use for 2000-2020
STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply surface water use for 2000-2020
public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public supply water use machine learning model
version_history_MLmodel.txt - a txt file describing changes in this version
SYNERGY - Open machine learning dataset on study selection in systematic...
b2find.eudat.eu
Updated Jul 21, 2024
Cite
(2024). SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1bea4d3c-ceef-5f63-89ed-80aeab18f601
Dataset updated
Jul 21, 2024
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Because many variables are available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest and recommended way to get and work with the SYNERGY dataset is via the "synergy-dataset" Python package, a flexible package that downloads and builds the dataset. See https://github.com/asreview/synergy-dataset for all information.
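
The sparsity figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two counts stated in the description; the `balanced_class_weights` helper is a hypothetical illustration of a common weighting heuristic (total / (number of classes x class count)) and is not part of the synergy-dataset package.

```python
TOTAL_RECORDS = 169_288    # academic works, as stated in the description
INCLUDED_RECORDS = 2_834   # works included in the systematic reviews

# Fraction of records with a positive (included) label.
inclusion_rate = INCLUDED_RECORDS / TOTAL_RECORDS

# How many excluded records there are for each included one.
excluded_per_included = (TOTAL_RECORDS - INCLUDED_RECORDS) / INCLUDED_RECORDS


def balanced_class_weights(n_positive, n_total):
    """Weights inversely proportional to class frequency (two classes)."""
    n_negative = n_total - n_positive
    return {
        1: n_total / (2 * n_positive),
        0: n_total / (2 * n_negative),
    }


weights = balanced_class_weights(INCLUDED_RECORDS, TOTAL_RECORDS)
```

The positive class ends up with a much larger weight than the negative class, which is one simple way to account for sparse labels when training a classifier on data like this.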
Composed Encrypted Malicious Traffic Dataset for machine learning based...
data.mendeley.com
Updated Oct 12, 2021
Cite
Zihao Wang (2021). Composed Encrypted Malicious Traffic Dataset for machine learning based encrypted malicious traffic analysis. [Dataset]. http://doi.org/10.17632/ztyk4h3v6s.2
Unique identifier
https://doi.org/10.17632/ztyk4h3v6s.2
Dataset updated
Oct 12, 2021
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a traffic dataset that contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. The dataset is secondary csv feature data composed from five public traffic datasets. Our dataset is composed based on three criteria. The first criterion is to combine widely considered public datasets which contain both encrypted malicious and legitimate traffic in existing works, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., a balance of malicious and legitimate network traffic and a similar amount of network traffic contributed by each individual dataset. Thus, approximately equal proportions of malicious and legitimate traffic are extracted from each selected public dataset by random sampling. We also ensured that no selected public dataset contributes a much larger amount of traffic than the others. The third criterion is that our dataset includes encrypted malicious and legitimate traffic from both conventional devices and IoT devices, as these devices are increasingly being deployed and are working in the same environments such as offices, homes, and other smart city settings.

Based on the criteria, 5 public datasets are selected. After data pre-processing, details of each selected public dataset and the final composed dataset are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size we selected from each public dataset, the proportion of selected traffic from each public dataset with respect to the total traffic size of the composed dataset (% w.r.t. the composed dataset), the proportion of selected encrypted traffic from each public dataset (% of selected public dataset), and the total traffic size of the composed dataset. From the table, we are able to observe that each public dataset contributes approximately 20% of the composed dataset, except for CICIDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves a balance across individual datasets and reduces bias towards traffic belonging to any dataset during learning. We can also observe that the sizes of malicious and legitimate traffic are almost the same, thus achieving class balance. The datasets now made available were prepared for encrypted malicious traffic detection. Since the dataset is used for machine learning model training, sample train and test sets are also provided. The train and test datasets are split at a 1:4 ratio, and stratification is applied during the data split. Such datasets can be used directly for machine or deep learning model training based on selected features.
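
To make the stratified split described above concrete, here is a minimal standard-library sketch. It assumes the 1:4 ratio means one part test to four parts train (a 20% test fraction); the description does not say which side is which, so treat that as an assumption. The function name, the seed, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class is split at test_fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random sampling within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, perfectly balanced labels standing in for the real traffic records.
labels = ["malicious"] * 50 + ["legitimate"] * 50
train_idx, test_idx = stratified_split(labels)
```

Because the split is done per class, the malicious-to-legitimate ratio in the test set matches the ratio in the full data, which preserves the class balance the authors aimed for.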
Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...
catalog.data.gov
data.nist.gov
+2 more
Updated Jul 29, 2022
Cite
National Institute of Standards and Technology (2022). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. https://catalog.data.gov/dataset/dataset-an-open-combinatorial-diffraction-dataset-including-consensus-human-and-machine-le-0de06
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
AI Training Dataset Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Cite
Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
Available download formats: csv, pptx, pdf
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI Training Dataset Market Outlook

The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
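
The growth figures above can be sanity-checked with the standard compound annual growth rate formula. This is only a sketch: it assumes nine compounding periods (2023 base value to 2032), which the report does not state explicitly, and the function names are illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` for `years` periods."""
    return start_value * (1 + rate) ** years


YEARS = 2032 - 2023  # assumed number of compounding periods

# Rate implied by USD 1.2 billion growing to USD 6.5 billion.
implied_rate = cagr(1.2, 6.5, YEARS)

# Value reached if USD 1.2 billion compounds at the stated 20.5%.
projected_value = project(1.2, 0.205, YEARS)
```

Under that assumption the implied rate comes out at a little over 20%, and compounding at 20.5% lands slightly below USD 6.5 billion, so the stated figures are consistent to within rounding.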

One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.

Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.

The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.

As the demand for AI applications continues to grow, the role of AI data resource services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging such services, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.

Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.

Data Type Analysis

The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.

Image data is critical for computer vision application
Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...
data.mendeley.com
Updated Dec 6, 2022
Cite
Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
Unique identifier
https://doi.org/10.17632/xw7r4tt54g.1
Dataset updated
Dec 6, 2022
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains balanced amounts of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is secondary csv feature data composed from six public traffic datasets.

Our dataset is curated based on two criteria: The first criterion is to combine widely considered public datasets which contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as Malware Capture Facility Project datasets. The second criterion is to ensure the final dataset balance of encrypted malicious and legitimate network traffic.

Based on the criteria, 6 public datasets are selected. After data pre-processing, details of each selected public dataset and the size of different encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we are able to observe that encrypted malicious and legitimate traffic equally contributes to approximately 50% of the final composed dataset.

The datasets now made available were prepared for encrypted malicious traffic detection. Since the dataset is used for machine learning or deep learning model training, sample train and test sets are also provided. The train and test datasets are split at a 1:4 ratio. Such datasets can be used for machine learning or deep learning model training and testing based on selected features or after further data pre-processing.
A Dataset for Machine Learning Algorithm Development
fisheries.noaa.gov
catalog.data.gov
Updated Jan 1, 2021
Cite
Alaska Fisheries Science Center (AFSC) (2021). A Dataset for Machine Learning Algorithm Development [Dataset]. https://www.fisheries.noaa.gov/inport/item/63322
Dataset updated
Jan 1, 2021
Dataset provided by
Alaska Fisheries Science Center
Authors
Alaska Fisheries Science Center (AFSC)
Area covered
Chukchi Sea, Alaska, Beaufort Sea, Kotzebue Sound
Description
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
Detection of Areas with Human Vulnerability Using Public Satellite Images...
zenodo.org
zip
Updated Sep 16, 2024
Cite
Flavio de Barros Vidal; Flavio de Barros Vidal (2024). Detection of Areas with Human Vulnerability Using Public Satellite Images and Deep Learning (Dataset) [Dataset]. http://doi.org/10.5281/zenodo.13768463
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13768463
Dataset updated
Sep 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Flavio de Barros Vidal; Flavio de Barros Vidal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 1, 2023
Description
Overview

This repository contains the code and resources for the project titled "Detection of Areas with Human Vulnerability Using Public Satellite Images and Deep Learning". The goal of this project is to identify regions where individuals are living under precarious conditions and facing neglected basic needs, a situation often seen in Brazil. This concept is referred to as "human vulnerability" and is exemplified by families living in inadequate shelters or on the streets in both urban and rural areas.

Focusing on the Federal District of Brazil as the research area, this project aims to develop two novel public datasets consisting of satellite images. The datasets contain imagery captured at 50m and 100m scales, covering regions of human vulnerability, traditional areas, and improperly disposed waste sites.

The project also leverages these datasets for training deep learning models, including YOLOv7 and other state-of-the-art models, to perform image segmentation. A comparative analysis is conducted between the models using two training strategies: training from scratch with random weight initialization and fine-tuning using pre-trained weights through transfer learning.

Key Achievements

Two new satellite image datasets focusing on human vulnerability and improperly disposed waste sites, available in public domains.

Comparison of image segmentation models, including YOLOv7 and Segmentation Models, with performance metrics.

Best F1-scores: 0.55 for YOLOv7 and 0.64 for Segmentation Models.

This repository provides the code, models, and data pipelines used for training, evaluation, and performance comparison of these deep learning models.
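
For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall. The sketch below shows the usual computation from true positive, false positive, and false negative counts. The counts in the example are invented for illustration, and whether the project computes F1 per pixel, per object, or per image is not stated here.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: precision and recall are both 0.64 here.
example_f1 = f1_score(true_pos=64, false_pos=36, false_neg=36)
```

When precision and recall are equal, the F1-score equals that shared value, which is why the invented counts above give 0.64.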

Citation (Bibtex)

@TECHREPORT {TechReport-Julia-Laura-HumanVulnerability-2024,
  author = "Julia Passos Pontes, Laura Maciel Neves Franco, Flavio De Barros Vidal",
  title = "Detecção de Áreas com Atividades de Vulnerabilidade Humana utilizando Imagens Públicas de Satélites e Aprendizagem Profunda",
  institution = "University of Brasilia",
  year = "2024",
  type = "Undergraduate Thesis",
  address = "Computer Science Department - University of Brasilia - Asa Norte - Brasilia - DF, Brazil",
  month = "aug",
  note = "People living in precarious conditions and with their basic needs neglected is an unfortunate reality in Brazil. This scenario will be approached in this work according to the concept of \"human vulnerability\" and can be exemplified through families who live in inadequate shelters, without basic structures and on the streets of urban or rural centers. Therefore, assuming the Federal District as the research scope, this project proposes to develop two new databases to be made available publicly, considering the map scales of 50m and 100m, and composed by satellite images of human vulnerability areas, regions treated as traditional and waste disposed inadequately. Furthermore, using these image bases, trainings were done with the YOLOv7 model and other deep learning models for image segmentation. By adopting an exploratory approach, this work compares the results of different image segmentation models and training strategies, using random weight initialization (from scratch) and pre-trained weights (transfer learning). Thus, the present work was able to reach maximum F1 score values of 0.55 for YOLOv7 and 0.64 for other segmentation models."
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
Network Traffic Dataset
kaggle.com
Updated Oct 31, 2023
Cite
Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
Dataset updated
Oct 31, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ravikumar Gattu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023, using Wireshark. The dataset consists of 394,137 instances stored in a CSV (Comma Separated Values) file. This large dataset could be used for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

Content:

This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.

The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

Dataset Columns:

No: Number of the instance.
Timestamp: Timestamp of the instance of network traffic.
Source IP: IP address of the source.
Destination IP: IP address of the destination.
Protocol: Protocol used by the instance.
Length: Length of the instance.
Info: Information about the traffic instance.
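
A capture exported from Wireshark in this shape can be summarised with the standard library alone. The sketch below is hedged: the header row follows the Wireshark field names quoted above (No., Time, Source, Destination, Protocol, Length, Info), which may differ from the headers in the published CSV, and the three sample rows use documentation-range addresses and are invented for illustration.

```python
import csv
import io
from collections import Counter


def summarise_capture(csv_text, protocol_column="Protocol", length_column="Length"):
    """Return (packets per protocol, total bytes) for a packet-list CSV.

    The column names are assumptions; adjust them to the real header.
    """
    protocol_counts = Counter()
    total_length = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        protocol_counts[row[protocol_column]] += 1
        total_length += int(row[length_column])
    return protocol_counts, total_length


# Invented rows for illustration only (not taken from the dataset).
sample = """No.,Time,Source,Destination,Protocol,Length,Info
1,0.000000,192.0.2.5,192.0.2.1,DNS,74,Standard query
2,0.000421,192.0.2.1,192.0.2.5,DNS,90,Standard query response
3,0.010233,192.0.2.5,198.51.100.10,TCP,66,Handshake segment
"""

protocol_counts, total_length = summarise_capture(sample)
```

For the full 394,137-row file the same function could be fed the file contents, or adapted to read from an open file handle to avoid loading everything into one string.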

Acknowledgements:

I would like to thank the University of Cincinnati for providing the infrastructure used to generate the network traffic dataset.

Ravikumar Gattu , Susmitha Choppadandi

Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports building machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).

Dataset License: CC0: Public Domain

Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

ML techniques that benefit from this dataset:

This dataset is highly useful because it consists of 394,137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:

1. Network Performance Monitoring: This large network traffic dataset can be used to analyse network traffic and identify patterns in the network. This helps in designing network security algorithms that minimise network problems.

2. Anomaly Detection: A large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

3. Network Intrusion Detection: This large dataset could be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
CANDID-II Dataset
figshare.com
png
Updated Jun 27, 2025
Cite
Sijing Feng (2025). CANDID-II Dataset [Dataset]. http://doi.org/10.17608/k6.auckland.19606921.v2
Available download formats: png
Unique identifier
https://doi.org/10.17608/k6.auckland.19606921.v2
Dataset updated
Jun 27, 2025
Dataset provided by
figshare
Authors
Sijing Feng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset of 53,054 anonymized adult chest X-rays in 1024 x 1024 pixel DICOM format with corresponding anonymized free-text reports from Dunedin Hospital, New Zealand, collected between 2010 and 2020. The corresponding radiology reports, generated by FRANZCR radiologists, were manually annotated for 46 common radiological findings mapped to the Unified Medical Language System (UMLS) and RadLex ontology. Each of the multi-class annotations takes one of 4 label types: positive, uncertain, negative, and not mentioned. In the provided dataset, image filenames contain a patient index (enabling analyses that require grouping images by patient) as well as anonymized date-of-acquisition information that preserves the temporal relationship between images. This dataset can be used for training and testing deep learning algorithms for adult chest X-rays. Unfortunately, since Feb 2024, the New Zealand government has been changing the data governance of datasets used for AI development, and this affects how the CANDID II dataset can be accessed by external users. Therefore, the CANDID II dataset is not available for access by users outside Health New Zealand. Further notice will be posted here should access by external users be reopened.
Machine learning model that estimates public-supply deliveries for domestic...
s.cnmilf.com
data.usgs.gov
+3 more
Updated Aug 29, 2024
Cite
U.S. Geological Survey (2024). Machine learning model that estimates public-supply deliveries for domestic and other use types [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/machine-learning-model-that-estimates-public-supply-deliveries-for-domestic-and-other-use-
Dataset updated
Aug 29, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This child item describes a public-supply delivery machine learning model that was developed to estimate public-supply deliveries. Publicly supplied water may be delivered to domestic users or to commercial, industrial, institutional, and irrigation (CII) users. This model predicts total, domestic, and CII per capita rates for public-supply water service areas within the conterminous United States for 2009-2020. This child item contains model input datasets, code used to build the delivery machine learning model, and national predictions. This dataset is part of a larger data release using machine learning to predict public-supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: delivery_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the delivery water use machine learning model

SMDG, A Standardized Fundus Glaucoma Dataset

kaggle.com

Updated Apr 23, 2023

Cite

Riley Kiefer (2023). SMDG, A Standardized Fundus Glaucoma Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/2329670

Available in Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/ds/2329670

Dataset updated

Apr 23, 2023

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Riley Kiefer

Description

Standardized Multi-Channel Dataset for Glaucoma (SMDG-19), a standardization of 19 public glaucoma datasets for AI applications.

Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets, comprising full-fundus glaucoma images, associated image metadata such as optic disc segmentation, optic cup segmentation, and blood vessel segmentation, and any provided per-instance text metadata such as sex and age. This dataset is designed to be exploratory and open-ended, with multiple use cases and no established training/validation/test splits. This dataset is the largest public repository of fundus images with glaucoma.

Citation

Please cite at least the first work in academic publications:
1. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
2. R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
3. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
4. R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.

Please also see the following optometry abstract publications:
1. A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection; AAO 2022; https://aaopt.org/past-meeting-abstract-archives/?SortBy=ArticleYear&ArticleType=&ArticleYear=2022&Title=&Abstract=&Authors=&Affiliation=&PROGRAMNUMBER=225129
2. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2790420
3. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2791017

Please also see the DOI citations for this and related datasets:
1. SMDG
@dataset{smdg,
  title={SMDG, A Standardized Fundus Glaucoma Dataset},
  url={https://www.kaggle.com/ds/2329670},
  DOI={10.34740/KAGGLE/DS/2329670},
  publisher={Kaggle},
  author={Riley Kiefer},
  year={2023}
}
2. EyePACS-light-v1
@dataset{eyepacs-light-v1,
  title={Glaucoma Dataset: EyePACS AIROGS - Light},
  url={https://www.kaggle.com/ds/3222646},
  DOI={10.34740/KAGGLE/DS/3222646},
  publisher={Kaggle},
  author={Riley Kiefer},
  year={2023}
}
3. EyePACS-light-v2
@dataset{eyepacs-light-v2,
  title={Glaucoma Dataset: EyePACS-AIROGS-light-V2},
  url={https://www.kaggle.com/dsv/7300206},
  DOI={10.34740/KAGGLE/DSV/7300206},
  publisher={Kaggle},
  author={Riley Kiefer},
  year={2023}
}

Dataset Objective

The objective of this dataset is to provide a machine-learning-ready dataset for glaucoma-related applications. With the help of the community, new open-source glaucoma datasets will be reviewed for standardization and inclusion in this dataset.

Data Standardization

Full fundus images (and corresponding segmentation maps) are standardized using a novel algorithm (Citation 1) by cropping the background, centering the fundus image, padding missing information, and resizing to 512x512 pixels. This standardization ensures that as much foreground information as possible is retained during resizing, producing machine-learning-ready images.
Each available metadata text is standardized by providing each fundus image as a row and each fundus attribute as a column in a CSV file.
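
The crop, center, pad, and resize steps described above can be sketched roughly as follows. This is not the authors' algorithm: their method uses a dynamic global foreground threshold, whereas this sketch assumes a fixed intensity threshold, a grayscale image stored as a list of rows, and nearest-neighbour resizing. The function name `standardize` and the parameter `background_threshold` are hypothetical.

```python
def standardize(image, size=512, background_threshold=10):
    """Crop background, center on a square canvas, and resize a grayscale image.

    `image` is a list of rows of pixel intensities (0-255). The fixed
    `background_threshold` is an assumption made for illustration only.
    """
    # 1. Bounding box of foreground pixels (intensity above the threshold).
    rows = [r for r, row in enumerate(image)
            if any(p > background_threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > background_threshold for row in image)]
    if not rows or not cols:
        raise ValueError("no foreground pixels found")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    cropped = [row[left:right + 1] for row in image[top:bottom + 1]]

    # 2. Pad with zeros to a square so the cropped region stays centered.
    height, width = len(cropped), len(cropped[0])
    side = max(height, width)
    pad_top, pad_left = (side - height) // 2, (side - width) // 2
    square = [[0] * side for _ in range(side)]
    for r in range(height):
        square[pad_top + r][pad_left:pad_left + width] = cropped[r]

    # 3. Nearest-neighbour resize to size x size.
    return [[square[r * side // size][c * side // size] for c in range(size)]
            for r in range(size)]
```

Applying the same crop box and padding to a segmentation map would keep it aligned with its fundus image.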

Dataset Instance	Original Fundus	Standardized Fundus Image
sjchoi86-HRF	https://user-images.githubusercontent.com/65875562/204170005-2d4dd051-0032-40c8-ba0b-390b6080bb69.png	https://user-images.githubusercontent.com/65875562/204170011-51b7d001-4d43-4f0d-835e-984d45116b18.png
BEH	https://user-images.githubusercontent.com/65875562/211052753-93f8a3aa-cc65-4790-8da6-229f512a6afb.PNG	<img src="htt...

Data for: A Realistic and Public Dataset with Rare Undesirable Real Events...
data.mendeley.com
Updated Jul 15, 2019
Cite
Ricardo Vargas (2019). Data for: A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells [Dataset]. http://doi.org/10.17632/r7774rwc7v.1
Unique identifier
https://doi.org/10.17632/r7774rwc7v.1
Dataset updated
Jul 15, 2019
Authors
Ricardo Vargas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the first realistic and public dataset with rare undesirable real events in oil wells as far as the authors of this work know. It can be used in development of several kinds of techniques and methods for different tasks associated with undesirable events in oil and gas wells.
PDMX
cseweb.ucsd.edu
json
Cite
UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
UniToBrain Dataset
zenodo.org
ieee-dataport.org
+2 more
bin, csv, pdf
Updated Jul 19, 2024
Cite
Umberto Gava; Federico D'Agata; Edwin Bennink; Enzo Tartaglione; Annamaria Vernone; Francesca Bertolino; Eleonora Ficiarà; Alessandro Cicerale; Fabrizio Pizzagalli; Caterina Guiot; Marco Grangetto; Mauro Bergui (2024). UniToBrain Dataset [Dataset]. http://doi.org/10.5281/zenodo.4817605
Available download formats: pdf, csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.4817605
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Umberto Gava; Federico D'Agata; Edwin Bennink; Enzo Tartaglione; Annamaria Vernone; Francesca Bertolino; Eleonora Ficiarà; Alessandro Cicerale; Fabrizio Pizzagalli; Caterina Guiot; Marco Grangetto; Mauro Bergui
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The University of Turin (UniTO) released the open-access dataset UniToBrain, collected for the homonymous Use Case 3 in the DeepHealth project (https://deephealth-project.eu/). UniToBrain is a dataset of Computed Tomography (CT) perfusion images (CTP). The dataset includes 100 training subjects and 15 testing subjects used in a submitted publication for the training and testing of a Convolutional Neural Network (CNN; for details see https://arxiv.org/abs/2101.05992, https://paperswithcode.com/paper/neural-network-derived-perfusion-maps-a-model, https://www.medrxiv.org/content/10.1101/2021.01.13.21249757v1). At this stage, the UniTO team has released this dataset privately, but it will soon be public. This is a subsample of a larger dataset of 258 subjects that will soon be available for download at https://ieee-dataport.org/.
CTP data from 258 consecutive patients were retrospectively obtained from the hospital PACS of Città della Salute e della Scienza di Torino (Molinette). CTP acquisition parameters were as follows: Scanner GE, 64 slices, 80 kV, 150 mAs, 44.5 sec duration, 89 volumes (40 mm axial coverage), injection of 40 ml of Iodine contrast agent (300 mg/ml) at 4 ml/s speed.
Meta Kaggle Code
kaggle.com
zip
Updated Aug 28, 2025
Cite
Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
Available download formats: zip (154608704973 bytes)
Dataset updated
Aug 28, 2025
Dataset authored and provided by
Kaggle (http://kaggle.com/)
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Explore our public notebook content!

Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

Why we’re releasing this dataset

By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
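
As a rough illustration of the join described above, the sketch below indexes the KernelVersions CSV by id and looks up the metadata row for a code file from its file name. The column name `Id` and the helper names are assumptions for illustration; check the actual CSV header before relying on them.

```python
import csv
from pathlib import Path


def load_kernel_versions(csv_path, id_column="Id"):
    """Index KernelVersions rows by id (kept as a string).

    The default column name "Id" is an assumption, not confirmed by the text above.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[id_column]: row for row in csv.DictReader(handle)}


def metadata_for_file(code_file, versions):
    """Return the metadata row for a code file named after its version id, or None."""
    return versions.get(Path(code_file).stem)
```

For the full Meta Kaggle tables, a dataframe library would probably be more practical than an in-memory dictionary; the dictionary keeps this sketch dependency-free.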

File organization

The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; for example, folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; for example, 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have far fewer than 1 thousand files because private and interactive sessions are not included.
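
The folder layout above implies a simple mapping from a version id to its relative path, sketched below. The function name `version_path`, the default `ipynb` extension, and the absence of zero-padding in folder names are assumptions; the description does not say how a subfolder such as 045 is written.

```python
def version_path(version_id, extension="ipynb"):
    """Build the relative path for a kernel version id.

    Assumes folder names are plain integers without zero-padding and that
    the file is named "<id>.<extension>"; both are illustrative guesses.
    """
    top = version_id // 1_000_000          # e.g. 123 for 123,456,789
    sub = (version_id // 1_000) % 1_000    # e.g. 456 for 123,456,789
    return f"{top}/{sub}/{version_id}.{extension}"


print(version_path(123_456_789))
```

Under these assumptions, version 123,456,789 maps to `123/456/123456789.ipynb`.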

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
ExioML: Global Sectoral Sustainability Dataset
kaggle.com
Updated Jun 14, 2024
Cite
Yanming Yann Guo (2024). ExioML: Global Sectoral Sustainability Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8690108
Available in Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8690108
Dataset updated
Jun 14, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yanming Yann Guo
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
🙋‍♂️ Introduction

ExioML is the first ML-ready benchmark dataset in eco-economic research, designed for global sectoral sustainability analysis. It addresses significant research gaps by leveraging the high-quality, open-source EE-MRIO dataset ExioBase 3.8.2. ExioML covers 163 sectors across 49 regions from 1995 to 2022, overcoming data inaccessibility issues. The dataset includes both factor accounting in tabular format and footprint networks in graph structure.

We demonstrate a GHG emission regression task using a factor accounting table, comparing the performance of shallow and deep models. The results show a low Mean Squared Error (MSE), quantifying sectoral GHG emissions in terms of value-added, employment, and energy consumption, validating the dataset's usability. The footprint network in ExioML, inherent in the multi-dimensional MRIO framework, enables tracking resource flow between international sectors.

ExioML offers promising research opportunities, such as predicting embodied emissions through international trade, estimating regional sustainability transitions, and analyzing the topological changes in global trading networks over time. It reduces barriers and intensive data pre-processing for ML researchers, facilitates the integration of ML and eco-economic research, and provides new perspectives for sound climate policy and global sustainable development.

📊 Dataset

ExioML supports graph and tabular structure learning algorithms through the Footprint Network and Factor Accounting table. The dataset includes the following factors in PxP and IxI:

Region (Categorical feature)

Sector (Categorical feature)

Value Added M.EUR

Employment 1000 p.

GHG emissions kg CO2 eq.

Energy Carrier Net Total TJ

Year (Numerical feature)

☁️ Factor Accounting

The Factor Accounting table shares common features with the Footprint Network and summarizes the total heterogeneous characteristics of various sectors.

🚞 Footprint Network

The Footprint Network models the high-dimensional global trading network, capturing its economic, social, and environmental impacts. This network is structured as a directed graph, where directionality represents sectoral input-output relationships, delineating sectors by their roles as sources (exporting) and targets (importing). The basic element in the ExioML Footprint Network is international trade across different sectors with features such as value-added, emission amount, and energy input. The Footprint Network helps identify critical sectors and paths for sustainability management and optimization. The Footprint Network is hosted on Zenodo.
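
To make the directed-graph description concrete, here is a minimal sketch of how such a network could be represented and queried. The `Flow` class, its field names, and the aggregation helper are hypothetical and do not reflect the actual ExioML file schema; any values passed in would be the reader's own data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One directed edge: trade from a source sector to a target sector.

    Field names are illustrative, not the ExioML column names.
    """
    source: tuple        # (region, sector) of the exporting side
    target: tuple        # (region, sector) of the importing side
    value_added: float
    emissions: float
    energy: float


def emissions_by_target(flows):
    """Sum the emissions carried by all incoming edges of each target node."""
    totals = defaultdict(float)
    for flow in flows:
        totals[flow.target] += flow.emissions
    return dict(totals)
```

Aggregating over `source` instead of `target` would give the exporting-side view of the same edges.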

🔗 Code and Data Availability

The ExioML development toolkit in Python and the regression model used for validation are available on the GitHub repository: https://github.com/YVNMINC/ExioML. The complete ExioML dataset is hosted by Zenodo: https://zenodo.org/records/10604610.

💡 Additional Information

More details about the dataset are available in our paper: ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability, accepted by the ICLR 2024 Climate Change AI workshop: https://arxiv.org/abs/2406.09046.

📄 Citation

@inproceedings{guo2024exioml,
  title={ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability},
  author={Yanming, Guo and Jin, Ma},
  booktitle={ICLR 2024 Workshop on Tackling Climate Change with Machine Learning},
  year={2024}
}

🌟 Reference

Stadler, Konstantin, et al. "EXIOBASE 3." Zenodo. Retrieved March 22 (2021): 2023.
R code used to estimate public supply consumptive water use
catalog.data.gov
data.usgs.gov
+1 more
Updated Aug 29, 2024
Cite
U.S. Geological Survey (2024). R code used to estimate public supply consumptive water use [Dataset]. https://catalog.data.gov/dataset/r-code-used-to-estimate-public-supply-consumptive-water-use
Dataset updated
Aug 29, 2024
Dataset provided by
U.S. Geological Survey
Description
This child item describes R code used to determine public supply consumptive use estimates. Consumptive use was estimated by scaling an assumed fraction of deliveries used for outdoor irrigation by spatially explicit estimates of evaporative demand, using estimated domestic and commercial, industrial, and institutional deliveries from the public supply delivery machine learning model child item. This method scales public supply water service area outdoor water use by the relationship between service area gross reference evapotranspiration provided by GridMET and annual continental U.S. (CONUS) growing season maximum evapotranspiration. This relationship to climate at the CONUS scale could result in over- or under-estimation of consumptive use at public supply service areas where local variations differ from national variations in climate. This method also assumes that 50% of total domestic and commercial, industrial, and institutional deliveries is used for outdoor purposes. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: PS_ConsumptiveUse.zip - a zip file containing input datasets, scripts, and output datasets
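
One possible reading of the scaling described above is sketched here in Python (the released code is in R and may differ). It assumes the "relationship" is a simple ratio of service-area reference evapotranspiration to the CONUS growing-season maximum; that ratio, the function name, and the parameter names are assumptions, and only the 50% outdoor fraction comes from the description.

```python
def consumptive_use(deliveries, service_area_et, conus_max_et, outdoor_fraction=0.5):
    """Estimate consumptive use from deliveries under stated assumptions.

    outdoor_fraction=0.5 follows the description; treating the climate
    relationship as a plain ratio is an illustrative assumption.
    """
    if conus_max_et <= 0:
        raise ValueError("conus_max_et must be positive")
    outdoor_use = outdoor_fraction * deliveries
    return outdoor_use * (service_area_et / conus_max_et)


print(consumptive_use(deliveries=100.0, service_area_et=600.0, conus_max_et=1200.0))
```

With these illustrative inputs, half of the deliveries are treated as outdoor use and half of that is counted as consumptive, giving 25.0 in the same units as the deliveries.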

Cite

Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

Available download formats: txt

Unique identifier

https://doi.org/10.6084/m9.figshare.21967265.v1

Dataset updated

May 30, 2023

Dataset provided by

Figshare (http://figshare.com/)

Authors

Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
