100+ datasets found

Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
d
A Dataset for Machine Learning Algorithm Development
catalog.data.gov
s.cnmilf.com
+1more
Updated May 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
Explore at:
Dataset updated
May 1, 2024
Dataset provided by
(Point of Contact, Custodian)
Description
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
h
Machine-Learning-QA-dataset
huggingface.co
Updated Jan 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prasad Mahamulkar (2024). Machine-Learning-QA-dataset [Dataset]. https://huggingface.co/datasets/prsdm/Machine-Learning-QA-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 11, 2025
Authors
Prasad Mahamulkar
Description
prsdm/Machine-Learning-QA-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Z
MISATO - Machine learning dataset for structure-based drug discovery
data.niaid.nih.gov
zenodo.org
Updated May 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Sattler (2023). MISATO - Machine learning dataset for structure-based drug discovery [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7711952
Explore at:
Dataset updated
May 25, 2023
Dataset provided by
Erinc Merdivan
Filipe Menezes
Till Siebenmorgen
Fabian J. Theis
Michael Sattler
Stefan Kesselheim
Marie Piraud
Sabrina Benassou
Grzegorz M. Popowicz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet, relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple python data-loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next-generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.
h
ML-ArXiv-Papers
huggingface.co
opendatalab.com
Updated Jun 29, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Connor Shorten (2022). ML-ArXiv-Papers [Dataset]. https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2022
Authors
Connor Shorten
License
https://choosealicense.com/licenses/afl-3.0/https://choosealicense.com/licenses/afl-3.0/
Description
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
n
A dataset for machine learning research in the field of stress analyses of...
narcis.nl
data.mendeley.com
Updated Jul 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matej, J (via Mendeley Data) (2020). A dataset for machine learning research in the field of stress analyses of mechanical structures [Dataset]. http://doi.org/10.17632/wzbzznk8z3.2
Explore at:
Unique identifier
https://doi.org/10.17632/wzbzznk8z3.2
Dataset updated
Jul 25, 2020
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Matej, J (via Mendeley Data)
Description
The dataset is prepared and intended as a data source for development of a stress analysis method based on machine learning. It consists of finite element stress analyses of randomly generated mechanical structures. The dataset contains more than 270,794 pairs of stress analyses images (von Mises stress) of randomly generated 2D structures with predefined thickness and material properties. All the structures are fixed at their bottom edges and loaded with gravity force only. See PREVIEW directory with some examples. The zip file contains all the files in the dataset.
Machine Learning Materials Datasets
figshare.com
txt
Updated Sep 11, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dane Morgan (2018). Machine Learning Materials Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7017254.v5
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7017254.v5
Dataset updated
Sep 11, 2018
Dataset provided by
Figsharehttp://figshare.com/
Authors
Dane Morgan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Three datasets are intended to be used for exploring machine learning applications in materials science. They are formatted in simple form and in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML).Each dataset is a materials property of interest and associated descriptors. For detailed information, please see the attached REAME text file.The first dataset for dilute solute diffusion can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset has been calculated with DFT methods.The second dataset for perovskite stability gives energies of compostions of potential perovskite materials relative to the convex hull calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure in order to perform more advanced grouping of the data.The third dataset is a metallic glasses dataset which has values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column is included for majority element for each alloy, which can be an interesting property to group on during tests.
ML Datasets
kaggle.com
Updated May 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bikram Saha (2023). ML Datasets [Dataset]. https://www.kaggle.com/datasets/imbikramsaha/ml-datasets/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 1, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bikram Saha
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset contains a diverse range of examples, including classification, regression, clustering, and dimensionality reduction problems, with varying levels of complexity and varying numbers of features. Each dataset comes with a detailed description of the problem and the corresponding features, making it easy to understand and work with. Additionally, the dataset provides an opportunity for machine learning enthusiasts to experiment with different SkLearn algorithms and evaluate their performance on different datasets. This dataset is perfect for both beginners and advanced practitioners looking to hone their skills in various machine learning techniques.
IMDB Dataset For Machine Learning
kaggle.com
Updated Sep 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KHUSHI YADAV (2023). IMDB Dataset For Machine Learning [Dataset]. https://www.kaggle.com/datasets/khushiyadav2022/imdb-dataset-for-machine-learning
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 25, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
KHUSHI YADAV
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
"Movie Recommendation on the IMDB Dataset: A Journey into Machine Learning" is an exciting project focused on leveraging the IMDB Dataset for developing an advanced movie recommendation system. This project aims to explore the vast potential of machine learning techniques in providing personalized movie recommendations to users.

The IMDB Dataset, comprising a wealth of movie information including genres, ratings, and user reviews, serves as the foundation for this project. By harnessing the power of machine learning algorithms and data analysis, the project seeks to build a recommendation system that can accurately suggest movies tailored to each individual's preferences.
Machine learning datasets
figshare.com
xlsx
Updated Mar 29, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Songbo Wang (2023). Machine learning datasets [Dataset]. http://doi.org/10.6084/m9.figshare.21640544.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21640544.v1
Dataset updated
Mar 29, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Songbo Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning datasets
D
SYNERGY - Open machine learning dataset on study selection in systematic...
dataverse.nl
csv, json, txt, zip
Updated Apr 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
Explore at:
txt(212), json(702), zip(16028323), json(19426), txt(263), zip(3560967), txt(305), json(470), txt(279), zip(2355371), json(23201), csv(460956), txt(200), json(685), json(546), csv(63996), zip(2989015), zip(5749455), txt(331), txt(315), json(691), json(23775), csv(672721), json(468), txt(415), json(22778), csv(31919), csv(746832), json(18392), zip(62992826), csv(234822), txt(283), zip(34788857), json(475), txt(242), json(533), csv(42227), json(24548), zip(738232), json(22477), json(25491), zip(11463283), json(17741), csv(490660), json(19662), json(578), csv(19786), zip(14708207), zip(24619707), zip(2404439), json(713), json(27224), json(679), json(26426), txt(185), json(906), zip(18534723), json(23550), txt(266), txt(317), zip(6019723), json(33943), txt(436), csv(388378), json(469), zip(2106498), txt(320), csv(451336), txt(338), zip(19428163), json(14326), json(31652), txt(299), csv(96153), txt(220), csv(114789), json(15452), csv(5372708), json(908), csv(317928), csv(150923), json(465), csv(535584), json(26090), zip(8164831), json(19633), txt(316), json(23494), csv(133950), json(18638), csv(3944082), json(15345), json(473), zip(4411063), zip(10396095), zip(835096), txt(255), json(699), csv(654705), txt(294), csv(989865), zip(1028035), txt(322), zip(15085090), txt(237), txt(310), json(756), json(30628), json(19490), json(25908), txt(401), json(701), zip(5543909), json(29397), zip(14007470), json(30058), zip(58869042), csv(852937), json(35711), csv(298011), csv(187163), txt(258), zip(3526740), json(568), json(21552), zip(66466788), csv(215250), json(577), csv(103010), txt(306), zip(11840006)Available download formats
Unique identifier
https://doi.org/10.34894/HE6NAQ
Dataset updated
Apr 24, 2023
Dataset provided by
DataverseNL
Authors
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
d
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
f
Data from: Count-Based Morgan Fingerprint: A More Efficient and...
acs.figshare.com
xlsx
Updated Jul 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shifa Zhong; Xiaohong Guan (2023). Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties [Dataset]. http://doi.org/10.1021/acs.est.3c02198.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.est.3c02198.s002
Dataset updated
Jul 5, 2023
Dataset provided by
ACS Publications
Authors
Shifa Zhong; Xiaohong Guan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only qualifies the presence or absence of an atom group but also quantifies its counts in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF to compare them in terms of the model’s predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF is dependent on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of data sets calculated by B-MF and C-MF. Model interpretation results show that the C-MF-based model can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.
m
Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...
data.mendeley.com
Updated Dec 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
Explore at:
Unique identifier
https://doi.org/10.17632/xw7r4tt54g.1
Dataset updated
Dec 6, 2022
Authors
Zihao Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This traffic dataset contains a balance size of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is a secondary csv feature data that is composed of six public traffic datasets.

Our dataset is curated based on two criteria: The first criterion is to combine widely considered public datasets which contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as Malware Capture Facility Project datasets. The second criterion is to ensure the final dataset balance of encrypted malicious and legitimate network traffic.

Based on the criteria, 6 public datasets are selected. After data pre-processing, details of each selected public dataset and the size of different encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarized the malicious and legitimate traffic size we selected from each selected public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we are able to observe that encrypted malicious and legitimate traffic equally contributes to approximately 50% of the final composed dataset.

The datasets now made available were prepared to aim at encrypted malicious traffic detection. Since the dataset is used for machine learning or deep learning model training, a sample of train and test sets are also provided. The train and test datasets are separated based on 1:4. Such datasets can be used for machine learning or deep learning model training and testing based on selected features or after processing further data pre-processing.
i
UCI datasets
ieee-dataport.org
Updated May 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuan Sun (2025). UCI datasets [Dataset]. https://ieee-dataport.org/documents/uci-datasets
Explore at:
Dataset updated
May 14, 2025
Authors
Yuan Sun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
biology
Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI...
datarade.ai
Updated Dec 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2023). Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-re-id-data-60-000-id-image-video-ai-ml-train-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Dec 8, 2023
Dataset authored and provided by
Nexdata
Area covered
Cuba, Turkmenistan, Bolivia (Plurinational State of), United Arab Emirates, Russian Federation, Portugal, Luxembourg, Ecuador, Sri Lanka, Trinidad and Tobago
Description
Specifications Data size : 60,000 ID

Population distribution : the race distribution is Asians, Caucasians and black people, the gender distribution is male and female, the age distribution is from children to the elderly

Collecting environment : including indoor and outdoor scenes (such as supermarket, mall and residential area, etc.)

Data diversity : different ages, different time periods, different cameras, different human body orientations and postures, different ages collecting environment

Device : surveillance cameras, the image resolution is not less than 1,9201,080

Data format : the image data format is .jpg, the annotation file format is .json

Annotation content : human body rectangular bounding boxes, 15 human body attributes

Quality Requirements : A rectangular bounding box of human body is qualified when the deviation is not more than 3 pixels, and the qualified rate of the bounding boxes shall not be lower than 97%;Annotation accuracy of attributes is over 97%

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data.These ready-to-go Identity Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/computervision?source=Datarade
n
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
LLM prompts in the context of machine learning
kaggle.com
Updated Jul 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 1, 2024
Dataset provided by
Kaggle
Authors
Jordan Nelson
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach. In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously. This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving impressive training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was rigorously evaluated in a production environment, where it continued to demonstrate exceptional accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar to ensure that the model can effectively distinguish between different machine learning models, even when the prompts are closely related.

KNN How would you create a KNN model to classify emails as spam or not spam based on their content and metadata? How could you implement a KNN model to classify handwritten digits using the MNIST dataset? How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences? How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms etc? Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width? How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments? Can you create a KNN model for me that could be used in malware classification? Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic? Can you make a KNN model that would predict the stock price of a given stock for the next week? Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

Decision Tree Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me? How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content? What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses? Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data? In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used? Can you create a decision tree classifier for me? Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies? Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget? How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website? How do I create a decision tree for time-series forecasting?

Random Forest Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me? In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...

Facebook

Twitter

Click to copy link

Link copied

Cite

Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning

Machine Learning Dataset

Explore at:

.json, .csv, .xlsxAvailable download formats

Dataset updated

Dec 23, 2024

Dataset authored and provided by

Bright Datahttps://brightdata.com/

License

https://brightdata.com/licensehttps://brightdata.com/license

Area covered

Worldwide

Description

Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

Clear search

Close search

Google apps

Main menu

Machine Learning Dataset

A Dataset for Machine Learning Algorithm Development

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Machine-Learning-QA-dataset

MISATO - Machine learning dataset for structure-based drug discovery

ML-ArXiv-Papers

A dataset for machine learning research in the field of stress analyses of...

Machine Learning Materials Datasets

ML Datasets

IMDB Dataset For Machine Learning

Machine learning datasets

SYNERGY - Open machine learning dataset on study selection in systematic...

Training dataset for NABat Machine Learning V1.0

Data from: Count-Based Morgan Fingerprint: A More Efficient and...

Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...

UCI datasets

Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI...

Data from: Assessing predictive performance of supervised machine learning...

LLM prompts in the context of machine learning

Machine Learning Dataset