100+ datasets found

Datasets
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12958037.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Bastian Eichenberger; YinXiu Zhan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.
Images in CSV datasets
kaggle.com
zip
Updated Oct 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pascal (2024). Images in CSV datasets [Dataset]. https://www.kaggle.com/datasets/pyim59/images-in-csv-datasets
Explore at:
zip(347504240 bytes)Available download formats
Dataset updated
Oct 14, 2024
Authors
Pascal
Description
Images sous forme de fichiers CSV pour une application de méthodes de machine learning "classiques" Ces datasets sont utilisés pour le cours de Centrale Lille sur le Machine Learning de Pascal Yim

"mnist_big.csv"

Reconnaissance d'images de chiffres manuscrits

Version "mnist_small.csv" avec moins de données pouvant servir aussi d'ensemble de test

Source : https://www.kaggle.com/datasets/oddrationale/mnist-in-csv

"sign_mnist_big.csv"

Reconnaissance d'images de gestes de la langue des signes

Version "sign_mnist_small.csv" avec moins de données pouvant servir aussi d'ensemble de test

Source : https://www.kaggle.com/datasets/datamunge/sign-language-mnist

"zalando_small.csv"

Reconnaissance de vêtements et chaussures (Zalando)

Source : https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000

"hmnist_8_8_RGB.csv"

Reconnaissance de tumeurs de la peau (images en couleurs, trois valeurs R,G,B par pixel)

Autres versions avec des images plus petites et/ou en niveaux de gris

Source : https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000

"cifar10_small.csv"

Reconnaissance de petites images en couleurs dans 10 catégories Version en CSV du dataset CIFAR10

Source : https://www.kaggle.com/datasets/fedesoriano/cifar10-python-in-csv?select=train.csv
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Top 2500 Kaggle Datasets
kaggle.com
Updated Feb 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/7637365
Dataset updated
Feb 16, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Saket Kumar
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

Column Definitions:

Dataset Name: Name of the dataset. Created By: Creator(s) of the dataset. Last Updated in number of days: Time elapsed since last update. Usability Score: Score indicating the ease of use. Number of File: Quantity of files included. Type of file: Format of files (e.g., CSV, JSON). Size: Size of the dataset. Total Votes: Number of votes received. Category: Categorization of the dataset's subject matter.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Multiple Machine Learning Datasets
kaggle.com
zip
Updated Nov 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Amoh Adjei (2024). Multiple Machine Learning Datasets [Dataset]. https://www.kaggle.com/datasets/ericamohadjei/trending-public-datasets
Explore at:
zip(15544969 bytes)Available download formats
Dataset updated
Nov 12, 2024
Authors
Eric Amoh Adjei
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Trending Public Datasets Overview

These Datasets contain a diverse collection of datasets intended for machine learning research and practice. Each dataset is curated to support different types of machine learning challenges, including classification, regression, and clustering. Below is a detailed list of the datasets available in this repository, along with descriptions and links to their sources.

Available Datasets

Iris Dataset

Description: This classic dataset includes measurements for 150 iris flowers from three different species. It includes four features: sepal length, sepal width, petal length, and petal width. Source: Iris Dataset Source Files: iris.csv

DHFR Dataset

Description: Contains data for 325 molecules with biological activity against the DHFR enzyme, relevant in anti-malarial drug research. It includes 228 molecular descriptors as features. Source: DHFR Dataset Source Files: dhfr.csv

Heart Disease Dataset (Cleveland)

Description: Comprises diagnostic measurements from 303 patients tested for heart disease at the Cleveland Clinic. It features 13 clinical attributes. Source: UCI Machine Learning Repository Files: heart-disease-cleveland.csv

HCV Data

Description: Detailed datasets related to Hepatitis C Virus (HCV) progression, with features for classification and regression tasks. Files: HCV_NS5B_Curated.csv, hcv_classification.csv, hcv_regression.arff

NBA Seasons Stats

Description: Player statistics from the NBA 2020 and 2021 seasons for detailed sports analytics. Files: NBA_2020.csv, NBA_2021.csv

Boston Housing Dataset

Description: Data concerning housing values in the suburbs of Boston, suitable for regression analysis. Files: BostonHousing.csv, BostonHousing_train.csv, BostonHousing_test.csv

Acetylcholinesterase Inhibitor Bioactivity

Description: Chemical bioactivity data against acetylcholinesterase, a target relevant to Alzheimer's research. It includes raw and processed formats with chemical fingerprints. Files: acetylcholinesterase_01_bioactivity_data_raw.csv to acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv

California Housing Dataset

Description: Data aimed at predicting median house prices in California districts. Files: california_housing_train.csv, california_housing_test.csv

Virtual Reality Experiences Data

Description: Data from user experiences with various virtual reality setups to study user engagement and satisfaction. Files: Virtual Reality Experiences-data.csv

Fast-Food Chains in USA

Description: Overview of various fast-food chains operating in the USA, their locations, and popularity. Files: Fast-Food Chains in USA.csv

Contributing We welcome contributions to this dataset repository. If you have a dataset that you believe would be beneficial for the machine learning community, please see our contribution guidelines in CONTRIBUTING.md.

License This dataset is available under the MIT License.
m
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Explore at:
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Dataset updated
Feb 20, 2020
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in google docs and remote CLI's sessions by SSH. Bulk data transfer: applications that perform a transfer of large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository among others. Web browsing: contains all the generated traffic while searching and consuming different web pages. Examples of those pages are several blogs and new sites and the moodle of the university. Vídeo playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The most known server used are Twitch and Youtube but the university online classroom has also been used. Idle behaviour: is composed by the background traffic generated by the user computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like google docs, YouTube and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.

The amount of data is stated as follows:

Bulk : 19 traces, 3599 s of total duration, 8704 MBytes of pcap files Video : 23 traces, 4496 s, 1405 MBytes Web : 23 traces, 4203 s, 148 MBytes Interactive : 42 traces, 8934 s, 30.5 MBytes Idle : 52 traces, 6341 s, 0.69 MBytes

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
D
Disease Prediction Using Machine Learning
dataandsons.com
csv, zip
Updated Oct 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
test test (2022). Disease Prediction Using Machine Learning [Dataset]. https://www.dataandsons.com/categories/machine-learning/disease-prediction-using-machine-learning
Explore at:
csv, zipAvailable download formats
Dataset updated
Oct 31, 2022
Dataset provided by
Data & Sons
Authors
test test
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
About this Dataset

This dataset will help you apply your existing knowledge to great use. This dataset has 132 parameters on which 42 different types of diseases can be predicted. This dataset consists of 2 CSV files. One of them is for training and the other is for testing your model. Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences and the last column is the prognosis. These symptoms are mapped to 42 diseases you can classify these sets of symptoms. You are required to train your model on training data and test it on testing data.

Category

Machine Learning

Keywords

medicine,disease,Healthcare,ML,Machine Learning

Row Count

4962

Price

$109.00

Code4ML 2.0

zenodo.org

csv, txt

Updated May 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737

Explore at:

csv, txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.15465737

Dataset updated

May 19, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonimous authors; Anonimous authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

The original dataset is organized into multiple CSV files, each containing structured data on different entities:

code_blocks.csv: Contains raw code snippets extracted from Kaggle.
kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

Table 1. code_blocks.csv structure

Column	Description
code_blocks_index	Global index linking code blocks to markup_data.csv.
kernel_id	Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
code_block_id	Position of the code block within the notebook.
code_block	The actual machine learning code snippet.

Table 2. kernels_meta.csv structure

Column	Description
kernel_id	Identifier for the Kaggle Jupyter notebook.
kaggle_score	Performance metric of the notebook.
kaggle_comments	Number of comments on the notebook.
kaggle_upvotes	Number of upvotes the notebook received.
kernel_link	URL to the notebook.
comp_name	Name of the associated Kaggle competition.

Table 3. competitions_meta.csv structure

Column	Description
comp_name	Name of the Kaggle competition.
description	Overview of the competition task.
data_type	Type of data used in the competition.
comp_type	Classification of the competition.
subtitle	Short description of the task.
EvaluationAlgorithmAbbreviation	Metric used for assessing competition submissions.
data_sources	Links to datasets used.
metric type	Class label for the assessment metric.

Table 4. markup_data.csv structure

Column	Description
code_block	Machine learning code block.
too_long	Flag indicating whether the block spans multiple semantic types.
marks	Confidence level of the annotation.
graph_vertex_id	ID of the semantic type.

The dataset allows mapping between these tables. For example:

code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csvvia the code_blocks_index column.

Code4ML 2.0 Enhancements

The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the aim of LLM.

Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

competitions_meta_2.csv is enriched with data_cards, decsribing the data used in the competitions.

Applications

The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

Code generation
Code understanding
Natural language processing of code-related tasks

H
Iris dataset for machine learning
dataverse.harvard.edu
search.datacite.org
Updated Oct 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kyle M. Monahan (2020). Iris dataset for machine learning [Dataset]. http://doi.org/10.7910/DVN/R2RGXR
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/R2RGXR
Dataset updated
Oct 19, 2020
Dataset provided by
Harvard Dataverse
Authors
Kyle M. Monahan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is an iris dataset commonly used in machine learning. Accessed on 10-19-2020 from the following URL: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/Iris.xls
Z
PFAM Protein Families Dataset for Machine Learning
data.niaid.nih.gov
Updated Jul 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik, Andreas (2023). PFAM Protein Families Dataset for Machine Learning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8132094
Explore at:
Dataset updated
Jul 20, 2023
Dataset provided by
THM University of Applied Sciences
Authors
Dominik, Andreas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A cleaned dataset of protein sequences and protein families for classification. The dataset is exported from PFAM as of June 2023 and curated to achieve the following characteristics:

only protein families included with >=100 sequences

families with >2000 sequences are truncated and only represented by 2000 sequences (chosen randomly)

only proteins with sequence lengths between 100 and 1000

amino acid sequences are form PDB; chains are concatenated only if not similar

The dataset is not balanced, numbers of sequences per family in PFAM and in in dataset are:

families: 62, sequences: 46872 total (in PFAM) -> included (in dataset)

Number in family ALLERGEN: 122 -> 122 Number in family APOPTOSIS: 381 -> 381 Number in family BIOSYNTHETIC PROTEIN: 346 -> 346 Number in family BIOTIN BINDING PROTEIN: 165 -> 165 Number in family BLOOD CLOTTING: 138 -> 138 Number in family CALCIUM BINDING PROTEIN: 135 -> 135 Number in family CELL ADHESION: 1116 -> 1116 Number in family CELL CYCLE: 511 -> 511 Number in family CHAPERONE: 964 -> 964 Number in family CONTRACTILE PROTEIN: 158 -> 158 Number in family CYTOKINE: 191 -> 191 Number in family DE NOVO PROTEIN: 253 -> 253 Number in family DNA BINDING PROTEIN: 1008 -> 1008 Number in family ELECTRON TRANSPORT: 841 -> 841 Number in family FLUORESCENT PROTEIN: 348 -> 348 Number in family GENE REGULATION: 607 -> 607 Number in family HORMONE: 272 -> 272 Number in family HORMONE GROWTH FACTOR: 159 -> 159 Number in family HORMONE RECEPTOR: 121 -> 121 Number in family HYDROLASE: 19551 -> 2000 Number in family HYDROLASE ANTIBIOTIC: 120 -> 120 Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 -> 2000 Number in family HYDROLASE INHIBITOR: 315 -> 315 Number in family IMMUNE SYSTEM: 3333 -> 2000 Number in family IMMUNOGLOBULIN: 155 -> 155 Number in family ISOMERASE: 2457 -> 2000 Number in family ISOMERASE ISOMERASE INHIBITOR: 139 -> 139 Number in family LECTIN: 139 -> 139 Number in family LIGASE: 1780 -> 1780 Number in family LIGASE LIGASE INHIBITOR: 163 -> 163 Number in family LIPID BINDING PROTEIN: 421 -> 421 Number in family LIPID TRANSPORT: 115 -> 115 Number in family LUMINESCENT PROTEIN: 221 -> 221 Number in family LYASE: 4150 -> 2000 Number in family LYASE LYASE INHIBITOR: 298 -> 298 Number in family MEMBRANE PROTEIN: 1338 -> 1338 Number in family METAL BINDING PROTEIN: 951 -> 951 Number in family METAL TRANSPORT: 409 -> 409 Number in family MOTOR PROTEIN: 195 -> 195 Number in family OXIDOREDUCTASE: 11531 -> 2000 Number in family OXIDOREDUCTASE OXIDOREDUCTASE INHIBITOR: 766 -> 766 Number in family OXYGEN STORAGE: 127 -> 127 Number in family OXYGEN STORAGE TRANSPORT: 260 -> 260 Number in family OXYGEN TRANSPORT: 414 -> 414 Number in family PHOTOSYNTHESIS: 173 -> 173 Number in family PLANT PROTEIN: 255 -> 255 Number in family PROTEIN BINDING: 1613 -> 1613 Number in family PROTEIN TRANSPORT: 693 -> 693 Number in family RECEPTOR: 108 -> 108 Number in family REPLICATION: 161 -> 161 Number in family RNA BINDING PROTEIN: 546 -> 546 Number in family SIGNALING PROTEIN: 2312 -> 2000 Number in family STRUCTURAL PROTEIN: 869 -> 869 Number in family SUGAR BINDING PROTEIN: 1250 -> 1250 Number in family TOXIN: 546 -> 546 Number in family TRANSCRIPTION REGULATION: 3283 -> 2000 Number in family TRANSFERASE: 14724 -> 2000 Number in family TRANSFERASE INHIBITOR: 126 -> 126 Number in family TRANSFERASE TRANSFERASE INHIBITOR: 2465 -> 2000 Number in family TRANSLATION: 370 -> 370 Number in family TRANSPORT PROTEIN: 2782 -> 2000 Number in family VIRAL PROTEIN: 2150 -> 2000

Files:

families.csv: list of protein families with frequencies

pfam_46872x62.csv: full dataset with amino acid sequences as string (one-letter code)

pfam-trn-xy.csv: training dataset with amino acid sequences as tokens (1..25) and padded to a common length of 1000 with padding token 0:

Amino acid | Token | Description -------------------------------- C | 1 | Cysteine S | 2 | Serine T | 3 | Threonine A | 4 | Alanine G | 5 | Glycine P | 6 | Proline D | 7 | Aspartic acid E | 8 | Glutamic acid Q | 9 | Glutamine N | 10 | Asparagine H | 11 | Histidine R | 12 | Arginine K | 13 | Lysine M | 14 | Methionine I | 15 | Isoleucine L | 16 | Leucine V | 17 | Valine W | 18 | Tryptophan Y | 19 | Tyrosine F | 20 | Phenylalanine B | 21 | Aspartic acid or Asparagine Z | 22 | Glutamic acid or Glutamine J | 23 | Leucine or Isoleucine U | 24 | Selenocysteine X | 25 | Unknown amino acid . | 0 | padding token

pfam-trn-labels.csv: plain-text labels for training data

pfam-tst-xy.csv

pfam-tst-labels.csv: test data

pfam-balanced-trn-xy.csv

pfam-balanced-trn-labels.csv:

pfam-balanced-tst-xy.csv

pfam-balanced-tst-labels.csv: balanced datasets, created by oversampling.
D:\machine learning\Datasets BITS\usedcarsales.csv
kaggle.com
zip
Updated May 23, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arun kumar (2018). D:\machine learning\Datasets BITS\usedcarsales.csv [Dataset]. https://www.kaggle.com/arun3591/dmachine-learningdatasets-bitsusedcarsalescsv
Explore at:
zip(16124 bytes)Available download formats
Dataset updated
May 23, 2018
Authors
Arun kumar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Arun kumar

Released under CC0: Public Domain

Contents
Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness | | ------------- | ------------- | ------------- | | bookId | Book Identifier as in goodreads.com | 100 | | title | Book title | 100 | | series | Series Name | 45 | | author | Book's Author | 100 | | rating | Global goodreads rating | 100 | | description | Book's description | 97 | | language | Book's language | 93 | | isbn | Book's ISBN | 92 | | genres | Book's genres | 91 | | characters | Main characters | 26 | | bookFormat | Type of binding | 97 | | edition | Type of edition (ex. Anniversary Edition) | 9 | | pages | Number of pages | 96 | | publisher | Editorial | 93 | | publishDate | publication date | 98 | | firstPublishDate | Publication date of first edition | 59 | | awards | List of awards | 20 | | numRatings | Number of total ratings | 100 | | ratingsByStars | Number of ratings by stars | 97 | | likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 | | setting | Story setting | 22 | | coverImg | URL to cover image | 99 | | bbeScore | Score in Best Books Ever list | 100 | | bbeVotes | Number of votes in Best Books Ever list | 100 | | price | Book's price (extracted from Iberlibro) | 73 |
m
Multi-Laboratory Hematoxylin and Eosin Staining Variance Unsupervised...
data.mendeley.com
figshare.com
Updated Sep 12, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabi Prezja (2022). Multi-Laboratory Hematoxylin and Eosin Staining Variance Unsupervised Machine Learning Dataset [Dataset]. http://doi.org/10.17632/b9dxsybhm9.1
Explore at:
Unique identifier
https://doi.org/10.17632/b9dxsybhm9.1
Dataset updated
Sep 12, 2022
Authors
Fabi Prezja
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We provide the generated dataset used for unsupervised machine learning in [1]. The data is in CSV format and contains all principal components and ground truth labels, per tissue type. Tissue type codes used are; C1 for kidney, C2 for skin, C3 for colon, and 'PC' for the principal component. Please see the original design in [1] for feature extraction specifications. Features have been extracted independently for each tissue type.

Reference: Prezja, F.; Pölönen, I.; Äyrämö, S.; Ruusuvuori, P.; Kuopio, T. H&E Multi-Laboratory Staining Variance Exploration with Machine Learning. Appl. Sci. 2022, 12, 7511. https://doi.org/10.3390/app12157511
d
Machine learning model that estimates total monthly and annual per capita...
catalog.data.gov
data.usgs.gov
+2more
Updated Oct 8, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-total-monthly-and-annual-per-capita-public-supply-wa
Explore at:
Dataset updated
Oct 8, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public supply water use for the period 2000-2020. This data release contains model input feature datasets, python codes used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files: PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day Note: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC. STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply total water use from 2000-2020 STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply groundwater use for 2000-2020 STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply surface water use for 2000-2020 public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public supply water use machine learning model version_history_MLmodel.txt - a txt file describing changes in this version
Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jan 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Federica Pepe; Vittoria Nardone; Vittoria Nardone; Antonio Mastropaolo; Antonio Mastropaolo; Gerardo Canfora; Gerardo Canfora; Gabriele BAVOTA; Gabriele BAVOTA; Massimiliano Di Penta; Massimiliano Di Penta; Federica Pepe (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10058142
Dataset updated
Jan 16, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Federica Pepe; Vittoria Nardone; Vittoria Nardone; Antonio Mastropaolo; Antonio Mastropaolo; Gerardo Canfora; Gerardo Canfora; Gabriele BAVOTA; Gabriele BAVOTA; Massimiliano Di Penta; Massimiliano Di Penta; Federica Pepe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. it requires to unzip the `modelsInfo.zip` in a directory with the same name (`modelsInfo`) at the root of the replication package folder. Produces the output to stdout. To redirect in a file fo be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`

## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
m
DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS
data.mendeley.com
narcis.nl
+1more
Updated Mar 12, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabian Constante (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.3
Explore at:
Unique identifier
https://doi.org/10.17632/8gx2fvg2k6.3
Dataset updated
Mar 12, 2019
Authors
Fabian Constante
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A DataSet of Supply Chains used by the company DataCo Global was used for the analysis. Dataset of Supply Chain , which allows the use of Machine Learning Algorithms and R Software. Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.It also allows the correlation of Structured Data with Unstructured Data for knowledge generation.

Type Data : Structured Data : DataCoSupplyChainDataset.csv Unstructured Data : tokenized_access_logs.csv (Clickstream)

Types of Products : Clothing , Sports , and Electronic Supplies

Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.
Dataset 2: Underlying data of the presented results in csv files
figshare.com
xls
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kunal Anand (2024). Dataset 2: Underlying data of the presented results in csv files [Dataset]. http://doi.org/10.6084/m9.figshare.25683600.v2
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25683600.v2
Dataset updated
Nov 25, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Kunal Anand
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The presented tables belong to the research that implemented several widely used optimization algorithms, such as GA, PSO, DE, and ACO, including the proposed method using supervised learning classifiers such as NB, DT, KNN, and QDA to gauge accuracy. The obtained results went through statistical analysis (Friedman test and Holm procedure), to identify the best predictive model with higher performance while selecting the optimal set of features correlated to alternately studied FS procedures.
m
iris-csv
data.mendeley.com
Updated Oct 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KC Tung (2020). iris-csv [Dataset]. http://doi.org/10.17632/7xwsksdpy3.1
Explore at:
Unique identifier
https://doi.org/10.17632/7xwsksdpy3.1
Dataset updated
Oct 8, 2020
Authors
KC Tung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Iris dataset from open source.

Facebook

Twitter

Click to copy link

Link copied

Cite

Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1

Datasets

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

zipAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.12958037.v1

Dataset updated

May 31, 2023

Dataset provided by

figshare
Figsharehttp://figshare.com/

Authors

Bastian Eichenberger; YinXiu Zhan

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.

Clear search

Close search

Google apps

Main menu

Datasets

Images in CSV datasets

"mnist_big.csv"

"sign_mnist_big.csv"

"zalando_small.csv"

"hmnist_8_8_RGB.csv"

"cifar10_small.csv"

Machine Learning Dataset

Top 2500 Kaggle Datasets

UCI and OpenML Data Sets for Ordinal Quantification

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Multiple Machine Learning Datasets

Network traffic and code for machine learning classification

Disease Prediction Using Machine Learning

About this Dataset

Category

Keywords

Row Count

Price

Code4ML 2.0

Code4ML 2.0 Enhancements

Applications

Iris dataset for machine learning

PFAM Protein Families Dataset for Machine Learning

D:\machine learning\Datasets BITS\usedcarsales.csv

Dataset

Contents

Best Books Ever Dataset

Multi-Laboratory Hematoxylin and Eosin Staining Variance Unsupervised...

Machine learning model that estimates total monthly and annual per capita...

Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

Dataset 2: Underlying data of the presented results in csv files

iris-csv

Datasets