100+ datasets found

Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification...
frontiersin.figshare.com
pdf
Updated May 30, 2023
Cite
Taeho Jo; Kwangsik Nho; Andrew J. Saykin (2023). Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification and Prognostic Prediction Using Neuroimaging Data.pdf [Dataset]. http://doi.org/10.3389/fnagi.2019.00220.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fnagi.2019.00220.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Taeho Jo; Kwangsik Nho; Andrew J. Saykin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Deep learning, a state-of-the-art machine learning approach, has shown outstanding performance over traditional machine learning in identifying intricate structures in complex high-dimensional data, especially in the domain of computer vision. The application of deep learning to early detection and automated classification of Alzheimer's disease (AD) has recently gained considerable attention, as rapid progress in neuroimaging techniques has generated large-scale multimodal neuroimaging data. A systematic review of publications using deep learning approaches and neuroimaging data for diagnostic classification of AD was performed. A PubMed and Google Scholar search was used to identify deep learning papers on AD published between January 2013 and July 2018. These papers were reviewed, evaluated, and classified by algorithm and neuroimaging type, and the findings were summarized. Of 16 studies meeting full inclusion criteria, 4 used a combination of deep learning and traditional machine learning approaches, and 12 used only deep learning approaches. The combination of traditional machine learning for classification and stacked auto-encoder (SAE) for feature selection produced accuracies of up to 98.8% for AD classification and 83.7% for prediction of conversion from mild cognitive impairment (MCI), a prodromal stage of AD, to AD. Deep learning approaches, such as convolutional neural network (CNN) or recurrent neural network (RNN), that use neuroimaging data without pre-processing for feature selection have yielded accuracies of up to 96.0% for AD classification and 84.2% for MCI conversion prediction. The best classification performance was obtained when multimodal neuroimaging and fluid biomarkers were combined. Deep learning approaches continue to improve in performance and appear to hold promise for diagnostic classification of AD using multimodal neuroimaging data. 
AD research that uses deep learning is still evolving, improving performance by incorporating additional hybrid data types, such as -omics data, and increasing transparency with explainable approaches that add knowledge of specific disease-related features and mechanisms.
Gender Classification Dataset
kaggle.com
Updated Oct 6, 2020
Cite
Jifry Issadeen (2020). Gender Classification Dataset [Dataset]. https://www.kaggle.com/elakiricoder/gender-classification-dataset/activity
Dataset updated
Oct 6, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Jifry Issadeen
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

While practicing machine learning, I wanted to create a simple dataset that closely resembles a real-world scenario and gives good results, to whet my appetite for this domain. If you are a beginner who wants to try solving classification problems in machine learning and you would like to see good results early, this dataset is a great place to start.

Content

This dataset contains 7 features and a label column.

long_hair - 1 is "long hair" and 0 is "not long hair".
forehead_width_cm - The width of the forehead, in centimetres.
forehead_height_cm - The height of the forehead, in centimetres.
nose_wide - 1 is "wide nose" and 0 is "not wide nose".
nose_long - 1 is "long nose" and 0 is "not long nose".
lips_thin - 1 is "thin lips" and 0 is "not thin lips".
distance_nose_to_lip_long - 1 is "long distance between nose and lips" and 0 is "short distance between nose and lips".

gender - This is either "Male" or "Female".
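
As a rough illustration of the schema described above, the following sketch converts one CSV row into a feature vector and a label. The column names come from the description; the example row, the feature ordering, and the helper name `parse_row` are assumptions made for this sketch and are not part of the dataset.

```python
# Column names as listed in the dataset description.
BINARY_COLUMNS = [
    "long_hair", "nose_wide", "nose_long",
    "lips_thin", "distance_nose_to_lip_long",
]
NUMERIC_COLUMNS = ["forehead_width_cm", "forehead_height_cm"]
LABELS = {"Male", "Female"}


def parse_row(row):
    """Convert one CSV row (a dict of strings) into (features, label)."""
    features = []
    for name in BINARY_COLUMNS:
        value = int(row[name])
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value}")
        features.append(float(value))
    for name in NUMERIC_COLUMNS:
        features.append(float(row[name]))
    label = row["gender"]
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return features, label


# Made-up example row; the values are illustrative, not taken from the dataset.
example = {
    "long_hair": "1", "forehead_width_cm": "12.5", "forehead_height_cm": "5.8",
    "nose_wide": "0", "nose_long": "0", "lips_thin": "1",
    "distance_nose_to_lip_long": "0", "gender": "Female",
}
features, label = parse_row(example)
```

Validating the binary columns up front makes a malformed row fail loudly instead of silently skewing a model.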

Acknowledgements

Nothing to acknowledge, as this is just made-up data.

Inspiration

It's painful to see bad results at the beginning. Don't begin with complicated datasets if you are a beginner. I'm sure that this dataset will encourage you to proceed further in the domain. Good luck.
Data from: Classification of Heart Failure Using Machine Learning: A...
data.mendeley.com
Updated Oct 29, 2024
Cite
Bryan Chulde (2024). Classification of Heart Failure Using Machine Learning: A Comparative Study [Dataset]. http://doi.org/10.17632/959dxmgj8d.1
Unique identifier
https://doi.org/10.17632/959dxmgj8d.1
Dataset updated
Oct 29, 2024
Authors
Bryan Chulde
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Our research demonstrates that machine learning algorithms can effectively predict heart failure, highlighting high-accuracy models that improve detection and treatment. The Kaggle “Heart Failure” dataset, with 918 instances and 12 key features, was preprocessed to remove outliers and features a distribution of cases with and without heart disease (508 and 410). Five models were evaluated: the random forest achieved the highest accuracy (92%) and was consolidated as the most effective at classifying cases. Logistic regression and multilayer perceptron were also quite accurate (89%), while decision tree and k-nearest neighbors performed less well, showing that k-neighbors is less suitable for this data. F1 scores confirmed the random forest as the optimal one, benefiting from preprocessing and hyperparameter tuning. The data analysis revealed that age, blood pressure and cholesterol correlate with disease risk, suggesting that these models may help prioritize patients at risk and improve their preventive management. The research underscores the potential of these models in clinical practice to improve diagnostic accuracy and reduce costs, supporting informed medical decisions and improving health outcomes.
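
To put the reported accuracies in context, the sketch below computes the majority-class baseline implied by the class counts quoted above (508 cases with and 410 without heart disease). It is a worked piece of arithmetic only, not part of the authors' analysis, and the variable names are chosen for illustration.

```python
# Class counts quoted in the description.
with_disease = 508
without_disease = 410
total = with_disease + without_disease  # 918 instances, matching the description

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(with_disease, without_disease) / total

# Reported accuracy of the best model (random forest), from the description.
random_forest_accuracy = 0.92
margin_over_baseline = random_forest_accuracy - majority_baseline
```

Because the classes are fairly balanced, the baseline is only a little above one half, so the reported accuracies are well above what always guessing the majority class would give.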
Facies-Mark: A Machine Learning Benchmark for Facies Classification
ieee-dataport.org
Updated Feb 18, 2020
Cite
Oluwaseun Aribido (2020). Facies-Mark: A Machine Learning Benchmark for Facies Classification [Dataset]. https://ieee-dataport.org/open-access/facies-mark-machine-learning-benchmark-facies-classification
Dataset updated
Feb 18, 2020
Authors
Oluwaseun Aribido
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
such as facies classification
LLM prompts in the context of machine learning
kaggle.com
Updated Jul 1, 2024
Cite
Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
Dataset updated
Jul 1, 2024
Dataset provided by
Kaggle
Authors
Jordan Nelson
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach. In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously. This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving impressive training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was rigorously evaluated in a production environment, where it continued to demonstrate exceptional accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar to ensure that the model can effectively distinguish between different machine learning models, even when the prompts are closely related.

KNN
How would you create a KNN model to classify emails as spam or not spam based on their content and metadata?
How could you implement a KNN model to classify handwritten digits using the MNIST dataset?
How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences?
How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms etc?
Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width?
How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments?
Can you create a KNN model for me that could be used in malware classification?
Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic?
Can you make a KNN model that would predict the stock price of a given stock for the next week?
Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

Decision Tree
Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me?
How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content?
What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses?
Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data?
In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used?
Can you create a decision tree classifier for me?
Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies?
Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget?
How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website?
How do I create a decision tree for time-series forecasting?

Random Forest
Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me?
In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...
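
The binary bag-of-words encoding described in this entry can be sketched as follows. This is a minimal illustration, not the author's pipeline: the tokenizer, the helper names `tokenize`, `build_dictionary` and `encode`, and the two-prompt corpus (both prompts are quoted from the lists above) are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(prompts):
    """Sorted list of every distinct word that occurs in the corpus."""
    return sorted({word for prompt in prompts for word in tokenize(prompt)})


def encode(prompt, dictionary):
    """One binary attribute per dictionary word: 1 if present, 0 if absent."""
    words = set(tokenize(prompt))
    return [1 if word in words else 0 for word in dictionary]


# Two prompts taken from the lists above, each paired with its single class label.
corpus = [
    ("Can you create a KNN model for me that could be used in malware classification?", "KNN"),
    ("Can you create a decision tree classifier for me?", "Decision Tree"),
]
dictionary = build_dictionary([prompt for prompt, _ in corpus])

# Each entry is the binary attribute vector followed by the label at the end.
rows = [encode(prompt, dictionary) + [label] for prompt, label in corpus]
```

Each row carries exactly one label, which matches the multi-class (not multi-label) framing of the description.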
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
+ more versions
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Dataset updated
Feb 20, 2020
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
Bulk data transfer: applications that transfer large files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository, among others.
Web browsing: contains all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university's Moodle.
Video playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
Idle behaviour: consists of the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.

The amount of data is stated as follows:

Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
Video: 23 traces, 4496 s, 1405 MBytes
Web: 23 traces, 4203 s, 148 MBytes
Interactive: 42 traces, 8934 s, 30.5 MBytes
Idle: 52 traces, 6341 s, 0.69 MBytes

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
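
As a rough sketch of how the per-packet CSV files described above might be read, the example below sums payload bytes per protocol. The description lists the fields but not their exact header spellings, so the column names, the sample rows, and the function name `payload_bytes_by_protocol` are all hypothetical; the README.txt shipped with the dataset is the authoritative reference.

```python
import csv
import io
from collections import Counter

# Hypothetical header and made-up packets: the source lists the fields
# (timestamp, protocol, payload size, IPs, ports) but not their exact names.
SAMPLE = """timestamp,protocol,payload_size,ip_src,ip_dst,port_src,port_dst
0.000,TCP,1460,10.0.0.2,192.0.2.10,50514,443
0.013,UDP,120,10.0.0.2,192.0.2.53,40000,53
0.020,TCP,1460,192.0.2.10,10.0.0.2,443,50514
"""


def payload_bytes_by_protocol(csv_text):
    """Sum the payload size of every packet, grouped by transport protocol."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["protocol"]] += int(row["payload_size"])
    return dict(totals)


totals = payload_bytes_by_protocol(SAMPLE)
```

With real files, the `io.StringIO` wrapper would be replaced by an opened trace file, and the activity label would be taken from the filename or from mapping.csv, as the description states.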
Text Classification Dataset
opendatabay.com
Updated Jun 6, 2025
Cite
Opendatabay (2025). Text Classification Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1775ad0d-be0d-49c9-bbc1-f94a8a5c8355
Dataset updated
Jun 6, 2025
Dataset provided by
Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
Authors
Opendatabay
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Education & Learning Analytics
Description
A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.

Dataset Features

1. text: Contains individual English-language comments or posts sourced from various online platforms.

2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:

0: Negative sentiment
1: Neutral sentiment
2: Positive sentiment

Distribution

Format: CSV (Comma-Separated Values)

2 Columns:
text: The comment content
label: Sentiment classification (0 = Negative, 1 = Neutral, 2 = Positive)

File Size: Approximately 23.9 MB

Structure: Each row contains a single comment and its corresponding sentiment label.
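
The label encoding above can be decoded and summarized with a few lines of code. The sketch below is only an illustration under that encoding; the helper name `label_distribution` and the four example labels are made up and do not reflect the dataset's actual class balance.

```python
from collections import Counter

# Encoding stated in the description: 0 = negative, 1 = neutral, 2 = positive.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def label_distribution(labels):
    """Return the share of each sentiment class in a list of integer labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(LABEL_NAMES[label] for label in labels)
    total = len(labels)
    return {name: counts[name] / total for name in LABEL_NAMES.values()}


# Made-up labels for illustration only.
dist = label_distribution([0, 2, 2, 1])
```

Checking the class shares before training is a cheap way to decide whether accuracy alone is a meaningful metric for a three-class sentiment task.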

Usage

This dataset is ideal for a variety of applications:

1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.

2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.

3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.

Coverage

Geographic Coverage: Primarily English-language content from global online platforms

Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.

Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.

License

CC0

Who Can Use It

Data Scientists: For training machine learning models.

Researchers: For academic or scientific studies.

Businesses: For analysis, insights, or AI development.
Benchmark dataset for graph classification
search.dataone.org
dataverse.azure.uit.no
+1 more
Updated Jan 5, 2024
Cite
Bianchi, Filippo Maria (2024). Benchmark dataset for graph classification [Dataset]. http://doi.org/10.18710/TIZ9II
Unique identifier
https://doi.org/10.18710/TIZ9II
Dataset updated
Jan 5, 2024
Dataset provided by
DataverseNO
Authors
Bianchi, Filippo Maria
Description
This repository contains datasets to quickly test graph classification algorithms, such as Graph Kernels and Graph Neural Networks. The dataset is constructed so that the node features and the adjacency matrix are each completely uninformative when considered alone. Therefore, an algorithm that relies only on the node features or only on the graph structure will fail to achieve good classification results. A more detailed description of the dataset construction can be found on the GitHub page (https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification), in the README.txt file, and in the original publication: Bianchi, Filippo Maria, Claudio Gallicchio, and Alessio Micheli. "Pyramidal Reservoir Graph Neural Network." Neurocomputing 470 (2022): 389-404.
Gender Detection & Classification - Face Dataset
kaggle.com
Updated Oct 31, 2023
Cite
Training Data (2023). Gender Detection & Classification - Face Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/gender-detection-and-classification-image-dataset
Dataset updated
Oct 31, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Gender Detection & Classification - face recognition dataset

The dataset is created on the basis of Face Mask Detection dataset

Dataset Description:

The dataset comprises a collection of photos of people, organized into folders labeled "women" and "men." Each folder contains a significant number of images to facilitate training and testing of gender detection algorithms or models.

The dataset contains a variety of images capturing female and male individuals from diverse backgrounds, age groups, and ethnicities.


This labeled dataset can be utilized as training data for machine learning models, computer vision applications, and gender detection algorithms.

💴 For Commercial Usage: Full version of the dataset includes 376 000+ photos of people, leave a request on TrainingData to buy the dataset

Metadata for the full dataset:

assignment_id - unique identifier of the media file

worker_id - unique identifier of the person

age - age of the person

true_gender - gender of the person

country - country of the person

ethnicity - ethnicity of the person

photo_1_extension, photo_2_extension, photo_3_extension, photo_4_extension - photo extensions in the dataset

photo_1_resolution, photo_2_resolution, photo_3_resolution, photo_4_resolution - photo resolution in the dataset

OTHER BIOMETRIC DATASETS:

Anti Spoofing Real Dataset

Antispoofing Replay Dataset

Selfies, ID Images dataset (5591 sets of 15 files)

Selfies and video dataset (4 052 sets)

Dataset of bald people, 5000 images

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset

Content

The dataset is split into train and test folders. Each folder includes:
- folders women and men: folders with images of people of the corresponding gender
- .csv file: contains information about the images and people in the dataset

File with the extension .csv

file: link to access the file,

gender: gender of a person in the photo (woman/man),

split: classification on train and test

TrainingData provides high-quality data annotation tailored to your needs

keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, gender detection, supervised learning dataset, gender classification dataset, gender recognition dataset
Data from: Metadata Classification Machine Learning Data
osti.gov
Updated Sep 18, 2024
Cite
Collier, Hannah; Enright, Eric (2024). Metadata Classification Machine Learning Data [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2446583
Dataset updated
Sep 18, 2024
Dataset provided by
United States Department of Energy (http://energy.gov/)
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Atmospheric Radiation Measurement (ARM) Archive; Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Atmospheric Radiation Measurement (ARM) Data Center
Authors
Collier, Hannah; Enright, Eric
Description
This GitLab project contains the training data that was used for the metadata machine learning classification project.
‘Job Classification Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 30, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Job Classification Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-job-classification-dataset-151c/03ce55a1/?iid=038-911&v=presentation
Dataset updated
Sep 30, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Job Classification Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/HRAnalyticRepository/job-classification-dataset on 30 September 2021.

--- Dataset description provided by original source is as follows ---

Context

This is a dataset containing some fictional job class specs information. Typically, job class specs contain information that characterizes the job class (its features) and a label (in this case a pay grade), which is the value to predict from the features.

Content

The data is a static snapshot. The contents are:
ID column - a sequential number
Job Family ID
Job Family Description
Job Class ID
Job Class Description
PayGrade - numeric
Education Level
Experience
Organizational Impact
Problem Solving
Supervision
Contact Level
Financial Budget
PG - Alpha label for PayGrade

Acknowledgements

This data is purely fictional

Inspiration

The intent is to use machine learning classification algorithms to predict PG from Educational level through to Financial budget information.

Typically job classification in HR is time consuming and cumbersome as a manual activity. The intent is to show how machine learning and People Analytics can be brought to bear on this task.

--- Original source retains full ownership of the source dataset ---
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 25, 2023
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
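
Binning a continuous regression label into ordinal classes can be sketched as below. The thresholds and the 0-based class indices are hypothetical choices for this illustration; the actual bin edges and label values are defined by the provided extraction scripts.

```python
from bisect import bisect_right


def to_ordinal(value, inner_edges):
    """Index of the bin that `value` falls into; edges must be sorted ascending."""
    return bisect_right(inner_edges, value)


# Hypothetical thresholds: three inner edges produce four ordered classes (0..3).
edges = [10.0, 20.0, 30.0]
labels = [to_ordinal(value, edges) for value in [3.5, 10.0, 25.0, 99.0]]
```

Because the classes come from consecutive intervals of one continuous variable, neighboring classes are naturally similar, which is the property ordinal quantification relies on.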

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
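
One common way to realize "all possible distributions of labels are drawn with an equal probability" is to sample class prevalences uniformly from the probability simplex. The sketch below does this with sorted uniform cut points; it is an interpretation for illustration, and the authors' Julia implementation (and the smoothness criterion used by APP-OQ, which is not specified here) may differ. The function name `sample_prevalences` is made up.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    # Sorted uniform cut points split [0, 1] into n_classes segments; the
    # segment lengths are uniformly distributed over the simplex.
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    bounds = [0.0] + cuts + [1.0]
    return [high - low for low, high in zip(bounds, bounds[1:])]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
prevalences = sample_prevalences(5, rng)
```

A sample of data items can then be drawn so that each class appears in roughly the sampled proportion, which is the role played by the index files described above.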

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Data for: Deep Learning Classification of Lake Zooplankton - Package - ERIC
opendata.eawag.ch
Updated Aug 12, 2021
+ more versions
Cite
(2021). Data for: Deep Learning Classification of Lake Zooplankton - Package - ERIC [Dataset]. https://opendata.eawag.ch/dataset/deep-learning-classification-of-zooplankton-from-lakes
Dataset updated
Aug 12, 2021
Description
Plankton are effective indicators of environmental change and ecosystem health in freshwater habitats, but collection of plankton data using manual microscopic methods is extremely labor-intensive and expensive. Automated plankton imaging offers a promising way forward to monitor plankton communities with high frequency and accuracy in real-time. Yet, manual annotation of millions of images poses a serious challenge to taxonomists. Deep learning classifiers have been successfully applied in various fields and provided encouraging results when used to categorize marine plankton images. Here, we present a set of deep learning models developed for the identification of lake plankton, and study several strategies to obtain optimal performance, which lead to operational prescriptions for users. To this aim, we annotated into 35 classes over 17900 images of zooplankton and large phytoplankton colonies, detected in Lake Greifensee (Switzerland) with the Dual Scripps Plankton Camera. Our best models were based on transfer learning and ensembling, which classified plankton images with 98% accuracy and 93% F1 score. When tested on freely available plankton datasets produced by other automated imaging tools (ZooScan, FlowCytobot and ISIIS), our models performed better than previously used models. Our annotated data, code and classification models are freely available online.
Fruit Tabular Classification Dataset
cubig.ai
Updated Jul 8, 2025
Cite
CUBIG (2025). Fruit Tabular Classification Dataset [Dataset]. https://cubig.ai/store/products/563/fruit-tabular-classification-dataset
Dataset updated
Jul 8, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Fruit Classification Dataset is a beginner classification dataset configured to classify fruit types based on fruit name, color, and weight information.

2) Data Utilization
(1) Characteristics of the Fruit Classification Dataset:
• This dataset consists of three columns: the categorical variable Color, the continuous variable Weight, and the target class Fruit, so you can practice pre-processing categorical and numerical variables when training classification models.
(2) Uses of the Fruit Classification Dataset:
• Model training and evaluation: It can be used as educational and research experimental data to compare and evaluate the performance of various machine learning classification algorithms using color and weight characteristics.
• Data preprocessing practice: It can be used as hands-on data for learning basic data preprocessing and feature engineering, such as categorical variable encoding and continuous variable scaling.
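
The two preprocessing steps named above, categorical encoding and continuous scaling, can be sketched in plain Python as follows. One-hot encoding and min-max scaling are chosen here as common examples; the description does not prescribe specific methods, and the colors, weights, and helper names are made up for illustration.

```python
def one_hot(values):
    """One-hot encode a list of categories; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    encoded = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, encoded


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(value - low) / (high - low) for value in values]


# Made-up Color and Weight columns, not taken from the dataset.
colors = ["red", "yellow", "red", "green"]
weights = [150.0, 120.0, 170.0, 100.0]

categories, encoded = one_hot(colors)
scaled = min_max_scale(weights)

# Final feature rows: one-hot color columns followed by the scaled weight.
features = [enc + [weight] for enc, weight in zip(encoded, scaled)]
```

In practice the category list and the min/max values would be fitted on the training split only and reused for the test split, to avoid leaking test information into preprocessing.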
notMNIST
kaggle.com
opendatalab.com
+3 more
Updated Feb 14, 2018
Cite
jwjohnson314 (2018). notMNIST [Dataset]. https://www.kaggle.com/datasets/jwjohnson314/notmnist/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 14, 2018
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
jwjohnson314
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The MNIST dataset is one of the best-known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. It is a multiclass classification dataset of glyphs of the English letters A - J.

This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.

Content

notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version of the dataset with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained on his blog. According to this blog entry, there is about a 6.5% label error rate on the large uncleaned dataset and a 0.5% label error rate on the small hand-cleaned dataset.

The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
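Because the images are organized into one directory per letter, the directory name can serve as the label. The following is a minimal sketch of that idea, assuming an archive has already been extracted to a local folder; the function names `index_by_letter` and `count_per_label` and the example folder name are hypothetical, and no assumption is made about the image file extension.

```python
from pathlib import Path


def index_by_letter(root):
    """Return sorted (image_path, label) pairs; the label is the letter directory name."""
    samples = []
    for letter_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(letter_dir.iterdir()):
            if image_path.is_file():
                samples.append((image_path, letter_dir.name))
    return samples


def count_per_label(samples):
    """Count how many indexed files fall under each label."""
    counts = {}
    for _, label in samples:
        counts[label] = counts.get(label, 0) + 1
    return counts


# Example usage (assumes a hypothetical extracted folder named "notMNIST_small"):
# samples = index_by_letter("notMNIST_small")
# print(count_per_label(samples))
```

Checking the per-letter counts first is a cheap way to spot an incomplete extraction before any image decoding is attempted.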

Acknowledgements

Thanks to Yaroslav Bulatov for putting together the dataset.
A collection of nine multi-label text classification datasets
ieee-dataport.org
Updated Nov 4, 2024
Cite
Yiming Wang (2024). A collection of nine multi-label text classification datasets [Dataset]. https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets
Explore at:
Dataset updated
Nov 4, 2024
Authors
Yiming Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
RCV1
Data_Sheet_1_Benchmarking framework for machine learning classification from...
frontiersin.figshare.com
zip
Updated Jun 3, 2023
Cite
Johann Benerradi; Jeremie Clos; Aleksandra Landowska; Michel F. Valstar; Max L. Wilson (2023). Data_Sheet_1_Benchmarking framework for machine learning classification from fNIRS data.zip [Dataset]. http://doi.org/10.3389/fnrgo.2023.994969.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.3389/fnrgo.2023.994969.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Johann Benerradi; Jeremie Clos; Aleksandra Landowska; Michel F. Valstar; Max L. Wilson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: While efforts to establish best practices with functional near infrared spectroscopy (fNIRS) signal processing have been published, there are still no community standards for applying machine learning to fNIRS data. Moreover, the lack of open source benchmarks and standard expectations for reporting means that published works often claim high generalisation capabilities, but with poor practices or missing details in the paper. These issues make it hard to evaluate the performance of models when it comes to choosing them for brain-computer interfaces.
Methods: We present an open-source benchmarking framework, BenchNIRS, to establish a best practice machine learning methodology to evaluate models applied to fNIRS data, using five open access datasets for brain-computer interface (BCI) applications. The BenchNIRS framework, using a robust methodology with nested cross-validation, enables researchers to optimise models and evaluate them without bias. The framework also enables us to produce useful metrics and figures to detail the performance of new models for comparison. To demonstrate the utility of the framework, we present a benchmarking of six baseline models [linear discriminant analysis (LDA), support-vector machine (SVM), k-nearest neighbours (kNN), artificial neural network (ANN), convolutional neural network (CNN), and long short-term memory (LSTM)] on the five datasets and investigate the influence of different factors on the classification performance, including the number of training examples and the size of the time window of each fNIRS sample used for classification. We also present results with a sliding window as opposed to simple classification of epochs, and with a personalised approach (within subject data classification) as opposed to a generalised approach (unseen subject data classification).
Results and discussion: Results show that the performance is typically lower than the scores often reported in literature, and without great differences between models, highlighting that predicting unseen data remains a difficult task. Our benchmarking framework provides future authors, who are achieving significantly high classification scores, with a tool to demonstrate the advances in a comparable way. To complement our framework, we contribute a set of recommendations for methodology decisions and writing papers, when applying machine learning to fNIRS data.
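The nested cross-validation mentioned in the description can be illustrated with a generic sketch. This is not the BenchNIRS API: the function names, the interleaved fold assignment, and the `fit_and_score` callback are all assumptions made for illustration. The point it shows is that hyperparameter candidates are chosen on inner folds drawn only from the outer training data, so the outer test fold never influences model selection.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k (train, test) pairs with disjoint test folds."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def nested_cv(n_samples, outer_k, inner_k, candidates, fit_and_score):
    """Return one outer-test score per outer fold.

    fit_and_score(candidate, train_idx, test_idx) is a user-supplied callback
    (hypothetical here) that trains on train_idx and returns a score on
    test_idx, where higher is better.
    """
    outer_scores = []
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):

        def inner_mean(candidate):
            # Inner loop: evaluate a candidate using only the outer-training data.
            scores = [
                fit_and_score(candidate, train, valid)
                for train, valid in k_fold_indices(outer_train, inner_k)
            ]
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_mean)
        # Outer loop: score the selected candidate on data unused for selection.
        outer_scores.append(fit_and_score(best, outer_train, outer_test))
    return outer_scores
```

For an unseen-subject evaluation like the generalised approach described above, one plausible adaptation is to split subject identifiers rather than individual samples, so that all data from a subject stays in one fold.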
Trained models for multi-task multi-dataset learning for text classification...
databank.illinois.edu
Updated Aug 4, 2020
Cite
Shubhanshu Mishra (2020). Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets [Dataset]. http://doi.org/10.13012/B2IDB-1094364_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-1094364_V1
Dataset updated
Aug 4, 2020
Authors
Shubhanshu Mishra
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridictality. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datadryad.org
+1 more
zip
Updated May 23, 2023
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R2 score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods: Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
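The two metrics quoted in the description, R2 for the regressors and accuracy for the classifiers, follow standard definitions that can be sketched as below. The description does not say how the study computed them, so this assumes the usual formulas; the function names and any inputs are hypothetical. On this reading, the reported percentages correspond to values of 0.9745 and 0.7428.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(labels_true, labels_pred):
    """Fraction of predicted class labels that match the true labels."""
    correct = sum(t == p for t, p in zip(labels_true, labels_pred))
    return correct / len(labels_true)
```

A model that always predicts the mean of the targets scores an R2 of 0, which gives a useful baseline when reading a figure such as 97.45%.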
Data from: EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use...
zenodo.org
explore.openaire.eu
+1 more
zip
Updated Mar 10, 2023
+ more versions
Cite
Patrick Helber; Benjamin Bischke; Andreas Dengel; Damian Borth (2023). EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification [Dataset]. http://doi.org/10.5281/zenodo.7711097
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7711097
Dataset updated
Mar 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Patrick Helber; Benjamin Bischke; Andreas Dengel; Damian Borth
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
EuroSAT is a land use and land cover classification dataset. The dataset is based on Sentinel-2 satellite imagery covering 13 spectral bands and consists of 10 LULC classes with a total of 27,000 labeled and geo-referenced images. The dataset is associated with the publications "Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification" and "EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification".

EuroSAT_RGB.zip contains the RGB version of the dataset, which includes the optical R, G and B frequency bands encoded as JPEG images.

EuroSAT_MS.zip contains the multi-spectral version of the EuroSAT dataset, which includes all 13 Sentinel-2 bands in the original value range.

Cite

Taeho Jo; Kwangsik Nho; Andrew J. Saykin (2023). Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification and Prognostic Prediction Using Neuroimaging Data.pdf [Dataset]. http://doi.org/10.3389/fnagi.2019.00220.s001

Data_Sheet_1_Deep Learning in Alzheimer's Disease: Diagnostic Classification and Prognostic Prediction Using Neuroimaging Data.pdf

Explore at:

Available download formats: pdf

Unique identifier

https://doi.org/10.3389/fnagi.2019.00220.s001

Dataset updated

May 30, 2023

Dataset provided by

Frontiers

Authors

Taeho Jo; Kwangsik Nho; Andrew J. Saykin

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Deep learning, a state-of-the-art machine learning approach, has shown outstanding performance over traditional machine learning in identifying intricate structures in complex high-dimensional data, especially in the domain of computer vision. The application of deep learning to early detection and automated classification of Alzheimer's disease (AD) has recently gained considerable attention, as rapid progress in neuroimaging techniques has generated large-scale multimodal neuroimaging data. A systematic review of publications using deep learning approaches and neuroimaging data for diagnostic classification of AD was performed. A PubMed and Google Scholar search was used to identify deep learning papers on AD published between January 2013 and July 2018. These papers were reviewed, evaluated, and classified by algorithm and neuroimaging type, and the findings were summarized. Of 16 studies meeting full inclusion criteria, 4 used a combination of deep learning and traditional machine learning approaches, and 12 used only deep learning approaches. The combination of traditional machine learning for classification and stacked auto-encoder (SAE) for feature selection produced accuracies of up to 98.8% for AD classification and 83.7% for prediction of conversion from mild cognitive impairment (MCI), a prodromal stage of AD, to AD. Deep learning approaches, such as convolutional neural network (CNN) or recurrent neural network (RNN), that use neuroimaging data without pre-processing for feature selection have yielded accuracies of up to 96.0% for AD classification and 84.2% for MCI conversion prediction. The best classification performance was obtained when multimodal neuroimaging and fluid biomarkers were combined. Deep learning approaches continue to improve in performance and appear to hold promise for diagnostic classification of AD using multimodal neuroimaging data. 
AD research that uses deep learning is still evolving, improving performance by incorporating additional hybrid data types, such as omics data, and increasing transparency with explainable approaches that add knowledge of specific disease-related features and mechanisms.
