93 datasets found
  1. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
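    A minimal scikit-learn sketch tying several of these concepts together (the train/test split from point 3, a supervised classifier, and the evaluation metrics from point 5) on a built-in toy dataset:

    ```python
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.model_selection import train_test_split

    # Labeled data: features (X) and target labels (y) for supervised learning.
    X, y = load_breast_cancer(return_X_y=True)

    # Split into training data (to learn from) and test data (to check generalization).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    y_pred = model.predict(X_test)   # predictions on unseen examples

    print("accuracy: ", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("F1:       ", f1_score(y_test, y_pred))
    ```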

  2. MLRegTest: A benchmark for the machine learning of regular languages

    • dataone.org
    • search.dataone.org
    • +1more
    Updated Jul 22, 2025
    Cite
    Sam van der Poel; Dakotah Lambert; Kalina Kostyszyn; Tiantian Gao; Rahul Verma; Derek Andersen; Joanne Chau; Emily Peterson; Cody St. Clair; Paul Fodor; Chihiro Shibata; Jeffrey Heinz (2025). MLRegTest: A benchmark for the machine learning of regular languages [Dataset]. http://doi.org/10.5061/dryad.dncjsxm4h
    Explore at:
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Sam van der Poel; Dakotah Lambert; Kalina Kostyszyn; Tiantian Gao; Rahul Verma; Derek Andersen; Joanne Chau; Emily Peterson; Cody St. Clair; Paul Fodor; Chihiro Shibata; Jeffrey Heinz
    Description

    MLRegTest is a benchmark for machine learning systems on sequence classification, which contains training, development, and test sets from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. The languages were generated by creating finite-state acceptors, and the datasets were generated by sampling from these finite-state acceptors. The scripts and software used for these processes are open source and available. For details, see https://github.com/heinz-jeffrey/subregular-learning. Details are described in the arXiv preprint "MLRegTest: A Benchmark for the Machine Learning of Regular Languages".

    https://doi.org/10.5061/dryad.dncjsxm4h

    MLRegTest provides training and testing data for 1800 regular languages.

    This repository contains three gzipped tar archives.

    > data.tar.gz (21GB)
    > languages.tar.gz (4.5MB)
    > models.tar.gz (76GB)

    When uncompressed, these yield three directories, described in detail below.

    > data (43GB)
    > languages (38MB)
    > models (87GB)

    Languages

    Languages are named according to the scheme Sigma.Tau.class.k.t.i.plebby, where Sigma is a two-digit alphabet size, Tau a two-digit number of salient symbols (the 'tier'), class the named subregular class, k the width of factors used (if applicable), t the threshold counted to (if applicable), and i a unique identifier. The table below unabbreviates the class names, and shows how many languages of each class there are.

    | class | name | amo...,
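    The naming scheme above is regular enough to parse mechanically. A minimal sketch (the example filename and this helper are illustrative only, not part of the MLRegTest tooling):

    ```python
    def parse_language_name(filename: str) -> dict:
        """Split a Sigma.Tau.class.k.t.i.plebby name into its named fields."""
        stem = filename.removesuffix(".plebby")
        sigma, tau, cls, k, t, i = stem.split(".")
        return {
            "alphabet_size": int(sigma),   # two-digit alphabet size
            "tier_size": int(tau),         # two-digit number of salient symbols
            "class": cls,                  # named subregular class
            "k": int(k),                   # width of factors (if applicable)
            "t": int(t),                   # threshold counted to (if applicable)
            "id": i,                       # unique identifier
        }

    # Hypothetical example filename following the scheme:
    print(parse_language_name("04.02.SL.2.0.0.plebby"))
    ```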

  3. NYC_building_energy_data

    • kaggle.com
    zip
    Updated Mar 4, 2020
    Cite
    Maksym Dubovyi (2020). NYC_building_energy_data [Dataset]. https://www.kaggle.com/maxbrain/nyc-building-energy-data
    Explore at:
    Available download formats: zip (9476304 bytes)
    Dataset updated
    Mar 4, 2020
    Authors
    Maksym Dubovyi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    New York
    Description

    In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:

    Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.

    This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.

    • Supervised problem: we are given both the features and the target.
    • Regression problem: the target is a continuous variable, in this case ranging from 0-100.

    During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!

    Machine Learning Workflow

    Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:

    1. Data cleaning and formatting
    2. Exploratory data analysis
    3. Feature engineering and selection
    4. Establish a baseline and compare several machine learning models on a performance metric
    5. Perform hyperparameter tuning on the best model to optimize it for the problem
    6. Evaluate the best model on the testing set
    7. Interpret the model results to the extent possible
    8. Draw conclusions and write a well-documented report

    Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure, so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel that once you have finished one step the first time, you cannot go back and make improvements!

    This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

    With this outline in place to guide us, let's get started!
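    As a rough illustration of the baseline-then-model step in the workflow above, a minimal sketch of a supervised regression comparison (the CSV file name and "score" column are placeholders, not the actual assignment data):

    ```python
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("nyc_building_energy.csv")        # hypothetical cleaned dataset
    X, y = df.drop(columns=["score"]), df["score"]     # features and Energy Star score target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Naive baseline: always predict the median training score.
    baseline_mae = mean_absolute_error(y_test, [y_train.median()] * len(y_test))

    model = GradientBoostingRegressor().fit(X_train, y_train)
    model_mae = mean_absolute_error(y_test, model.predict(X_test))

    print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
    ```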

  4. Statistics and Evaluation Data for Publication "Using Supervised Learning to...

    • zenodo.org
    application/gzip
    Updated May 24, 2020
    + more versions
    Cite
    Tobias Weber; Tobias Weber; Michael Fromm; Michael Fromm; Nelson Tavares de Sousa; Nelson Tavares de Sousa (2020). Statistics and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Field of Study" [Dataset]. http://doi.org/10.5281/zenodo.3841797
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Weber; Tobias Weber; Michael Fromm; Michael Fromm; Nelson Tavares de Sousa; Nelson Tavares de Sousa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprised of 609,524 records. This publication contains aggregated data for the paper. It also contains the evaluation data of all model/hyper-parameter training and test runs.

  5. Systematic review of validation of supervised machine learning models in...

    • search.dataone.org
    Updated Jun 25, 2025
    Cite
    Oakleigh Wilson (2025). Systematic review of validation of supervised machine learning models in accelerometer-based animal behaviour classification literature [Dataset]. http://doi.org/10.5061/dryad.fxpnvx14d
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Oakleigh Wilson
    Description

    Supervised machine learning has been used to detect fine-scale animal behaviour from accelerometer data, but a standardised protocol for implementing this workflow is currently lacking. As the application of machine learning to ecological problems expands, it is essential to establish technical protocols and validation standards that align with those in other "big data" fields. Overfitting is a prevalent and often misunderstood challenge in machine learning. Overfit models overly adapt to the training data, memorising specific instances rather than discerning the underlying signal. Associated results can indicate high performance on the training set, yet these models are unlikely to generalise to new data. Overfitting can be detected through rigorous validation using independent test sets. Our systematic review of 119 studies using accelerometer-based supervised machine learning to classify animal behaviour reveals that 79% (94 papers) did not validate their models sufficiently well...

    We defined eligibility criteria as 'peer-reviewed primary research papers published 2013-present that use supervised machine learning to identify specific behaviours from raw, non-livestock animal accelerometer data'. We elected to ignore analysis of livestock behaviour, as agricultural methods often operate within different constraints to the analyses conducted on wild animals and this body of literature has mostly developed in isolation from wild animal research. Our search was conducted on 27/09/2024. An initial keyword search across 3 databases (Google Scholar, PubMed, and Scopus) yielded 249 unique papers. Papers outside of the search criteria — including hardware and software advances, non-ML analysis, insufficient accelerometry application (e.g., research focused on other sensors with accelerometry providing minimal support), unsupervised methods, and research limited to activity intensity or active and inactive states — were excluded, resulting in 119 papers.

    https://doi.org/10.5061/dryad.fxpnvx14d

    Description of the data and file structure

    Files and variables

    File: Systematic_Review_Supplementary.xlsx

    Description: Methods information from animal accelerometer-based behaviour classification literature utilising supervised machine learning techniques.

    Variables

    • Citation: Citation information for paper
    • Title: Extracted title from citation information
    • Year: Year of publication
    • ModelCategory: General category of the supervised machine learning model used (e.g., all Support Vector Machines are listed as SVM)
      • DT — Decision Tree
      • EM — Expectation Maximisation
      • Ensemble — Ensemble methods (e.g., boosting, bagging)
      • HMM — Hidden Markov Model
      • Isolation Forest — Anomaly detection using Isolation Forest ...,
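    A minimal sketch of loading the supplementary spreadsheet listed above and tabulating the ModelCategory column (the sheet layout is assumed to match the variable list; requires pandas with openpyxl):

    ```python
    import pandas as pd

    reviews = pd.read_excel("Systematic_Review_Supplementary.xlsx")   # first sheet assumed
    print(reviews[["Citation", "Year", "ModelCategory"]].head())
    print(reviews["ModelCategory"].value_counts())   # e.g. how many papers used SVM, HMM, ...
    ```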
  6. Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jenna Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology. Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

    1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with the following information:
      • 1st column: instance id (numeric): unique id assigned to a name instance
      • 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
      • 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
      • 4th column: author name (string): name string formatted as surname, comma, and forename(s)
      • 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
      • 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
      • 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
      • 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

    2. 'Records' files contain lists of papers. Each paper is associated with the following information:
      • 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
      • 2nd column: year (numeric): year of publication. Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data.
      • 3rd column: venue (string): name of journal or conference in which the paper is published. Venue names can be in full string or in a shortened format according to the formats in original data.
      • 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline. Author names are formatted as surname, comma, and forename(s).
      • 5th column: title words (string; separated by space): words in the title of the paper. Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

    3. 'Clusters' files contain lists of clusters. Each cluster is associated with the following information:
      • 1st column: cluster id (numeric): unique id of a cluster
      • 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

    Signatures and Clusters files consist of two subsets — train and test files — of the original labeled data, which were randomly split 50%-50% by the authors of this study. Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

    [AMiner.zip]
    Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
    Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

    [KISTI.zip]
    Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
    Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
    Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

    [GESIS.zip]
    Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
    Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
    Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

    [UM-IRIS.zip]
    This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
    Kim, J., Kim, J., & Owen-Smith, J. (in print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
    For details on the labeling method and limitations, see the paper below.
    Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
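    A minimal sketch of reading one of the 'Signatures' files with the column layout listed above (the files are assumed to be tab-separated; adjust the delimiter if needed):

    ```python
    import pandas as pd

    signature_columns = [
        "instance_id", "paper_id", "byline_position", "author_name",
        "ethnic_name_group", "affiliation", "block", "author_id",
    ]

    signatures = pd.read_csv("signatures_train.txt", sep="\t",
                             names=signature_columns, header=None)
    print(signatures.head())
    print(signatures["ethnic_name_group"].value_counts())   # distribution of name-ethnicity groups
    ```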

  7. Data from: color classification

    • kaggle.com
    zip
    Updated Apr 20, 2018
    Cite
    Aydin Ayanzadeh (2018). color classification [Dataset]. https://www.kaggle.com/ayanzadeh93/color-classification
    Explore at:
    Available download formats: zip (169343980 bytes)
    Dataset updated
    Apr 20, 2018
    Authors
    Aydin Ayanzadeh
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Introduction

    Color classification is an important application that is used in many areas. For example, systems that perform daily life analysis can benefit from this classification process. Many classification algorithms can be used; among the most popular machine learning algorithms are neural networks, decision trees, k-nearest neighbors, Bayes networks, and support vector machines (SVMs). In this work, SVMs are used for training to obtain a classifier model. The SVM algorithm is a supervised learning method and, like all supervised learning methods, is applied to regression and classification problems. It is usually used to train on, separate, and classify differently labeled samples. The aim of training an SVM is to create an optimal hyperplane that classifies the data into different classes. This hyperplane is placed as far as possible from the nearest samples of each class to avoid classification errors.

    Dataset

    The dataset contains about 80 images across all color classes for the training set and 90 images for the test set. The colors prepared for this application are yellow, black, white, green, red, orange, blue, and violet. In this implementation, basic colors are preferred for classification, and a dataset containing images of these basic colors was created. The dataset also includes masks for all images. These masks were created by binarizing the images: for each collected image, the pixels belonging to the class color were painted white and the remaining pixels black.
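    A minimal sketch of SVM-based color classification in the spirit of the description above (not the authors' exact pipeline): each image/mask pair is reduced to the mean RGB of the masked pixels, and an SVC learns a separating hyperplane between the color classes. The file names are illustrative.

    ```python
    import numpy as np
    from PIL import Image
    from sklearn.svm import SVC

    def mean_masked_rgb(image_path: str, mask_path: str) -> np.ndarray:
        img = np.asarray(Image.open(image_path).convert("RGB"), dtype=float)
        mask = np.asarray(Image.open(mask_path).convert("L")) > 127   # white = class-color pixels
        return img[mask].mean(axis=0)                                 # 3-element RGB feature

    # (image, mask, label) triples would come from the training folder.
    train_samples = [("red_01.jpg", "red_01_mask.png", "red"),
                     ("blue_01.jpg", "blue_01_mask.png", "blue")]     # hypothetical file names

    X = np.array([mean_masked_rgb(im, mk) for im, mk, _ in train_samples])
    y = [label for _, _, label in train_samples]

    clf = SVC(kernel="linear").fit(X, y)   # fits an optimal separating hyperplane
    print(clf.predict(X))
    ```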

  8. Datasheet1_Improved stacking ensemble learning based on feature selection to...

    • frontiersin.figshare.com
    pdf
    Updated Jan 19, 2024
    Cite
    Mingyuan Wang; Yiyi Qian; Yaodong Yang; Haobin Chen; Wei-Feng Rao (2024). Datasheet1_Improved stacking ensemble learning based on feature selection to accurately predict warfarin dose.pdf [Dataset]. http://doi.org/10.3389/fcvm.2023.1320938.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Mingyuan Wang; Yiyi Qian; Yaodong Yang; Haobin Chen; Wei-Feng Rao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: With the rapid development of artificial intelligence, prediction of warfarin dose via machine learning has received more and more attention. Since dose prediction involves both linear and nonlinear problems, traditional machine learning algorithms are ineffective at solving such problems in one pass.

    Objective: Based on the characteristics of clinical data of Chinese warfarin patients, an improved stacking ensemble learning can achieve higher prediction accuracy.

    Methods: Information on 641 patients from southern China who had reached a steady state on warfarin was collected, including demographic information, medical history, genotype, and co-medication status. The dataset was randomly divided into a training set (90%) and a test set (10%). The predictive capability was evaluated on a new test set generated by stacking ensemble learning. Additional factors associated with warfarin dose were discovered by feature selection methods.

    Results: The newly proposed heuristic-stacking ensemble learning performs better than traditional stacking ensemble learning on key metrics such as accuracy of ideal dose (73.44% vs. 71.88%), mean absolute error (0.11 mg/day vs. 0.13 mg/day), root mean square error (0.18 mg/day vs. 0.20 mg/day), and R2 (0.87 vs. 0.82).

    Conclusions: The developed heuristic-stacking ensemble learning can satisfactorily predict warfarin dose with high accuracy. A relationship between hypertension, a history of severe preoperative embolism, and warfarin dose is found, which provides a useful reference for warfarin dose administration in the future.
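    A minimal sketch of stacking ensemble regression in the sense described above, using a generic scikit-learn stack rather than the paper's heuristic variant; the file and column names are placeholders:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor, StackingRegressor
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR

    df = pd.read_csv("warfarin_patients.csv")                          # hypothetical file
    X, y = df.drop(columns=["dose_mg_per_day"]), df["dose_mg_per_day"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

    stack = StackingRegressor(
        estimators=[("rf", RandomForestRegressor()), ("svr", SVR())],  # level-0 base learners
        final_estimator=Ridge(),                                       # level-1 meta-learner
    )
    stack.fit(X_tr, y_tr)
    print("MAE:", mean_absolute_error(y_te, stack.predict(X_te)))
    ```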

  9. Data Sheet 1_Contrastive self-supervised learning for neurodegenerative...

    • frontiersin.figshare.com
    pdf
    Updated Feb 17, 2025
    Cite
    Vadym Gryshchuk; Devesh Singh; Stefan Teipel; Martin Dyrba; the ADNI, AIBL, FTLDNI study groups (2025). Data Sheet 1_Contrastive self-supervised learning for neurodegenerative disorder classification.pdf [Dataset]. http://doi.org/10.3389/fninf.2025.1527582.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Frontiers
    Authors
    Vadym Gryshchuk; Devesh Singh; Stefan Teipel; Martin Dyrba; the ADNI, AIBL, FTLDNI study groups
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Neurodegenerative diseases such as Alzheimer's disease (AD) or frontotemporal lobar degeneration (FTLD) involve specific loss of brain volume, detectable in vivo using T1-weighted MRI scans. Supervised machine learning approaches classifying neurodegenerative diseases require diagnostic labels for each sample. However, it can be difficult to obtain expert labels for a large amount of data. Self-supervised learning (SSL) offers an alternative for training machine learning models without data labels.

    Methods: We investigated whether SSL models can be applied to distinguish between different neurodegenerative disorders in an interpretable manner. Our method comprises a feature extractor and a downstream classification head. A deep convolutional neural network, trained with a contrastive loss, serves as the feature extractor that learns latent representations. The classification head is a single-layer perceptron that is trained to perform diagnostic group separation. We used N = 2,694 T1-weighted MRI scans from four data cohorts: two ADNI datasets, AIBL, and FTLDNI, including cognitively normal controls (CN), cases with prodromal and clinical AD, as well as FTLD cases differentiated into its phenotypes.

    Results: Our results showed that the feature extractor trained in a self-supervised way provides generalizable and robust representations for the downstream classification. For AD vs. CN, our model achieves 82% balanced accuracy on the test subset and 80% on an independent holdout dataset. Similarly, the behavioral variant of frontotemporal dementia (BV) vs. CN model attains an 88% balanced accuracy on the test subset. The average feature attribution heatmaps obtained by the Integrated Gradients method highlighted hallmark regions, i.e., temporal gray matter atrophy for AD and insular atrophy for BV.

    Conclusion: Our models perform comparably to state-of-the-art supervised deep learning approaches. This suggests that the SSL methodology can successfully make use of unannotated neuroimaging datasets as training data while remaining robust and interpretable.
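    A minimal sketch of the frozen-feature-extractor plus linear-head pattern described in the Methods, assuming latent representations have already been extracted for each scan (a logistic-regression head stands in for the single-layer perceptron; the .npy file names and label encoding are hypothetical):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X = np.load("ssl_embeddings.npy")     # (n_scans, latent_dim) from the contrastive encoder
    y = np.load("diagnosis_labels.npy")   # e.g. 0 = CN, 1 = AD

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    head = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # linear classification head
    print("balanced accuracy:", balanced_accuracy_score(y_te, head.predict(X_te)))
    ```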

  10. Complete Blood Count (CBC)

    • kaggle.com
    zip
    Updated Aug 1, 2024
    Cite
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    Available download formats: zip (12859168 bytes)
    Dataset updated
    Aug 1, 2024
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include:

    • Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights.
    • Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders.
    • Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.
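    A minimal Keras sketch of the normalization and augmentation suggestions above; the folder-per-class layout under cbc/train and cbc/valid is an assumption about how the files are organized (the bounding-box annotations are not used here):

    ```python
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_gen = ImageDataGenerator(
        rescale=1.0 / 255,      # normalization (preprocessing)
        rotation_range=15,      # light augmentation to diversify the 300 training images
        horizontal_flip=True,
        zoom_range=0.1,
    )
    valid_gen = ImageDataGenerator(rescale=1.0 / 255)   # no augmentation for validation

    train_batches = train_gen.flow_from_directory("cbc/train", target_size=(224, 224), batch_size=16)
    valid_batches = valid_gen.flow_from_directory("cbc/valid", target_size=(224, 224), batch_size=16)
    ```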

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.

  11. Clay, Viviane (2021). Dataset: Data from neural network training in the...

    • service.tib.eu
    Updated May 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Clay, Viviane (2021). Dataset: Data from neural network training in the obstacle tower environment to investigate embodied, weakly supervised learning. https://doi.org/10.26249/FK2/BFDUZO [Dataset]. https://service.tib.eu/ldmservice/dataset/osn-doi-10-26249-fk2-bfduzo
    Explore at:
    Dataset updated
    May 16, 2025
    Description

    Description: This repository presents data collected to investigate the role of embodiment and supervision in learning. This is done inside a simulated 3D maze world with a navigation task using mainly visual input in the form of RGB images. The main contribution of this data repository is to provide a network model trained in this environment with weak supervision and a closed loop between action and perception. Additionally, control networks are provided which were trained with varying degrees of supervision and embodiment. In the corresponding paper [1] the representations of these networks are compared based on sparsity measures as well as the content of the encodings and the possibility to extract semantic labels. For the training of the control conditions, several new datasets were created which are also included here. They contain a collection of images from the simulated world with corresponding semantic labels. Overall, they provide a good basis for further analysis and a more in-depth investigation of representation learning and the effect of embodiment and supervision on representations.

    Steps to reproduce: Data was generated through a 3D simulation of a maze environment called Obstacle Tower. The data of interest are the trained neural network weights and the network activations corresponding to different input frames. Three main networks were trained: a reinforcement learning agent trained through interaction with the simulated environment, an autoencoder trained to reconstruct images collected by the agent, and a classifier trained to classify objects in the images. Exact training and testing conditions, hyperparameters, and network structure are provided in the corresponding paper. For the training of the reinforcement learning agent, the Unity ML-Agents toolkit PPO implementation is used with small modifications for extra data collection and control experiments. The code we used can be found here: https://github.com/vkakerbeck/ml-agents-dev. Model checkpoint files are saved for different points in training, but mostly the final version of the network is analysed in the corresponding paper [1]. The autoencoder and classifier are trained using Python with TensorFlow and Keras. The corresponding code can be found here: https://github.com/vkakerbeck/Learning-World-Representations/tree/master/DataAnalysis. The data also contain activations in the hidden layer of the network corresponding to 4,000 test images for all three networks. Code for this can be found in the same GitHub repository. The datasets used for training the autoencoder and classifier were created by collecting observations in the Obstacle Tower environment using the trained agent. These observations were then labelled automatically, and the labels were cross-checked by hand. A description of the individual files is included in the data folder (Description.txt). Due to storage constraints, not all model checkpoint files used to create figure 6 of the paper could be uploaded. However, feel free to contact me (vkakerbeck[at]uos.de) if you are interested in these detailed checkpoint files of the control runs and I will make them available to you.
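    A minimal Keras sketch of a convolutional autoencoder for RGB frames, in the spirit of the autoencoder control condition described above (a generic architecture, not the one from the paper; the 84x84 input size and 256-unit bottleneck are assumptions):

    ```python
    from tensorflow.keras import layers, models

    inputs = layers.Input(shape=(84, 84, 3))
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)     # -> 21x21x64
    x = layers.Flatten()(x)
    latent = layers.Dense(256, activation="relu", name="latent")(x)               # hidden representation
    x = layers.Dense(21 * 21 * 64, activation="relu")(latent)
    x = layers.Reshape((21, 21, 64))(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(x)

    autoencoder = models.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")   # trained to reconstruct the input frames
    autoencoder.summary()
    ```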

  12. 2023 Irrigated Lands for the Eastern Snake River Plain Aquifer: Machine...

    • hub.arcgis.com
    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • +1more
    Updated Sep 12, 2025
    + more versions
    Cite
    Idaho Department of Water Resources (2025). 2023 Irrigated Lands for the Eastern Snake River Plain Aquifer: Machine Learning Generated [Dataset]. https://hub.arcgis.com/documents/1a7fe009dc7d4979ac834873e99f6985
    Explore at:
    Dataset updated
    Sep 12, 2025
    Dataset authored and provided by
    Idaho Department of Water Resources
    Area covered
    Snake River Plain
    Description

    ESPA Irrigated Lands 2023 was created for use in water budget studies within the ESPA study boundary. The area of interest was determined by Hydrology Section staff at IDWR, and a study boundary was given to GIS staff and used to clip the model output. The random forest (RF) model is a type of supervised machine learning algorithm requiring GIS staff to provide manually labeled training data. GIS staff also provide the RF model with several input features, typically raster datasets that help distinguish characteristics of irrigated lands. ESPA Irrigated Lands 2023 used the following as input features:

    • Harmonized Landsat 8 and 9 OLI and Sentinel-2A and -2B satellites (HLS-2 Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30 m [2], HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30 m [3]; bands: SWIR-2, NIR, Blue, and calculated NDVI)
    • 10-meter digital elevation model [4]
    • Height Above Nearest Drainage [5]
    • OpenET Ensemble monthly evapotranspiration [6]
    • PRISM Climate Dataset [7]
    • Topographic Wetness Index, derived from the digital elevation model [4]

    For additional information on processing Landsat and Sentinel-2 surface reflectance imagery, please see below. Additional datasets used only for labeling training data include Mapping EvapoTranspiration at high Resolution with Internalized Calibration (METRIC) [8], IDWR-provided Active Water Rights Place of Use, the Cropland Data Layer [9] for 2023, and the National Agriculture Imagery Program (NAIP) imagery [10] for Idaho 2023.

    The accuracy of the ESPA Irrigated Lands 2023 dataset was verified by several methods. Firstly, a validation test was conducted by withholding a subset of the training data to evaluate how well the model classified unseen information. Second, GIS staff ran several iterations of the model with variations of training data, with the goal of improving classification for areas consistently misclassified. This process requires GIS staff knowledge, aided by supplementary datasets, to review the area and make decisions. Once a model iteration was determined as 'final', a manual mask was created to correct any remaining misclassification in the dataset.

    Manual corrections for the ESPA Irrigated Lands 2023 dataset were focused on the area between Ashton and Lamont, where false positive labels of "irrigated" occurred on dryland-managed fields. Some areas classified as irrigated near Bellevue were masked out due to suspected wetland. A general wetland mask for the entire ESPA study boundary was also applied. Other manual corrections were made throughout the study area, specifically for pivot-irrigated fields not matching the NAIP field boundaries. Decisions made during manual masking were conservative, relying heavily on both the presence of an active water right and clear indications of artificial application of water as observed in satellite imagery.

    References:
    [1] https://developers.google.com/earth-engine/apidocs/ee-classifier-smilerandomforest
    [2] https://developers.google.com/earth-engine/datasets/catalog/NASA_HLS_HLSS30_v002
    [3] https://developers.google.com/earth-engine/datasets/catalog/NASA_HLS_HLSL30_v002
    [4] https://developers.google.com/earth-engine/datasets/catalog/USGS_3DEP_10m
    [5] Donchyts, G., Winsemius, H., Schellekens, J., Erickson, T., Gao, H., Savenije, H., & van de Giesen, N. (2016). Global 30m height above the nearest drainage (HAND). Geophysical Research Abstracts, 18, EGU2016-17445-3. EGU General Assembly 2016.
    [6] https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
    [7] Daly, C., Halbleib, M., Smith, J.I., Gibson, W.P., Doggett, M.K., Taylor, G.H., Curtis, J. & Pasteris, P.A. (2008). Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. International Journal of Climatology, 28, 2031-2064. doi:10.1002/joc.1688
    [8] https://data-idwr.hub.arcgis.com/documents/4defd5144b314fdcb010717cc6936648/about
    [9] https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL
    [10] https://developers.google.com/earth-engine/datasets/catalog/USDA_NAIP_DOQQ
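    A minimal sketch of the supervised random forest workflow described above, using the Earth Engine Python API (ee.Classifier.smileRandomForest, reference [1]); the asset IDs, band stack, and the "irrigated" label property are placeholders, not IDWR's actual inputs:

    ```python
    import ee

    ee.Initialize()

    features = ee.Image("users/example/espa_input_features")         # hypothetical stacked input-feature raster
    training_points = ee.FeatureCollection("users/example/labels")   # hypothetical manually labeled points

    # Sample the feature stack at the labeled training points.
    training = features.sampleRegions(collection=training_points, properties=["irrigated"], scale=30)

    # Train a random forest and classify the study area.
    classifier = ee.Classifier.smileRandomForest(numberOfTrees=100).train(
        features=training, classProperty="irrigated", inputProperties=features.bandNames()
    )
    classified = features.classify(classifier)
    ```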

  13. Machine learning feature data from EHR, labels, and estimates for next...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Nov 29, 2024
    Cite
    Grace Y. E. Kim; Matthew Schwede; Conor K. Corbin; Sajjad Fouladvand; Rondeep Brar; David Iberri; William Shomali; Jean Oak; Dita Gratzinger; Henning Stehr; Jonathan H. Chen (2024). Machine learning feature data from EHR, labels, and estimates for next generation sequencing-based assay [Dataset]. http://doi.org/10.5061/dryad.nzs7h450b
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Grace Y. E. Kim; Matthew Schwede; Conor K. Corbin; Sajjad Fouladvand; Rondeep Brar; David Iberri; William Shomali; Jean Oak; Dita Gratzinger; Henning Stehr; Jonathan H. Chen
    Description

    Next-generation sequencing-based tests have advanced the field of medical diagnostics, but their novelty and cost can lead to uncertainty in clinical deployment. The Heme-STAMP is one such assay that tracks mutations in genes implicated in hematolymphoid neoplasms. Rather than limiting its clinical usage or imposing rule-based criteria, we propose leveraging machine learning to guide clinical decision-making on whether this test should be ordered. We trained a machine learning model to predict the outcome of Heme-STAMP testing using 3,472 orders placed between May 2018 and September 2021 from an academic medical center and demonstrated how to integrate a custom machine learning model into a live clinical environment to obtain real-time model and physician estimates. The model predicted the results of a complex next-generation sequencing test with discriminatory power comparable to expert hematologists (AUC score: 0.77 [0.66, 0.87] and 0.78 [0.68, 0.86], respectively) and with capacity to im...

    The feature data was pulled from the STAnford medicine Research data Repository (STARR) and further processed to meet the needs of this study and privacy guidelines. Labels were obtained through the Stanford Pathology Department. Ordering physician estimates were generated by participating physicians and model estimates were generated by the machine learning model used in the study.

    https://doi.org/10.5061/dryad.nzs7h450b

    Description of the data and file structure

    These datasets were utilized to train and evaluate a machine learning model that predicts the outcome of the Heme-STAMP test, a next-generation sequencing assay that tracks mutations in genes implicated in hematolymphoid neoplasms. The feature_data_anon.csv file was used to train/test a Random Forest model and uses features such as demographics, lab results, medications, diagnoses, etc. Numerical values were binned by their distribution; for example, "Age0" would correspond to the 1st bucket of values while "Age_3" would correspond to the 4th bucket. The estimates.csv file contains the estimates generated by the ordering physician and the machine learning model on the orders that were prospectively collected.
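    A minimal sketch of training and evaluating a Random Forest on feature_data_anon.csv as described above; the name of the outcome column ("label" here) is a placeholder:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    features = pd.read_csv("feature_data_anon.csv")
    X, y = features.drop(columns=["label"]), features["label"]   # "label" is hypothetical

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
    ```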

    Files and variables

    File: Feature_data...

  14. 2004 Irrigated Lands for the Mountain Home Plateau: Machine Learning...

    • hub.arcgis.com
    • gis-idaho.hub.arcgis.com
    • +2more
    Updated Oct 11, 2025
    + more versions
    Cite
    Idaho Department of Water Resources (2025). 2004 Irrigated Lands for the Mountain Home Plateau: Machine Learning Generated [Dataset]. https://hub.arcgis.com/documents/20215b3d9cdf4cfc8b11ab98c924a51a
    Explore at:
    Dataset updated
    Oct 11, 2025
    Dataset authored and provided by
    Idaho Department of Water Resources
    Description

    Mountain Home Irrigated Lands 2004 was created for use in water budget studies in Mountain Home. The area of interest was determined by Hydrology Section staff at IDWR, and a study boundary was given to GIS staff and used to clip the model output. The random forest (RF) model is a type of supervised machine learning algorithm requiring GIS staff to provide manually labeled training data. GIS staff also provide the RF model with several input features, typically raster datasets that help distinguish characteristics of irrigated lands. Mountain Home Irrigated Lands 2004 used the following as input features:

    • Landsat 5 [2] and Landsat 7 [3] averaged surface reflectance imagery (bands: SWIR 2, NIR, Blue, and calculated NDVI)
    • 10-meter digital elevation model [4]
    • Height Above Nearest Drainage (HAND) [5]
    • PRISM Climate Dataset [6]
    • Topographic Wetness Index, derived from the digital elevation model [4]

    For additional information on the interpolation process for Landsat imagery, please see below. Additional datasets used only for labeling training data include IDWR-provided Active Water Rights Place of Use and National Agriculture Imagery Program (NAIP) aerial imagery for 2004 [7].

    The accuracy of the Mountain Home Irrigated Lands 2004 dataset was verified by several methods. Firstly, a validation test is done by withholding a subset of the training data to evaluate how well the model classifies unseen information. Second, GIS staff will run several iterations of the model with variations of training data, with the goal of improving classification for areas consistently misclassified. This process requires GIS staff knowledge, aided by supplementary datasets, to review the area and make decisions. Once a model iteration is determined as 'final', a manual mask is created to correct any remaining misclassification in the dataset. Misclassification within the Mountain Home Irrigated Lands 2004 dataset was minimal, occurring primarily in the southern areas near the Snake River, as well as around reservoirs and stream channels. GIS staff manually reviewed potential misclassifications by examining Landsat 5 and Landsat 7 imagery, NAIP aerial imagery, and IDWR Active Irrigation Water Rights.

    References:
    [1] https://developers.google.com/earth-engine/apidocs/ee-classifier-smilerandomforest
    [2] https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC05_C02_T1_L2
    [3] https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LE07_C02_T1_L2
    [4] https://developers.google.com/earth-engine/datasets/catalog/USGS_3DEP_10m
    [5] Donchyts, G., Winsemius, H., Schellekens, J., Erickson, T., Gao, H., Savenije, H., & van de Giesen, N. (2016). Global 30m height above the nearest drainage (HAND). Geophysical Research Abstracts, 18, EGU2016-17445-3. EGU General Assembly 2016.
    [6] Daly, C., Halbleib, M., Smith, J.I., Gibson, W.P., Doggett, M.K., Taylor, G.H., Curtis, J. & Pasteris, P.A. (2008). Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. International Journal of Climatology, 28, 2031-2064. doi:10.1002/joc.1688
    [7] U.S. Department of Agriculture, Farm Service Agency. (2004). National Agriculture Imagery Program (NAIP) imagery [Digital image]. U.S. Department of Agriculture. https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/

    Information on interpolated imagery: GIS staff prepared averaged Landsat images to reduce missing data from cloud cover. Images were averaged across four periods: March 1–May 1, May 1–July 1, July 1–September 1, and September 1–November 1. These same periods were also used to average PRISM climate data. The temporal extent of other input features was filtered to March 1–November 30, 2004, where applicable.
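    A minimal sketch of the seasonal averaging described above using the Earth Engine Python API: a mean composite of Landsat 7 surface reflectance for one of the four periods (the study-boundary asset ID is a placeholder):

    ```python
    import ee

    ee.Initialize()

    boundary = ee.FeatureCollection("users/example/mountain_home_boundary")   # hypothetical boundary asset

    spring = (
        ee.ImageCollection("LANDSAT/LE07/C02/T1_L2")
        .filterDate("2004-03-01", "2004-05-01")   # March 1 - May 1 period
        .filterBounds(boundary)
        .mean()                                   # per-pixel average to reduce cloud gaps
    )
    ```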

  15. Handwritten Arabic Numerals (0-9) Image Dataset

    • data.mendeley.com
    Updated May 20, 2024
    + more versions
    Cite
    Huzain Azis (2024). Handwritten Arabic Numerals (0-9) Image Dataset [Dataset]. http://doi.org/10.17632/5hpkf8v7bg.1
    Explore at:
    Dataset updated
    May 20, 2024
    Authors
    Huzain Azis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This dataset contains images of handwritten Arabic numerals ranging from 0 to 9. It comprises a total of 9350 samples, with 935 images for each numeral class. The images were collected from various individuals to ensure diversity in handwriting styles.

    Key Features:

    • Classes: 10 (Arabic numerals 0-9)
    • Total Samples: 9350
    • Samples per Class: 935
    • Image Format: Grayscale
    • Image Size: 28x28 pixels (adjust if different)

    Data Collection and Labeling:

    The dataset was created by collecting handwritten numerals from participants with different handwriting styles. Each image was manually labeled to ensure accurate and consistent annotations. The data collection and labeling process was meticulously carried out by one of the authors.

    Usage:

    This dataset is suitable for training and testing machine learning models for handwritten digit recognition. It can be used in various applications such as optical character recognition (OCR) systems, pattern recognition, and other related fields.
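    A minimal sketch of training a simple digit classifier on the 28x28 grayscale images described above; the folder-per-class layout ("digits/0", "digits/1", ...) and the .png extension are assumptions about how the downloaded files are organized:

    ```python
    import numpy as np
    from pathlib import Path
    from PIL import Image
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = [], []
    for class_dir in sorted(Path("digits").iterdir()):            # one subfolder per numeral 0-9
        for img_path in class_dir.glob("*.png"):
            img = Image.open(img_path).convert("L").resize((28, 28))
            X.append(np.asarray(img, dtype=float).ravel() / 255)  # flatten to a 784-vector
            y.append(class_dir.name)

    X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y), stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    ```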

    Contributors:

    • Author 1: Conducted the data collection and labeling process, ensuring accurate and consistent annotations for all samples.
    • Author 2: Handled the data labelling process.

    Acknowledgments:

    We would like to thank all the participants who contributed their handwritten numerals for this dataset.

    License:

    CC BY NC 3.0 You are free to adapt, copy or redistribute the material, providing you attribute appropriately and do not use the material for commercial purposes.

  16. MoLa RGB CovSurv

    • data.mendeley.com
    Updated Dec 20, 2021
    + more versions
    Cite
    César Melo (2021). MoLa RGB CovSurv [Dataset]. http://doi.org/10.17632/vzf939jbxy.1
    Explore at:
    Dataset updated
    Dec 20, 2021
    Authors
    César Melo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository presents one of the datasets described in the article "AI based monitoring of different risk levels in Covid19 context", published in the Multidisciplinary Digital Publishing Institute special issue "Human Activity Recognition Based on Image Sensors and Deep Learning".

    The repository includes the complete dataset used for the training, validation, and testing tasks for detecting the presence or absence of masks worn by people in public areas.

    There are two folders, images and labels, each divided into three subsets (train, valid, test). For each image, there is a text file with exactly the same name containing the information about each object (in this case, people's faces).

    Each label line gives the class associated with the object (0: With_Mask, 1: Without_Mask) and the corresponding normalized bounding-box values for the face (x_center, y_center, width, height).
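    A minimal sketch of reading one of the label files described above (one line per face: class, x_center, y_center, width, height, all normalized) and converting a box back to pixel coordinates; the file paths are illustrative:

    ```python
    from PIL import Image

    CLASS_NAMES = {0: "With_Mask", 1: "Without_Mask"}

    img = Image.open("images/train/frame_0001.jpg")      # hypothetical image path
    img_w, img_h = img.size

    with open("labels/train/frame_0001.txt") as f:        # matching label file
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            left = (xc - w / 2) * img_w
            top = (yc - h / 2) * img_h
            print(CLASS_NAMES[int(cls)], round(left), round(top), round(w * img_w), round(h * img_h))
    ```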

  17. Image for the dataset in "Extraction of stratigraphic exposures on visible...

    • zenodo.org
    zip
    Updated Oct 2, 2023
    Cite
    Rina Noguchi; Rina Noguchi; Daigo Shoji; Daigo Shoji (2023). Image for the dataset in "Extraction of stratigraphic exposures on visible images using a supervised machine learning technique" [Dataset]. http://doi.org/10.5281/zenodo.8396332
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rina Noguchi; Rina Noguchi; Daigo Shoji; Daigo Shoji
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the original and hand-masked image for the dataset used in a research paper "Extraction of stratigraphic exposures on visible images using a supervised machine learning technique".

    The content is

    • Original images with hand-masked images (original_images_NOGUCHIandShoji.zip)
      • training/* : original images used for the training dataset generation (60 files)
      • training_masks/* : hand-masked images for the training dataset generation (60 files)
      • validation/* : original images used for the validation dataset generation (10 files)
      • validation_masks/* : hand-masked images for the validation dataset generation (10 files)
      • test/* : original images used as the test data (5 files)
      • test_masks/* : hand-masked images used as the test data (5 files).

    Note that the original images include images obtained using google-images-download, a Python script published on GitHub (https://github.com/Joeclinton1/google-images-download/tree/patch-1, Copyright © 2015-2019 Hardik Vasa). All images we obtained with google-images-download were labeled as available for noncommercial reuse with modification.

    For more details, please refer to a research paper "Extraction of stratigraphic exposures on visible images using a supervised machine learning technique".

    Correspondence: Rina Noguchi (r-noguchi@env.sc.niigata-u.ac.jp)

  18. AI Research Instructions and Outputs

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). AI Research Instructions and Outputs [Dataset]. https://www.kaggle.com/datasets/thedevastator/ai-research-instructions-and-outputs/discussion
    Explore at:
    zip(32193107 bytes)Available download formats
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AI Research Instructions and Outputs

    Driving Innovation in Machine Learning and AI Exploration

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 80,000 unique instruction-output pairs for machine learning and AI research. Instructions such as 'run', 'walk', 'jump', and 'dance' are paired with outputs representing the result of executing each instruction. The collection can be used to train AI agents, build natural-language applications, explore autonomous navigation, develop dialogues between bots and humans, replicate robotic tasks, and study models that understand instructions across domains such as engineering, medicine, finance, or law. By combining verbal commands with the outputs of executing them, it supports data-driven machine learning strategies with richer contextual understanding than instruction-only corpora.

    How to use the dataset

    This dataset contains 80,000 pairs of instructions and outputs for machine learning and AI research. The data can be used to train a variety of AI agents, as well as for tasks like autonomous navigation, dialogue, language modelling, natural language processing (NLP), and robotics applications. The following guide outlines the steps to take in order to get the most out of the dataset.

    • Download the dataset from Kaggle – once downloaded you will have access to two files: instruction.csv and output.csv.
    • Examine the data – take some time to familiarize yourself with the dataset. The instruction column contains verbs such as 'run', 'walk', 'jump', etc., along with accompanying outputs generated from executing those instructions.
    • Transform the data – apply feature engineering techniques appropriate for your project to extract relevant features that can be used downstream by supervised algorithms such as neural networks or by unsupervised methods such as clustering.
    • Train and test models – develop predictive models using supervised or unsupervised techniques; split the data into a training set (80%) and a validation set (20%) before running on the full dataset so that model performance can be properly assessed, fix random seeds where repeatability matters, and adjust hyperparameters until the desired results are obtained (a minimal sketch of these first steps follows this list).
    • Deploy models – deploy the model in real-world scenarios where appropriate, e.g. an autonomous car relying on natural-language inputs while driving through town, or a domestic robot understanding sentences given by its user.
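    A minimal sketch of the download-and-split steps above (the assumption that instruction.csv and output.csv are row-aligned, and the column layout, are mine; check the actual CSV headers after downloading):

    ```python
    # Sketch: load the two CSV files and make an 80/20 train/validation split.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    instructions = pd.read_csv("instruction.csv")
    outputs = pd.read_csv("output.csv")
    data = pd.concat([instructions, outputs], axis=1)  # assumes the files are row-aligned

    train_df, valid_df = train_test_split(data, test_size=0.2, random_state=42)
    print(f"train: {len(train_df)} rows, validation: {len(valid_df)} rows")
    ```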

    Research Ideas

    • Training virtual assistants with specific domain knowledge (e.g., medicine or finance).
    • Developing autonomous navigation systems that respond to verbal instructions given by a user in natural language.
    • Creating dialogue agents that can answer questions based on a pre-defined set of rules pertaining to the instructions given by the user.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the original data source.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without...

  19. Data from: Expression-based machine learning models for predicting plant...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Aug 5, 2025
    Cite
    Daniel Chitwood; Sourabh Palande (2025). Expression-based machine learning models for predicting plant tissue identity [Dataset]. http://doi.org/10.5061/dryad.4b8gthtn7
    Explore at:
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Daniel Chitwood; Sourabh Palande
    Description

    The selection of Arabidopsis as a model organism played a pivotal role in advancing genomic science. Competing frameworks to select an agricultural- or ecological-based model species were selected against in favor of building knowledge in a species that would facilitate genome-enabled research. Here, we examine the ability of models based on Arabidopsis gene expression data to predict tissue identity in other flowering plants. Comparing different machine learning algorithms, models trained and tested on Arabidopsis data achieved near-perfect precision and recall values, whereas when tissue identity is predicted across the flowering plants using models trained on Arabidopsis data, precision values range from 0.69 to 0.74 and recall from 0.54 to 0.64. Below-ground tissue is more predictable than other tissue types, and the ability to predict tissue identity is not correlated with phylogenetic distance from Arabidopsis. K-Nearest Neighbors is the most successful algorithm and suggests that...

    We analyzed gene expression data from two sources. The first (Zhang et al., 2020) contains 28,165 Arabidopsis gene expression profiles across 37,334 genes. The second (Palande et al., 2023) contains 2,671 flowering plant gene expression profiles across 6,327 orthogroups. Originally, gene expression profiles were classified into 23 tissue types based on their original designations: “anther,” “carpel,” “cotyledon,” “flower,” “hypocotyl,” “inflorescence,” “internode,” “leaf,” “other,” “petal,” “petiole,” “pistil,” “reproductive-other,” “root,” “root cell,” “seed,” “seedling,” “sepal,” “shoot,” “stamen,” “stigma,” “vasculature,” or “whole plant.” Due to large differences in sample size between these categories, they were aggregated into four tissue type labels: "aboveground", "below ground", "whole plant", and "other". The categories are purposefully encompassing and were chosen to facilitate accurate assignment across the broad categories of experimental data we analyzed, focusing on aboveg...

    Expression-based machine learning models for predicting plant tissue identity

    Arabidopsis Gene Expression Dataset

    https://doi.org/10.5061/dryad.4b8gthtn7

    The dataset contains three .parquet files:

    1) gene_FPKM_200501.parquet: The original gene expression database was downloaded from the Arabidopsis RNA-Seq Database (Zhang et al., 2020). The original dataset contains 28,165 Arabidopsis gene expression profiles across 37,334 genes.
    2) gene_FPKM_transposed.parquet: Simply the transposed version of gene_FPKM_200501.parquet, which is better aligned with typical machine learning datasets where samples are represented in rows.
    3) gene_FPKM_transposed_UMR75.parquet: The gene expression profiles (gene_FPKM_transposed.parquet) were filtered to remove samples with a unique mapped rate below 75%. This dataset is used to train and test machine learning model...
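    As an illustrative sketch of how the filtered profiles might be used with the K-Nearest Neighbors approach mentioned in the abstract (the label column name "tissue_type" and the exact table layout are assumptions; inspect the parquet files first):

    ```python
    # Sketch: load the filtered expression table and train a KNN tissue-identity classifier.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report

    df = pd.read_parquet("gene_FPKM_transposed_UMR75.parquet")  # samples in rows, genes in columns
    X = df.drop(columns=["tissue_type"])   # assumed label column
    y = df["tissue_type"]                  # e.g. "aboveground", "below ground", "whole plant", "other"

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))  # per-class precision and recall
    ```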

  20. Table_1_Tracking financing for global common goods for health: A machine...

    • frontiersin.figshare.com
    docx
    Updated Jun 21, 2023
    Cite
    Siddharth Dixit; Wenhui Mao; Kaci Kennedy McDade; Marco Schäferhoff; Osondu Ogbuoji; Gavin Yamey (2023). Table_1_Tracking financing for global common goods for health: A machine learning approach using natural language processing techniques.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2022.1031147.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Siddharth Dixit; Wenhui Mao; Kaci Kennedy McDade; Marco Schäferhoff; Osondu Ogbuoji; Gavin Yamey
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Tracking global health funding is a crucial but time-consuming and labor-intensive process. This study aimed to develop a framework to automate the tracking of global health spending using natural language processing (NLP) and machine learning (ML) algorithms. We used the global common goods for health (CGH) categories developed by Schäferhoff et al. to design and evaluate ML models.

    Methods: We used data curated by Schäferhoff et al., which tracked official development assistance (ODA) disbursements to global CGH for 2013, 2015, and 2017, for training and validating the ML models. To process the raw text, we implemented different NLP techniques, such as removing stop words, lemmatization, and creation of synthetic text to balance the dataset. We used four supervised learning ML algorithms—random forest (RF), XGBOOST, support vector machine (SVM), and multinomial naïve Bayes (MNB) (see Glossary)—to train and test on the pre-coded dataset, and applied the best model to a dataset that had not been manually coded to predict the financing for CGH in 2019.

    Results: After we trained the machine on the training dataset (n = 10,534), the weighted average F1-scores (a measure of a ML model's performance) on the testing dataset (n = 2,634) ranged from 0.79 to 0.83 across the four models, and the RF model had the best performance (F1-score = 0.83). The total donor support for CGH projects predicted by the RF model was $2.24 billion across 3 years, which was very close to the $2.25 billion derived from coding and classification by humans. By applying the trained RF model to the 2019 dataset, we predicted that total funding for global CGH was about $2.7 billion for 730 CGH projects.

    Conclusion: We have demonstrated that NLP and ML can be a feasible and efficient way to classify health projects into different global CGH categories, and thus to track health funding for CGH routinely using data from publicly available databases.
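    As a hedged sketch of this kind of text-classification pipeline (TF-IDF features and the input/label column names are my assumptions; the study itself compared RF, XGBoost, SVM, and MNB after its own preprocessing):

    ```python
    # Sketch: classify project descriptions into CGH categories and report a weighted F1-score.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("oda_projects.csv")  # hypothetical pre-coded ODA project descriptions
    X_train, X_test, y_train, y_test = train_test_split(
        df["project_text"], df["cgh_category"], test_size=0.2, random_state=42
    )

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), RandomForestClassifier(n_estimators=300))
    clf.fit(X_train, y_train)
    print("weighted F1:", f1_score(y_test, clf.predict(X_test), average="weighted"))
    ```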

Cite
Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners

Machine Learning Basics for Beginners🤖🧠

Machine Learning Basics

Explore at:
zip(492015 bytes)Available download formats
Dataset updated
Jun 22, 2023
Authors
Bhanupratap Biswas
License

ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically

Description

This dataset provides an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

  1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

  2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

  3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

  4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

  5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives predicted correctly), and F1 score (a combination of precision and recall).

  6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

  7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

  8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

  9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

  10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
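As a small, hands-on example that ties several of these concepts together (a train/test split, a supervised algorithm from point 9, and evaluation metrics from point 5), the sketch below uses scikit-learn's bundled iris dataset; it is illustrative only and not tied to the files in this dataset.

```python
# Sketch: train a logistic regression classifier and evaluate it on held-out data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```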
