61 datasets found
  1. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall). A short illustrative sketch follows this list.

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
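
    The workflow described in concepts 1, 3, and 5 can be made concrete with a minimal, hedged sketch (assuming scikit-learn is installed; the data here are synthetic and purely illustrative):

    # Minimal sketch: train/test split, a supervised classifier, and the
    # evaluation metrics described above (assumes scikit-learn is installed).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Synthetic features and labels stand in for a real labeled dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Hold out test data to estimate generalization (concept 3).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Supervised learning on the labeled training data (concept 1).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluation metrics (concept 5).
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))
    print("f1 score :", f1_score(y_test, y_pred))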

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.

  2. Overview Metadata of Water-Quality Field Blank Data, Replicate Sample Data,...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 20, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Overview Metadata of Water-Quality Field Blank Data, Replicate Sample Data, Discharge Data, and Dissolved Solids Data [Dataset]. https://catalog.data.gov/dataset/overview-metadata-of-water-quality-field-blank-data-replicate-sample-data-discharge-data-a
    Explore at:
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Water quality replicate sample data and field blank data was collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral during 2017 and 2018. Instantaneous discharge data was collected at the Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019. Instantaneous discharge readings were recorded at a fixed interval of 5 minutes. Mean daily discharge data was collected at the Colorado River above Imperial Dam, Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019. Instantaneous discharge and mean daily discharge data was provided to the USGS by the International Boundary and Water Commission (IBWC). Discrete water-quality samples were collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral during 2017 and 2018 and through March 2019, and values were used to compute dissolved solids concentrations using BOR's method.

  3. Supplementary material to the journal article: An active learning framework...

    • figshare.com
    zip
    Updated Jul 25, 2023
    Cite
    John van Osta (2023). Supplementary material to the journal article: An active learning framework and assessment of inter-annotator agreement facilitate automated recogniser development for vocalisations of a rare species, the southern black-throated finch (Poephila cincta cincta) [Dataset]. http://doi.org/10.6084/m9.figshare.23053382.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    John van Osta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Primary author details

    John van Osta. ORCID: 0000-0001-6196-1241. Institution: Griffith University and E2M Pty Ltd, Queensland, Australia. Email: john.vanosta@griffithuni.edu.au

    Researchers and practitioners applying and adapting the data and code provided here are encouraged to contact the primary author should they require further information.

    Sharing/Access information: Licence CC BY 4.0. You are free to share and adapt the material provided here, provided appropriate attribution is given to the authors.

    Data and File Overview

    This repository provides the code base, supplementary results and example audio data to reproduce the findings of the research article: 'An active learning framework and assessment of inter-annotator agreement facilitate automated recogniser development for vocalisations of a rare species, the southern black-throated finch (Poephila cincta cincta)', published in the Journal of Ecological Informatics. Data included within this repository are listed below.

    Code base
    The code base includes:
    • train_resnet.ipynb: Trains a resnet34 model on target and non-target audio segments (each 1.8 seconds in duration). Outputs a trained model (as a pth file).
    • predict.ipynb: Applies the trained model to unlabelled data.
    • BTF_detector_v1.5: The latest version of the model, termed the 'final model' in the research article.
    • audio_file_extract.ipynb: Extracts audio frames in accordance with the active learning function, for the purpose of manual review and inclusion in the next iteration of model training.
    • stratified_subsample.ipynb: Used to subsample predictions on unlabelled data, stratified across the model prediction confidence scores (aka logits).
    • macro_averaged_error.ipynb: Calculates and plots the macro-averaged error of the model predictions against annotator labels.
    • inter_annotator_agreement.ipynb: Calculates and plots Krippendorff's alpha (a measure of inter-annotator agreement) among the model's active learning iterations and human annotators.
    • requirements.txt: Python package requirements to run the code base.

    Note: The code base has been written in Jupyter Notebooks and tested in Python version 3.6.9

    Supplementary files The file Stratified_subsample_inter_annotator_agreement.xlsx contains predictions from each model iteration and annotator labels for each of the 12,278 audio frames included in the model evaluation process, as described in the research article.

    Example audio data
    Example audio data provided include:
    • Target audio files (containing black-throated finch (BTF) calls) and non-target audio files (containing other environmental noises). These are split into Training and Validation sets. To follow an active learning process, each active learning 'iteration' gets added to a new folder (i.e. IT_1, IT_2, etc.).
    • Field recordings (10 minutes each), the majority of which contain BTF calls. These audio data were collected from a field site within the Desert Uplands Bioregion of Queensland, Australia, as described and mapped in the research article. Audio data were collected using two devices, Audiomoths and Bioacoustic Recorders (Frontier Labs), which have been separated into separate folders within 'Field_recordings'.

    Steps to reproduce

    General recommendations
    The code base has been written in Jupyter Notebooks and tested in Python version 3.6.9.
    1. Download the .zip file and extract it to a folder on your machine.
    2. Open a code editor that is suitable for working with Jupyter Notebook files. We recommend Microsoft's free software, Visual Studio Code (https://code.visualstudio.com/). If using Visual Studio Code, ensure the 'Python' and 'Jupyter' extensions are installed (https://code.visualstudio.com/docs/datascience/jupyter-notebooks).
    3. Within the code editor, open the downloaded file.
    4. Set up the Python environment by installing the package requirements identified within the requirements.txt file contained within the repository. The steps to set up a Python environment in Visual Studio Code are described here: https://code.visualstudio.com/docs/python/environments, or more generally for Python here: https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/. This will download the necessary Python packages to support the below code.

    Note: We recommend running the following steps on a Windows computer with an Nvidia graphics processing unit (GPU). The code has also been tested on a Windows computer with an Intel central processing unit (CPU), with a substantially slower runtime. Edits to the code may be required to run on a Macintosh computer or a non-Nvidia GPU; however, the core functionality will remain the same.

    Active learning iterations to develop the final model:
    1. Run train_resnet.ipynb to train a model from the initial target (BTF) and non-target (other environmental sounds) audio provided. The default name for the output model will be 'model.pth'; however, this may be adjusted manually by changing the 'MODEL_NAME' variable. The script also provides performance metrics and a confusion matrix against the validation dataset.
    2. Run predict.ipynb to make predictions on unlabelled data. The default code uses the final model (BTF_trained_model_v1.5.pth), as described in the research article; however, this may be adjusted to link to the model created in step 4 (by changing the 'model_path' variable). Results of this step are saved in the Sample_files\Predict_results folder.
    3. Run audio_file_extract.ipynb to extract 1.8 second audio snips that have a 'BTF' confidence score of >= 0.5. These are the sounds that range from most uncertain to the model through to most likely to be BTF. The logic for this cutoff is discussed in the research article's methods section. The default extraction location is 'Sample_files\Predict_results\Audio_frames_for_review' (a generic sketch of this selection step follows this list).
    4. Manually review extracted audio frames and move them to the appropriate folder of the training data. For example, for audio frames that are reviewed to contain:
    • BTF calls, move them to the filepath 'Sample_files\Training_clips\Train\BTF\IT_2'
    • Not BTF calls, move them to the filepath 'Sample_files\Training_clips\Train\Not BTF\IT_2'
    Here IT_2 represents the second active learning iteration. Ensure 30% of the files are allocated to the validation set ('Sample_files\Training_clips\Val'). Note that users will need to create subfolders for each successive iteration.
    5. Repeat steps 1 to 4, making sure to update the 'iterations' variable in the train_resnet.ipynb code to include all active learning iterations undertaken. For example, to include iterations 1 and 2 in the model, set the variable 'iterations' to equal ['IT_1', 'IT_2']. An example is provided in the train_resnet.ipynb code.
    6. Stop the active learning process when the stopping criterion is reached (e.g. when the F1 score plateaus).
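
    A generic, hedged sketch of the selection step in item 3 above (the repository's audio_file_extract.ipynb is the authoritative version; the column names here are assumptions, not taken from the repository):

    # Hedged sketch: keep 1.8 s frames whose 'BTF' confidence score is >= 0.5
    # and copy them aside for manual review (assumes pandas is installed).
    import shutil
    from pathlib import Path
    import pandas as pd

    def extract_for_review(predictions_csv: str, out_dir: str, threshold: float = 0.5) -> pd.DataFrame:
        preds = pd.read_csv(predictions_csv)               # assumed columns: 'clip_path', 'btf_score'
        selected = preds[preds["btf_score"] >= threshold]  # most-uncertain through most-likely BTF
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for clip in selected["clip_path"]:
            shutil.copy(clip, out_dir)                     # copied frames are then reviewed by hand
        return selected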

    Model evaluation steps
    1. Run predict.ipynb using the final model on an unlabelled test dataset. By default, the unlabelled audio data used are the example data saved at 'Sample_files\Field_recordings\Audiomoth'. However, this should be changed to data not used to train the model, such as 'Sample_files\Field_recordings\BAR', or your own audio data.
    2. Run stratified_subsample.ipynb to subsample the predictions that the final model made on the unlabelled data. A stratified subsample approach is used, whereby samples are stratified across confidence scores, as described in the research article (a generic sketch of this step follows this list). The default output file is 'stratified_subsample_predictions.csv'.
    3. We then manually reviewed the subsamples, including a cross review by experts on the species, as detailed in the research article. We have provided the results of our model evaluation: 'Study_results\Stratified_subsample_inter_annotator_agreement.xlsx'.
    4. Run macro_averaged_error.ipynb and inter_annotator_agreement.ipynb to reproduce the results and plots contained within the paper.
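
    A hedged sketch of the stratified subsampling idea in step 2 (the repository's stratified_subsample.ipynb is authoritative; the column name and bin count are assumptions):

    # Hedged sketch: subsample predictions so that each confidence-score bin
    # contributes (at most) the same number of frames (assumes pandas).
    import pandas as pd

    def stratified_subsample(preds: pd.DataFrame, per_bin: int = 50, n_bins: int = 10,
                             score_col: str = "btf_score", seed: int = 0) -> pd.DataFrame:
        bins = pd.cut(preds[score_col], bins=n_bins)  # equal-width bins over the confidence scores
        return (preds.groupby(bins, observed=True, group_keys=False)
                     .apply(lambda g: g.sample(min(len(g), per_bin), random_state=seed)))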

    Using the model on your own data The predict.ipynb code may be adapted to run the BTF call detection model on data outside of this repository.

    Notes for running on your own data:
    • Accepts wav or flac files
    • Accepts files from Audiomoth devices, using the file naming format: 'AM###_YYYMMDD_HHMMSS'
    • Accepts files from Bioacoustic Recorder Devices (Frontier Labs), using the file naming format: 'BAR##_YYYMMDDTHHMMSS+TZ_REC'

  4. ZEW Data Purchasing Challenge 2022

    • kaggle.com
    zip
    Updated Feb 8, 2022
    Cite
    Manish Tripathi (2022). ZEW Data Purchasing Challenge 2022 [Dataset]. https://www.kaggle.com/datasets/manishtripathi86/zew-data-purchasing-challenge-2022
    Explore at:
    Available download formats: zip (1162256319 bytes)
    Dataset updated
    Feb 8, 2022
    Authors
    Manish Tripathi
    Description

    Dataset Source: https://www.aicrowd.com/challenges/data-purchasing-challenge-2022

    🕵️ Introduction

    Data for machine learning tasks usually does not come for free but has to be purchased. The costs and benefits of data have to be weighed against each other. This is challenging. First, data usually has combinatorial value. For instance, different observations might complement or substitute each other for a given machine learning task. In such cases, the decision to purchase one group of observations has to be made conditional on the decision to purchase another group of observations. If these relationships are high-dimensional, finding the optimal bundle becomes computationally hard. Second, data comes at different levels of quality, for instance, with different levels of noise. Third, data has to be acquired under the assumption of being valuable out-of-sample. Distribution shifts have to be anticipated.

    In this competition, you face these data purchasing challenges in the context of a multi-label image classification task in a quality control setting.

    📑 Problem Statement

    In short: You have to classify images. Some images in your training set are labelled but most of them aren't. How do you decide which images to label if you have a limited budget to do so?

    In more detail: You face a multi-label image classification task. The dataset consists of synthetically generated images of painted metal sheets. A classifier is meant to predict whether the sheets have production damages and if so which ones. You have access to a set of images, a subset of which are labelled with respect to production damages. Because labeling is costly and your budget is limited, you have to decide for which of the unlabelled images labels should be purchased in order to maximize prediction accuracy.

    Each of the images has a 4-dimensional label representing the presence or the absence of ['scratch_small', 'scratch_large', 'dent_small', 'dent_large'] in the image.

    You are required to submit code, which can be run in three different phases:

    Pre-Training Phase

    In the Pre-Training Phase, your code will have access to 5,000 labelled images on a multi-label image classification task with 4 classes. It is up to you how you wish to use this data. For instance, you might want to pre-train a classification model.

    Purchase Phase

    In the Purchase Phase, your code, after going through the Pre-Training Phase, will have access to an unlabelled dataset of 10,000 images. You will have a budget of 3,000 label purchases that you can freely use across any of the images in the unlabelled dataset to obtain their labels. You are tasked with designing your own approach for selecting the optimal subset of 3,000 images in the unlabelled dataset, which would help you optimize your model's performance on the prediction task (a generic uncertainty-based sketch follows the phase descriptions). You can then continue training your model (which has been pre-trained in the Pre-Training Phase) using the newly purchased labels.

    Prediction Phase

    In the Prediction Phase, your code will have access to a test set of 3,000 unlabelled images, for which you have to generate and submit predictions. Your submission will be evaluated based on the performance of your predictions on this test set. Your code will have access to a node with 4 CPUs, 16 GB RAM, 1 NVIDIA T4 GPU, and 3 hours of runtime per submission. In the final round of this challenge, your code will be evaluated across multiple budget-runtime constraints.
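
    One generic way to spend the labelling budget (a hedged sketch, not the challenge's official baseline) is to purchase labels for the images the pre-trained model is most uncertain about:

    # Hedged sketch: entropy-based selection of images to purchase labels for.
    # `probs` stands in for an (N, 4) array of per-label probabilities from a
    # pre-trained multi-label classifier; names and values are illustrative.
    import numpy as np

    def purchase_indices(probs: np.ndarray, budget: int = 3000) -> np.ndarray:
        eps = 1e-12
        p = np.clip(probs, eps, 1 - eps)
        # Binary entropy of each independent label, summed over the 4 defect classes.
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=1)
        # Spend the budget on the most uncertain images.
        return np.argsort(-entropy)[:budget]

    probs = np.random.rand(10_000, 4)    # stand-in for real model outputs
    to_label = purchase_indices(probs)   # indices of images whose labels to purchase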

    💾 Dataset

    The datasets for this challenge can be accessed in the Resources Section.

    • training.tar.gz: The training set containing 5,000 images with their associated labels. During your local experiments you are allowed to use the data as you please.
    • unlabelled.tar.gz: The unlabelled set containing 10,000 images, and their associated labels. During your local experiments you are only allowed to access the labels through the provided purchase_label function.
    • validation.tar.gz: The validation set containing 3,000 images, and their associated labels. During your local experiments you are only allowed to use the labels of the validation set to measure the performance of your models and experiments.
    • debug.tar.gz: A small set of 100 images with their associated labels, that you can use for integration testing, and for trying out the provided starter kit.

    NOTE: While you run your local experiments on this dataset, your submissions will be evaluated on a dataset which might be sampled from a different distribution, and is not the same as this publicly released version.

    👥 Participation

    🖊 Evaluation Criteria

    The challenge will use the Accuracy Score, Hamming Loss and the Exact Match Ratio during evaluation. The primary score will be the Accuracy Score.

    📅 Timeline

    This challenge has two Rounds.

    Round 1 : Feb 4th – Feb 28th, 2022

    The first round submissions will be evaluated based on one budget-compute constraint pair (max. of 3,00...

  5. Unlabelled training datasets of AIS Trajectories from Danish Waters for...

    • data.dtu.dk
    bin
    Updated Jul 10, 2023
    + more versions
    Cite
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen (2023). Unlabelled training datasets of AIS Trajectories from Danish Waters for Abnormal Behavior Detection [Dataset]. http://doi.org/10.11583/DTU.21511842.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 10, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"

    DOI: https://doi.org/10.11583/DTU.c.6287841

    Using Deep Learning for detection of maritime abnormal behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour.

    The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.

    We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories cover, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from the normal course.

    These datasets consist of unlabelled trajectories for the purpose of training unsupervised models. For labelled datasets for evaluation, please refer to the collection (link in Related publications).

    The data is saved using the pickle format for Python. Each dataset is split into 2 files with the naming convention:

    datasetInfo_XXX
    data_XXX

    Files named "data_XXX" contain the extracted trajectories serialized sequentially one at a time and must be read as such (a minimal reading sketch follows below). Please refer to the provided utility functions for examples. Files named "datasetInfo_XXX" contain metadata related to the dataset and indices at which trajectories begin in the "data_XXX" files.
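
    A hedged sketch of reading one "data_XXX" file sequentially (the dataset's own utility functions are authoritative; this only illustrates the repeated pickle.load loop):

    # Hedged sketch: trajectories are pickled one after another in the same
    # file, so they must be loaded in a loop until the file is exhausted.
    import pickle

    def read_trajectories(path: str) -> list:
        trajectories = []
        with open(path, "rb") as f:
            while True:
                try:
                    trajectories.append(pickle.load(f))  # one pickle.load call per trajectory
                except EOFError:
                    break
        return trajectories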

    The data are sequences of maritime trajectories defined by their timestamp, latitude/longitude position, speed, course, and unique ship identifier MMSI. In addition, the dataset contains metadata related to creation parameters. The dataset has been limited to a specific time period, ship types, moving AIS navigational statuses, and filtered within a region of interest (ROI). Trajectories were split if exceeding an upper limit and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.

    Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl

    See datasheet for more detailed information and we refer to provided utility functions for examples on how to read and plot the data.

  6. Data from: Unlabeled samples generated by GAN improve the person...

    • resodate.org
    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    Z. Zheng; L. Zheng; Y. Yang (2024). Unlabeled samples generated by GAN improve the person re-identification baseline in vitro [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdW5sYWJlbGVkLXNhbXBsZXMtZ2VuZXJhdGVkLWJ5LWdhbi1pbXByb3ZlLXRoZS1wZXJzb24tcmUtaWRlbnRpZmljYXRpb24tYmFzZWxpbmUtaW4tdml0cm8=
    Explore at:
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Z. Zheng; L. Zheng; Y. Yang
    Description

    A dataset for unsupervised person re-identification using Generative Adversarial Networks (GANs).

  7. Replication Data for: Measuring the Significance of Policy Outputs with...

    • dataverse.harvard.edu
    Updated Oct 19, 2020
    Cite
    Radoslaw Zubek; Abhishek Dasgupta; David Doyle (2020). Replication Data for: Measuring the Significance of Policy Outputs with Positive Unlabeled Learning [Dataset]. http://doi.org/10.7910/DVN/1XXDMW
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Radoslaw Zubek; Abhishek Dasgupta; David Doyle
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/1XXDMW

    Description

    Identifying important policy outputs has long been of interest to political scientists. In this work, we propose a novel approach to the classification of policies. Instead of obtaining and aggregating expert evaluations of significance for a finite set of policy outputs, we use experts to identify a small set of significant outputs and then employ positive unlabeled (PU) learning to search for other similar examples in a large unlabeled set. We further propose to automate the first step by harvesting ‘seed’ sets of significant outputs from web data. We offer an application of the new approach by classifying over 9,000 government regulations in the United Kingdom. The obtained estimates are successfully validated against human experts, by forecasting web citations, and with a construct validity test.
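
    A hedged sketch of a simple positive-unlabeled baseline in the spirit described above (an Elkan & Noto-style probability adjustment; this is a generic illustration, not the authors' replication code, and all data here are synthetic):

    # Hedged sketch of PU learning: train labeled-vs-unlabeled, estimate the
    # labeling propensity c on held-out labeled positives, then rescale scores.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def pu_scores(X: np.ndarray, labeled_positive: np.ndarray) -> np.ndarray:
        s = labeled_positive.astype(int)                      # 1 = labeled significant, 0 = unlabeled
        X_tr, X_ho, s_tr, s_ho = train_test_split(X, s, test_size=0.2, stratify=s, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
        c = clf.predict_proba(X_ho[s_ho == 1])[:, 1].mean()   # estimate P(labeled | positive)
        return clf.predict_proba(X)[:, 1] / c                 # adjusted score of being a true positive

    X = np.random.randn(1000, 20)                              # stand-in document features
    labeled = np.zeros(1000, dtype=bool); labeled[:50] = True  # a small 'seed' set of significant outputs
    scores = pu_scores(X, labeled)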

  8. STL-10 Image Recognition Dataset

    • kaggle.com
    zip
    Updated Jun 11, 2018
    + more versions
    Cite
    Jessica Li (2018). STL-10 Image Recognition Dataset [Dataset]. https://www.kaggle.com/jessicali9530/stl10
    Explore at:
    Available download formats: zip (2017846807 bytes)
    Dataset updated
    Jun 11, 2018
    Authors
    Jessica Li
    Description

    Context

    STL-10 is an image recognition dataset inspired by the CIFAR-10 dataset, with some improvements. With a corpus of 100,000 unlabeled images and 500 training images, this dataset is well suited for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Unlike CIFAR-10, the dataset has a higher resolution, which makes it a challenging benchmark for developing more scalable unsupervised learning methods.

    Content

    Data overview:

    • There are three files: train_images.zip, test_images.zip, and unlabeled_images.zip
    • 10 classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck
    • Images are 96x96 pixels, color
    • 500 training images (10 pre-defined folds), 800 test images per class
    • 100,000 unlabeled images for unsupervised learning. These examples are extracted from a similar but broader distribution of images. For instance, it contains other types of animals (bears, rabbits, etc.) and vehicles (trains, buses, etc.) in addition to the ones in the labeled set
    • Images were acquired from labeled examples on ImageNet

    The original data source recommends the following standardized testing protocol for reporting results (a minimal loading sketch follows the list):

    1. Perform unsupervised training on the unlabeled data
    2. Perform supervised training on the labeled data using 10 (pre-defined) folds of 100 examples from the training data. The indices of the examples to be used for each fold are provided
    3. Report average accuracy on the full test set
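
    A minimal, hedged sketch of loading the STL-10 splits used in this protocol (assumes torchvision is installed; this is not the uploader's own code):

    # Hedged sketch: the three STL-10 splits behind the protocol above.
    from torchvision import datasets, transforms

    to_tensor = transforms.ToTensor()

    # Step 1: 100,000 unlabeled 96x96 images for unsupervised training.
    unlabeled = datasets.STL10(root="data", split="unlabeled", download=True, transform=to_tensor)

    # Step 2: supervised training on one of the 10 pre-defined training folds.
    train_fold0 = datasets.STL10(root="data", split="train", folds=0, download=True, transform=to_tensor)

    # Step 3: report average accuracy on the full test set.
    test = datasets.STL10(root="data", split="test", download=True, transform=to_tensor)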

    Acknowledgements

    Original data source and banner image: https://cs.stanford.edu/~acoates/stl10/

    Please cite the following reference when using this dataset:

    Adam Coates, Honglak Lee, Andrew Y. Ng. An Analysis of Single Layer Networks in Unsupervised Feature Learning. AISTATS, 2011.

    Inspiration

    • Can you train a model to accurately identify what animal or transportation object is in each image?

  9. Execution time average over all the data sets per unlabelled sample for the...

    • plos.figshare.com
    xls
    Updated Dec 15, 2023
    Cite
    Charlotte Nachtegael; Jacopo De Stefani; Tom Lenaerts (2023). Execution time average over all the data sets per unlabelled sample for the two first iteration of the AL process for each AL strategy with their standard deviation. [Dataset]. http://doi.org/10.1371/journal.pone.0292356.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Charlotte Nachtegael; Jacopo De Stefani; Tom Lenaerts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Execution time average over all the data sets per unlabelled sample for the two first iteration of the AL process for each AL strategy with their standard deviation.

  10. Data_Sheet_1_Building One-Shot Semi-Supervised (BOSS) Learning Up to Fully...

    • frontiersin.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Leslie N. Smith; Adam Conovaloff (2023). Data_Sheet_1_Building One-Shot Semi-Supervised (BOSS) Learning Up to Fully Supervised Performance.pdf [Dataset]. http://doi.org/10.3389/frai.2022.880729.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Leslie N. Smith; Adam Conovaloff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reaching the performance of fully supervised learning with unlabeled data and only one labeled sample per class might be ideal for deep learning applications. We demonstrate for the first time the potential for building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN that attains test accuracies comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential, and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class balancing methods substantially improve accuracy in semi-supervised learning, to levels that allow self-training to reach fully supervised performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class, rather than only a total test accuracy. We show that our BOSS methodology can obtain total test accuracies of up to 95% on CIFAR-10 with only one labeled sample per class (compared to 94.5% for fully supervised). Similarly, on SVHN it obtains a test accuracy of 97.8%, compared to 98.27% for fully supervised. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at https://github.com/lnsmith54/BOSS to facilitate replication.

  11. Brazilian Legal Proceedings

    • kaggle.com
    zip
    Updated May 14, 2021
    Cite
    Felipe Maia Polo (2021). Brazilian Legal Proceedings [Dataset]. https://www.kaggle.com/felipepolo/brazilian-legal-proceedings
    Explore at:
    Available download formats: zip (124024147 bytes)
    Dataset updated
    May 14, 2021
    Authors
    Felipe Maia Polo
    Description

    The Dataset

    These datasets were used while writing the following work:

    Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
    

    Please cite us if you use our datasets in your academic work:

    @inproceedings{polo2021predicting,
     title={Predicting legal proceedings status: approaches based on sequential text data},
     author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
     booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
     pages={264--265},
     year={2021}
    }
    

    More details below!

    Context

    Every legal proceeding in Brazil falls into one of three possible status classes: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. The classes are assigned at a specific instant in time and may be temporary or permanent. Moreover, they are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.

    In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.

    Content

    Our data is composed of two datasets: a dataset of ~3*10^6 unlabeled motions and a dataset containing 6449 legal proceedings, each with an individual, variable number of motions, which have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% as active (class 2), and 7.63% as suspended (class 3).

    The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most variable types of cases throughout Brazil and are responsible for 80% of the total amount of lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.

    Regarding the labeled dataset, the key "-1" denotes the most recent text, "-2" the second most recent, and so on.

    Acknowledgements

    We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.

    Inspiration

    Can you develop good machine learning classifiers for text sequences? :)

  12. Data from: Performance of unmarked abundance models with data from...

    • experts.esf.edu
    • data.niaid.nih.gov
    • +3more
    Updated Jul 11, 2024
    Cite
    Cameron Fiss; Samuel Lapp; Jonathan Cohen; Halie Parker; Jeffery T. Larkin; Jeffery L. Larkin; Justin Kitzes (2024). Data from: Performance of unmarked abundance models with data from machine-learning classification of passive acoustic recordings [Dataset]. https://experts.esf.edu/esploro/outputs/dataset/Data-from-Performance-of-unmarked-abundance/99944582404826
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Dryad
    Authors
    Cameron Fiss; Samuel Lapp; Jonathan Cohen; Halie Parker; Jeffery T. Larkin; Jeffery L. Larkin; Justin Kitzes
    Time period covered
    Jul 11, 2024
    Description

    The ability to conduct cost-effective wildlife monitoring at scale is rapidly increasing due to availability of inexpensive autonomous recording units (ARUs) and automated species recognition, presenting a variety of advantages over human-based surveys. However, estimating abundance with such data collection techniques remains challenging because most abundance models require data that are difficult for low-cost monoaural ARUs to gather (e.g., counts of individuals, distance to individuals), especially when using the output of automated species recognition. Statistical models that do not require counting or measuring distances to target individuals in combination with low-cost ARUs provide a promising way of obtaining abundance estimates for large-scale wildlife monitoring projects but remain untested. We present a case study using avian field data collected in forests of Pennsylvania during the Spring of 2020 and 2021 using both traditional point counts and passive acoustic monitoring at the same locations. We tested the ability of the Royle-Nichols and time-to-detection models to estimate abundance of two species from detection histories generated by applying a machine-learning classifier to ARU-gathered data. We compared abundance estimates from these models to estimates from the same models fit using point-count data and to two additional models appropriate for point counts, the N-mixture model and distance models. We found that the Royle-Nichols and time-to-detection models can be used with ARU data to produce abundance estimates similar to those generated by a point-count based study but with greater precision. ARU-based models produced confidence or credible intervals that were on average 31.9% (±11.9 SE) smaller than their point-count counterparts. Our findings were consistent across two species with differing relative abundance and habitat use patterns. The higher precision of models fit using ARU data is likely due to higher cumulative detection probability, which itself may be the result of greater survey effort using ARUs and machine-learning classifiers to sample significantly more time for focal species at any given point. Our results provide preliminary support for the use of ARUs in abundance-based study applications, and thus may afford researchers a better understanding of habitat quality and population trends, while allowing them to make more informed conservation actions and recommendations.

  13. Self-supervised retinal thickness prediction enables deep learning from...

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Olle Holmberg; Niklas D. Köhler; Thiago Martins; Jakob Siedlecki; Tina Herold; Leonie Keidel; Ben Asani; Johannes Schiefelbein; Siegfried Priglinger; Karsten U. Kortuem; Fabian J. Theis (2020). Self-supervised retinal thickness prediction enables deep learning from unlabeled data to boost classification of diabetic retinopathy [Dataset]. http://doi.org/10.5281/zenodo.3625996
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Olle Holmberg; Niklas D. Köhler; Thiago Martins; Jakob Siedlecki; Tina Herold; Leonie Keidel; Ben Asani; Johannes Schiefelbein; Siegfried Priglinger; Karsten U. Kortuem; Fabian J. Theis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository contains the OCT images and binary annotations for segmentation of retinal tissue using deep learning. To use, please refer to the Github repository https://github.com/theislab/DeepRT.

    #######

    Access to large, annotated samples represents a considerable challenge for training accurate deep-learning models in medical imaging. While current leading-edge transfer learning from pre-trained models can help with cases lacking data, it limits design choices, and generally results in the use of unnecessarily large models. We propose a novel, self-supervised training scheme for obtaining high-quality, pre-trained networks from unlabeled, cross-modal medical imaging data, which will allow for creating accurate and efficient models. We demonstrate this by accurately predicting optical coherence tomography (OCT)-based retinal thickness measurements from simple infrared (IR) fundus images. Subsequently, learned representations outperformed advanced classifiers on a separate diabetic retinopathy classification task in a scenario of scarce training data. Our cross-modal, three-staged scheme effectively replaced 26,343 diabetic retinopathy annotations with 1,009 semantic segmentations on OCT and reached the same classification accuracy using only 25% of fundus images, without any drawbacks, since OCT is not required for predictions. We expect this concept will also apply to other multimodal clinical data (imaging, health records, and genomics data) and be applicable to corresponding sample-starved learning problems.

    #######

  14. Data from USGS National Water Quality Laboratory methods used to calculate...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data from USGS National Water Quality Laboratory methods used to calculate and compare detection limits estimated using single- and multi-concentration spike-based and blank-based procedures [Dataset]. https://catalog.data.gov/dataset/data-from-usgs-national-water-quality-laboratory-methods-used-to-calculate-and-compare-det
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This dataset provides the expected and determined concentrations of selected inorganic and organic analytes for spiked reagent-water samples (calibration standards and limit of quantitation standards) that were used to calculate detection limits by using the United States Environmental Protection Agency’s (USEPA) Method Detection Limit (MDL) version 1.11 or 2.0 procedures, ASTM International’s Within-Laboratory Critical Level standard procedure D7783-13, and, for five pharmaceutical compounds, by USEPA’s Lowest Concentration Minimum Reporting Level procedure. Also provided are determined concentration data for reagent-water laboratory blank samples, classified as either instrument blank or set blank samples, and reagent-water blind-blank samples submitted by the USGS Quality System Branch, that were used to calculate blank-based detection limits by using the USEPA MDL version 2.0 procedure or procedures described in National Water Quality Laboratory Technical Memorandum 2016.02, http://wwwnwql.cr.usgs.gov/tech_memos/nwql.2016-02.pdf. The determined detection limits are provided and compared in the related external publication at https://doi.org/10.1016/j.talanta.2021.122139.

  15. Point Blank, TX Age Group Population Dataset: A Complete Breakdown of Point...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Point Blank, TX Age Group Population Dataset: A Complete Breakdown of Point Blank Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/453fb3cc-f122-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Point Blank, Texas
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. For the age groups, we divided the range into roughly 5-year buckets for ages between 0 and 85. For ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Point Blank population distribution across 18 age groups. It lists the population in each age group along with the percentage population relative of the total population for Point Blank. The dataset can be utilized to understand the population distribution of Point Blank by age. For example, using this dataset, we can identify the largest age group in Point Blank.

    Key observations

    The largest age group in Point Blank, TX was the 60 to 64 years age group, with a population of 106 (13.04%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Point Blank, TX was the 10 to 14 years age group, with a population of 6 (0.74%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in Point Blank is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of the Point Blank total population. Please note that the sum of all percentages may not equal 100% due to rounding of values.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Point Blank Population by Age. You can refer to the same here.

  16. Metric and attribute data for blank sample (LU2 and LU3).

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 16, 2024
    Cite
    Johnson, Corey L.; Gunchinsuren, Byambaa; Grote, Mark N.; Lkhundev, Guunii; Odsuren, Davaakhuu; Izuho, Masami; Bolorbat, Tsedendorj; Paine, Clea H.; Zwyns, Nicolas (2024). Metric and attribute data for blank sample (LU2 and LU3). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001454498
    Explore at:
    Dataset updated
    Aug 16, 2024
    Authors
    Johnson, Corey L.; Gunchinsuren, Byambaa; Grote, Mark N.; Lkhundev, Guunii; Odsuren, Davaakhuu; Izuho, Masami; Bolorbat, Tsedendorj; Paine, Clea H.; Zwyns, Nicolas
    Description

    Metric and attribute data for blank sample (LU2 and LU3).

  17. Amos: A large-scale abdominal multi-organ benchmark for versatile medical...

    • zenodo.org
    zip
    Updated Nov 7, 2022
    + more versions
    Cite
    ji yuanfeng (2022). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part III) [Dataset]. http://doi.org/10.5281/zenodo.7295816
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    ji yuanfeng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

    In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...). The link can be found in:

    if you found this dataset useful for your research, please cite:

    @article{ji2022amos,
     title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
     author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
     journal={arXiv preprint arXiv:2206.08023},
     year={2022}
    }
  18. Data from: DeepMoney: Counterfeit Money Detection Using Generative...

    • figshare.com
    application/x-rar
    Updated Aug 8, 2019
    Cite
    Toqeer Ali; Salman Jan (2019). DeepMoney: Counterfeit Money Detection Using Generative Adversarial Networks [Dataset]. http://doi.org/10.6084/m9.figshare.9164510.v3
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    Aug 8, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Toqeer Ali; Salman Jan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Conventional paper currency and modern electronic currency are two important modes of transactions. In several parts of the world, conventional methodology has clear precedence over its electronic counterpart. However, the identification of forged currency paper notes is now becoming an increasingly crucial problem because of the new and improved tactics employed by counterfeiters. In this paper, a machine-assisted system, dubbed DeepMoney, is proposed which has been developed to discriminate fake notes from genuine ones. For this purpose, state-of-the-art models of machine learning called Generative Adversarial Networks (GANs) are employed. GANs use unsupervised learning to train a model that can then be used to perform supervised predictions. This flexibility provides the best of both worlds by allowing unlabelled data to be trained on whilst still making concrete predictions. This technique was applied to Pakistani banknotes. State-of-the-art image processing and feature recognition techniques were used to design the overall approach of a valid input. Augmented samples of images were used in the experiments, which show that a high-precision machine can be developed to recognize genuine paper money. An accuracy of 80% has been achieved. The code is available as open source to allow others to reproduce and build upon the efforts already made.

  19. Data from: Benchmarking Machine Learning Models for Polymer Informatics: An...

    • acs.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Lei Tao; Vikas Varshney; Ying Li (2023). Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature [Dataset]. http://doi.org/10.1021/acs.jcim.1c01031.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Lei Tao; Vikas Varshney; Ying Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than laborious experimental measurements when confronted with a daunting number of polymer structures. Various ML models have been demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, using different structure representations, and based on different feature engineering methods. Thus, a critical question arises: how should one select an ML model that handles Tg prediction with good generalization ability? To provide a fair comparison of different ML techniques and examine the key factors that affect model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are the structure representation, the feature representation, and the ML algorithm. In terms of polymer structure representation, we consider the polymer monomer, the repeat unit, and oligomers with longer chain structures. Based on these, feature representations are calculated, including Morgan fingerprints with or without substructure frequency, RDKit descriptors, molecular embeddings, molecular graphs, etc. Afterward, the resulting feature inputs are used to train different ML algorithms, such as deep neural networks, convolutional neural networks, random forests, support vector machines, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. Particular attention is paid to each model's generalization ability on the unlabeled data set, and the models' sensitivity to polymer topology and molecular weight is also taken into consideration. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.
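
    As a rough illustration of one branch of the benchmark (not the authors' code), the sketch below turns monomer SMILES into Morgan fingerprints with RDKit and fits a random forest regressor; the SMILES strings and Tg values are invented placeholders.

        import numpy as np
        from rdkit import Chem
        from rdkit.Chem import AllChem
        from sklearn.ensemble import RandomForestRegressor

        smiles = ["C=Cc1ccccc1", "C=C(C)C(=O)OC", "C=CC#N"]   # placeholder monomers
        tg_values = [373.0, 378.0, 368.0]                     # placeholder Tg values (K)

        def featurize(smi: str, n_bits: int = 2048) -> np.ndarray:
            """Morgan fingerprint (radius 2) of one SMILES string as a 0/1 vector."""
            mol = Chem.MolFromSmiles(smi)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
            return np.array([int(b) for b in fp.ToBitString()], dtype=np.float32)

        X = np.stack([featurize(s) for s in smiles])
        y = np.array(tg_values)

        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X, y)
        print(model.predict(X[:1]))   # in-sample prediction, just to exercise the pipeline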

  20. DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jan 26, 2023
    + more versions
    Cite
    Huang, Yue; Zhang, Liuchao; He, Jia; Li, Kang; Rong, Zhiwei; Xu, Zhenyi; Ji, Jianxin; Hou, Yan; Liu, Weisha (2023). DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based framework for classification and feature selection in drug research and development.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000994299
    Explore at:
    Dataset updated
    Jan 26, 2023
    Authors
    Huang, Yue; Zhang, Liuchao; He, Jia; Li, Kang; Rong, Zhiwei; Xu, Zhenyi; Ji, Jianxin; Hou, Yan; Liu, Weisha
    Description

    Predicting the response to drugs before initiating therapy based on transcriptome data is a major challenge, and obtaining reliable drug response labels costs time and resources. Available methods often predict poorly and fail to identify robust biomarkers because of the curse of dimensionality: high feature dimensionality combined with low sample size. This necessitates predictive models that can effectively predict drug response from limited labeled data while remaining interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework that predicts drug response from transcriptome data using few labeled samples plus additional unlabeled data. HiRAND integrates information from the gene graph and the sample graph via graph convolutional networks (GCNs). The innovation of our model is to leverage a data augmentation strategy to address the scarcity of labeled data and to use consistency regularization to encourage consistent predictions on unlabeled data across different augmentations. The results showed that HiRAND achieved better performance than competing methods in various prediction scenarios, on both simulated data and multiple drug response data sets. Across all 62 drugs, HiRAND's predictive performance was best for the drug vorinostat. In addition, interpreting HiRAND identified the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. HiRAND can thus serve as an efficient framework for improving drug response prediction when few labeled data are available.
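
    The snippet below is a highly simplified sketch of the consistency-regularization idea described above, not the HiRAND implementation: a plain multilayer perceptron stands in for the GCN, and the augmentation, loss weighting, and data are placeholders.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))
        ce = nn.CrossEntropyLoss()

        x_lab = torch.randn(16, 100)          # labeled expression profiles (toy data)
        y_lab = torch.randint(0, 2, (16,))    # responder / non-responder
        x_unlab = torch.randn(64, 100)        # unlabeled profiles

        def augment(x):
            """Toy augmentation: random feature dropout plus Gaussian noise."""
            mask = (torch.rand_like(x) > 0.1).float()
            return x * mask + 0.01 * torch.randn_like(x)

        supervised = ce(model(x_lab), y_lab)

        # Two stochastic views of the same unlabeled samples should agree.
        p1 = F.softmax(model(augment(x_unlab)), dim=1)
        p2 = F.softmax(model(augment(x_unlab)), dim=1)
        consistency = F.mse_loss(p1, p2)

        loss = supervised + 1.0 * consistency   # the weight is a free hyperparameter
        loss.backward()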

Cite
Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners

Machine Learning Basics for Beginners🤖🧠

Machine Learning Basics

Explore at:
zip (492015 bytes)
Available download formats
Dataset updated
Jun 22, 2023
Authors
Bhanupratap Biswas
License

ODC Public Domain Dedication and Licence (PDDL) v1.0
http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically

Description

Sure! I'd be happy to provide you with an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

  1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

  2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

  3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

  4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

  5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives predicted correctly), and F1 score (a combination of precision and recall).

  6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

  7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

  8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

  9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases; a couple of them appear in the short sketch after this list.

  10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
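
Here is a short, self-contained scikit-learn sketch of a few of the algorithms named in items 9 and 10, run on synthetic data; the data set, model choices, and hyperparameters are purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import accuracy_score

    # Synthetic two-class data stands in for a real labeled dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Supervised learning: fit on labeled training data, evaluate on held-out data.
    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, "accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))

    # Unsupervised learning: no labels are passed to either estimator.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    X_2d = PCA(n_components=2).fit_transform(X)
    print("cluster sizes:", (clusters == 0).sum(), (clusters == 1).sum())
    print("PCA output shape:", X_2d.shape)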

These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
