100+ datasets found
  1. Titanic Dataset - Machine Learning from Disaster

    • kaggle.com
    zip
    Updated Sep 20, 2022
    Cite
    Aman Chauhan (2022). Titanic Dataset - Machine Learning from Disaster [Dataset]. https://www.kaggle.com/datasets/whenamancodes/titanic-dataset-machine-learning-from-disaster
    Available download formats: zip (34877 bytes)
    Dataset updated
    Sep 20, 2022
    Authors
    Aman Chauhan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    The data has been split into two groups:

    • training set (train.csv)
    • test set (test.csv)

    The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

    The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

    We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
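
    The gender-based baseline described above can be sketched in a few lines. This is a minimal illustration, not the competition's own script: it assumes the column names `PassengerId`, `Sex`, and `Survived`, and the three inline rows are made up to stand in for test.csv.

```python
import csv
import io

# Made-up stand-in for test.csv; only the two columns the baseline needs.
TEST_CSV = """PassengerId,Sex
892,male
893,female
894,male
"""

def gender_baseline(test_file):
    """Predict Survived=1 for female passengers and 0 for everyone else."""
    reader = csv.DictReader(test_file)
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"].strip().lower() == "female" else 0,
        }
        for row in reader
    ]

def write_submission(predictions, out_file):
    """Write predictions in a two-column submission layout (assumed)."""
    writer = csv.DictWriter(out_file, fieldnames=["PassengerId", "Survived"])
    writer.writeheader()
    writer.writerows(predictions)

predictions = gender_baseline(io.StringIO(TEST_CSV))
buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

    Any trained model can replace `gender_baseline` as long as it returns one prediction per passenger in the test set.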

    Data Dictionary:

    | Variable | Definition | Key |
    | --- | --- | --- |
    | survival | Survival | 0 = No, 1 = Yes |
    | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
    | sex | Sex | |
    | Age | Age in years | |
    | sibsp | # of siblings / spouses aboard the Titanic | |
    | parch | # of parents / children aboard the Titanic | |
    | ticket | Ticket number | |
    | fare | Passenger fare | |
    | cabin | Cabin number | |
    | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

    Variable Notes

    pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

    age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

    sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

    parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
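
    The variable notes above lend themselves to simple feature engineering. The sketch below is one possible illustration: the feature names `family_size` and `is_alone` and both helper functions are hypothetical and not part of the dataset.

```python
def family_features(sibsp, parch):
    """Derive two illustrative features from the sibsp and parch counts."""
    family_size = sibsp + parch + 1  # +1 counts the passenger themselves
    return {"family_size": family_size, "is_alone": family_size == 1}

def age_is_estimated(age):
    """Flag ages of the form xx.5; ages under 1 are fractional anyway."""
    if age is None or age < 1:
        return False
    return age % 1 == 0.5

print(family_features(sibsp=1, parch=2))
print(age_is_estimated(28.5), age_is_estimated(0.5), age_is_estimated(30.0))
```

    Note that `is_alone` inherits the caveat from the parch note: a child travelling only with a nanny has parch=0 and may be flagged as alone.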


  2. Machine learning model that estimates total monthly and annual per capita...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 8, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-that-estimates-total-monthly-and-annual-per-capita-public-supply-wa
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public supply water use for the period 2000-2020. This data release contains model input feature datasets, python codes used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public supply water use estimates and statistics for WSAs are available in public_water_use_model.zip.

    This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day
    • STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply total water use from 2000-2020
    • STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply groundwater use for 2000-2020
    • STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public supply surface water use for 2000-2020
    • public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public supply water use machine learning model
    • version_history_MLmodel.txt - a txt file describing changes in this version

    Note: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.

  3. Fraudulent Financial Transaction Prediction

    • kaggle.com
    zip
    Updated Feb 15, 2025
    Cite
    Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
    Available download formats: zip (41695207 bytes)
    Dataset updated
    Feb 15, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Fraud Detection with Imbalanced Data

    Overview
    This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

    Key Points
    - Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
    - Multiple Feature Files: Combine them by matching on id or Group.
    - Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
    - Goal: Predict which transactions in test_share.csv might be fraudulent.

    Files in this Dataset

    1. train.csv

      • Rows: 227,845 (example size)
      • Columns: 28
      • Description: Contains historical transaction data for training a fraud detection model.
      • Important: The Target column (0 = Clean, 1 = Fraud).
    2. test_share.csv

      • Rows: 56,962 (example size)
      • Columns: 27
      • Description: Test dataset, with the same structure as train.csv but without the Target column.
    3. Geo_scores.csv

      • Columns: (id, geo_score)
      • Description: Location-based geospatial scores for each transaction.
    4. Lambda_wts.csv

      • Columns: (Group, lambda_wt)
      • Description: Proprietary “lambda” weights associated with each Group.
    5. Qset_tats.csv

      • Columns: (id, qsets_normalized_tat)
      • Description: Network turn-around times (TAT) for each transaction.
    6. instance_scores.csv

      • Columns: (id, instance_scores)
      • Description: Vulnerability or risk qualification scores for each transaction.

    Suggested Usage

    1. Load all CSVs into dataframes.
    2. Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.
    3. Explore the severe class imbalance in train.csv (Target ~1% is fraud).
    4. Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.
    5. Predict on test_share.csv or your own external data.
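
    Steps 1 and 2 above amount to a left join on `id` or `Group`. The sketch below uses only the Python standard library to stay dependency-free (with pandas, a merge on the same key would play the same role). The inline rows and values are made up, and `left_join` is a hypothetical helper that assumes each key appears at most once in the lookup file.

```python
import csv
import io

# Tiny made-up stand-ins for the real files; values are illustrative only.
TRAIN = "id,Group,Target\n1,Grp1,0\n2,Grp2,1\n3,Grp1,0\n"
GEO = "id,geo_score\n1,0.25\n2,-1.5\n"
LAMBDA = "Group,lambda_wt\nGrp1,0.1\nGrp2,0.9\n"

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def left_join(rows, lookup_rows, key):
    """Attach columns from lookup_rows to rows by matching on key.

    Rows without a match keep only their original columns.
    If a key repeats in lookup_rows, the last occurrence wins.
    """
    lookup = {r[key]: r for r in lookup_rows}
    merged = []
    for row in rows:
        extra = lookup.get(row[key], {})
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged

train = read_rows(TRAIN)
train = left_join(train, read_rows(GEO), key="id")
train = left_join(train, read_rows(LAMBDA), key="Group")
for row in train:
    print(row)
```

    Checking how many rows end up without a `geo_score` after the join is a quick way to see whether the extra files cover every transaction.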

    Possible Tools:
    - Python: pandas, NumPy, scikit-learn
    - Imbalance Handling: SMOTE, Random Oversampler, or class weights
    - Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
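
    To make the class-weight and metric suggestions above concrete, here is a small standard-library sketch. The weight formula n_samples / (n_classes * class_count) is the "balanced" heuristic that scikit-learn applies for `class_weight="balanced"`; the labels are made up, and both function names are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 2 fraud cases among 100 transactions.
y_true = [1, 1] + [0] * 98
always_clean = [0] * 100        # 98% accurate, yet finds no fraud
one_hit = [1, 0, 1] + [0] * 97  # one true positive, one false positive

weights = balanced_class_weights(y_true)
print(weights)
print(precision_recall_f1(y_true, always_clean))
print(precision_recall_f1(y_true, one_hit))
```

    The `always_clean` case shows why accuracy alone is misleading here: a model that never flags fraud scores 98% accuracy on these labels but zero recall.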

    Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!

    Tags

    • fraud-detection
    • classification
    • imbalanced-data
    • financial-transactions
    • machine-learning
    • python
    • beginner-friendly

    License: CC BY-NC-SA 4.0

  4. Training CNNs with Low-Rank Filters for Efficient Image Classification:...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    application/gzip, csv
    Updated Jan 24, 2020
    Cite
    Yani Ioannou (2020). Training CNNs with Low-Rank Filters for Efficient Image Classification: Trained Models [Dataset]. http://doi.org/10.5281/zenodo.53189
    Available download formats: application/gzip, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yani Ioannou
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744

    Model names differ from those in the paper, but the csv file for each set of experiments maps the paper's name for each model to its actual name here:

    • cifarma.csv: Network-in-Network CIFAR10 Models
    • mitma.csv: MIT Places Models
    • googlenetma.csv: GoogLeNet ILSVRC2012 Models
    • vggma.csv: VGG-11 ILSVRC2012 Models

  5. Data and scripts from: “Denoising autoencoder for reconstructing sensor...

    • search.dataone.org
    Updated Aug 25, 2025
    + more versions
    Cite
    Timothy Johnsen; Xiangyu Bi; Chunwei Chou; Charuleka Varadharajan; Yuxin Wu; Jonathan Skone; Lavanya Ramakrishnan (2025). Data and scripts from: “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification” [Dataset]. http://doi.org/10.15485/2561511
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    ESS-DIVE
    Authors
    Timothy Johnsen; Xiangyu Bi; Chunwei Chou; Charuleka Varadharajan; Yuxin Wu; Jonathan Skone; Lavanya Ramakrishnan
    Time period covered
    Oct 12, 2019 - Apr 11, 2023
    Area covered
    Description

    This data package includes data and scripts from the manuscript “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification”. The study addressed common challenges faced in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scaled meteorological and soil sensor observations were perturbed with noises and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration. This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

    This package includes:

    (1) Four ipython scripts (.ipynb): “DAE_train.ipynb” trains and evaluates DAE neural networks, “DAE_predict.ipynb” makes predictions from the trained DAE models, “ET_train.ipynb” trains and evaluates ET prediction neural networks, and “ET_predict.ipynb” makes predictions from trained ET models.
    (2) One python file (.py): “methods.py” includes all user-defined functions and python codes used in the ipython scripts.
    (3) A “sub_models” folder that includes five trained DAE neural networks (in pytorch format, .pt), which could be used to ingest input data before being fed to the downstream ET models in “ET_train.ipynb” or “ET_predict.ipynb”.
    (4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in “df_data.csv”, where “df_meta.csv” contains the location and time information of “df_data.csv”. Each row (index) in “df_meta.csv” corresponds to each row in “df_data.csv”. These data files are formatted to follow the data structure requirements and be directly used in the ipython scripts, and they have been shuffled chronologically to train machine learning models. The meteorological and soil data was collected using point sensors between 2019-2023 at:
      (4.a) Three shrub-dominated field sites in East River, Colorado (named “ph1”, “ph2” and “sg5” in “df_meta.csv”, where “ph1” and “ph2” were located at PumpHouse Hillslopes, and “sg5” was at Snodgrass Mountain meadow), and
      (4.b) One outdoor, mesoscale, and herbaceous-dominated experiment in Berkeley, California (named “tb” in “df_meta.csv”, short for Smartsoils Testbed at Lawrence Berkeley National Lab).

    - See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions.
    - All ipython scripts and python files are written in and require PYTHON language software.

  6. Ransomware and user samples for training and validating ML models

    • data.mendeley.com
    Updated Sep 17, 2021
    + more versions
    Cite
    Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
    Dataset updated
    Sep 17, 2021
    Authors
    Eduardo Berrueta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host is capable of locking access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

    This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separate folder.

    The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
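
    As a small illustration of the NxSy naming scheme, the sketch below splits a folder name into its two numbers. The function name `parse_config` and the regular expression are assumptions made for this example; they are not part of the dataset.

```python
import re

# N<intervals>S<step>, e.g. "N10S10"; pattern assumed from the description above.
FOLDER_PATTERN = re.compile(r"^N(\d+)S(\d+)$")

def parse_config(folder_name):
    """Split a folder name such as 'N10S10' into (intervals, step_seconds)."""
    match = FOLDER_PATTERN.match(folder_name.rstrip("/"))
    if match is None:
        raise ValueError(f"not an NxSy folder name: {folder_name!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_config("N10S10/"))
print(parse_config("N30S30"))
```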

    Each folder (for example N10S10/) contains:

    • tree.py -> Python script with the Tree model.
    • ensemble.json -> JSON file with the information about the Ensemble model.
    • NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
    • N10S10.csv -> All samples used for training each model in this folder. It is in csv format for use in the bigML application.
    • zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_test -> All samples used for validating each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_train -> User samples used for training the models.
    • ransomware_train -> Ransomware samples used for training the models.
    • scaler.scaler -> Standard Scaler from a Python library, used to scale the samples.
    • zeroDays_notFiltered -> Folder with the zeroDay samples.

    In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article because some of them are not "unseen" binaries (their families are present in the training set).

    The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

    • Each line is one sample.
    • Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
    • The features are separated by ',' because it is a csv file.
    • The last column is the label of the sample.
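
    The sample layout described above (3*T comma-separated features followed by a 0/1 label) can be checked with a short parser. This is a sketch under assumptions: `parse_sample` is a hypothetical helper, T is supplied by the caller, the files are assumed to have no header row, and the example line is made up.

```python
def parse_sample(line, t):
    """Parse one CSV line into (features, label), expecting 3*t features."""
    fields = line.strip().split(",")
    expected = 3 * t + 1  # 3*T features plus the trailing label
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    features = [float(value) for value in fields[:-1]]
    label = int(float(fields[-1]))
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label}")
    return features, label

# Made-up sample with T=2: six features, then the label (1 = infected).
features, label = parse_sample("0.0,12.0,3.5,1.0,0.0,7.0,1", t=2)
print(len(features), label)
```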

    Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.

  7. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    • A clean, pre-defined 80/20 train-test split.
    • Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
    • A flat directory structure (train/, test/) for simplified file access.

    File Content

    The dataset is organized into a single top-level folder and two CSV files:

    • train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
    • test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
    • Caltech-256_Train_Test/: The primary data folder.
      • train/: This directory contains 80% of the images from all 257 categories, intended for model training.
      • test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.

    Data Split

    The dataset has been thoroughly partitioned to create a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
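
    Because the description only says the split "should be assumed to be" stratified, it may be worth checking the per-class proportions directly from the manifests. The sketch below uses the two documented columns (image_path, label); the file paths, class names, and row counts are made up, and `train_fraction_per_class` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up manifests using the documented columns: image_path and label.
TRAIN_CSV = """image_path,label
train/a1.jpg,ak47
train/a2.jpg,ak47
train/a3.jpg,ak47
train/a4.jpg,ak47
train/b1.jpg,bat
train/b2.jpg,bat
train/b3.jpg,bat
train/b4.jpg,bat
"""
TEST_CSV = """image_path,label
test/a5.jpg,ak47
test/b5.jpg,bat
"""

def label_counts(manifest_text):
    """Count how many images each label has in one manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return Counter(row["label"] for row in reader)

def train_fraction_per_class(train_text, test_text):
    """Return, for each label, the share of its images that are in train."""
    train, test = label_counts(train_text), label_counts(test_text)
    return {
        label: train[label] / (train[label] + test[label])
        for label in sorted(set(train) | set(test))
    }

fractions = train_fraction_per_class(TRAIN_CSV, TEST_CSV)
print(fractions)
```

    On the real manifests, classes whose fraction is far from 0.8 would indicate that the split is not stratified for them.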

    Acknowledgements & Original Source

    This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G. Holub, A.D. Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  8. Disease Prediction Using Machine Learning

    • kaggle.com
    zip
    Updated May 15, 2020
    + more versions
    Cite
    KAUSHIL268 (2020). Disease Prediction Using Machine Learning [Dataset]. https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning/discussion
    Available download formats: zip (30490 bytes)
    Dataset updated
    May 15, 2020
    Authors
    KAUSHIL268
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    At a time when machine learning and deep learning are booming, it is important to understand that all this knowledge is of little use if we cannot apply it to different areas and have an impact on humanity.

    This dataset will help you put your existing knowledge to good use. Applying knowledge to the field of medical science and making the physician's task easier is the main purpose of this dataset. This dataset has 132 parameters from which 42 different types of diseases can be predicted.

    All the best!

    Content

    The complete dataset consists of 2 CSV files. One of them is for training and the other is for testing your model.

    Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences, and the last column is the prognosis.

    These symptoms are mapped to 42 diseases, into which you can classify a given set of symptoms.

    You are required to train your model on the training data and test it on the testing data.
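
    Given the layout described above (symptom columns followed by a final prognosis column), the features and label can be separated by position. The sketch below is illustrative only: the three symptom names, the 0/1 encoding, and the disease names are made up, and the real files have 132 symptom columns.

```python
import csv
import io

# Made-up three-symptom stand-in; the last column plays the role of prognosis.
TRAINING_CSV = """itching,skin_rash,cough,prognosis
1,1,0,Fungal infection
0,0,1,Common Cold
"""

def load_symptoms(text):
    """Return (symptom_names, X, y), taking the last column as the label."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    symptom_names = header[:-1]
    X = [[int(value) for value in row[:-1]] for row in body]  # assumes 0/1 flags
    y = [row[-1] for row in body]
    return symptom_names, X, y

symptom_names, X, y = load_symptoms(TRAINING_CSV)
print(symptom_names)
print(X, y)
```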

    Inspiration

    Just make your best effort to make the world a better place by applying all the knowledge you have to different fields.

  9. Disease Prediction Using Machine Learning

    • dataandsons.com
    csv, zip
    Updated Oct 31, 2022
    Cite
    test test (2022). Disease Prediction Using Machine Learning [Dataset]. https://www.dataandsons.com/categories/machine-learning/disease-prediction-using-machine-learning
    Available download formats: csv, zip
    Dataset updated
    Oct 31, 2022
    Dataset provided by
    Data & Sons
    Authors
    test test
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    About this Dataset

    This dataset will help you put your existing knowledge to good use. This dataset has 132 parameters from which 42 different types of diseases can be predicted. This dataset consists of 2 CSV files. One of them is for training and the other is for testing your model. Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences and the last column is the prognosis. These symptoms are mapped to 42 diseases, into which you can classify a given set of symptoms. You are required to train your model on the training data and test it on the testing data.

    Category

    Machine Learning

    Keywords

    medicine,disease,Healthcare,ML,Machine Learning

    Row Count

    4962

    Price

    $109.00

  10. Bio-logger Ethogram Benchmark: A benchmark for computational analysis of...

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +4more
    Updated Apr 19, 2024
    + more versions
    Cite
    Hoffman, Benjamin; Cusimano, Maddie; Baglione, Vittorio; Canestrari, Daniela; Chevallier, Damien; DeSantis, Dominic L.; Jeantet, Lorène; Ladds, Monique A.; Maekawa, Takuya; Mata-Silva, Vicente; Moreno-González, Víctor; Trapote, Eva; Vainio, Outi; Vehkaoja, Antti; Yoda, Ken; Zacarian, Katherine; Friedlaender, Ari (2024). Bio-logger Ethogram Benchmark: A benchmark for computational analysis of animal behavior, using animal-borne tags [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7807280
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Universidad de León
    University of California, Santa Cruz
    Earth Species Project
    Centre national de la recherche scientifique Borea
    University of Texas, El Paso
    Tampere University
    Georgia College & State University
    University of Helsinki
    Department of Conservation, New Zealand
    Nagoya University
    Osaka University
    African Institute for Mathematical Sciences, Stellenbosch University
    Authors
    Hoffman, Benjamin; Cusimano, Maddie; Baglione, Vittorio; Canestrari, Daniela; Chevallier, Damien; DeSantis, Dominic L.; Jeantet, Lorène; Ladds, Monique A.; Maekawa, Takuya; Mata-Silva, Vicente; Moreno-González, Víctor; Trapote, Eva; Vainio, Outi; Vehkaoja, Antti; Yoda, Ken; Zacarian, Katherine; Friedlaender, Ari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the datasets and experiment results presented in our arxiv paper:

    B. Hoffman, M. Cusimano, V. Baglione, D. Canestrari, D. Chevallier, D. DeSantis, L. Jeantet, M. Ladds, T. Maekawa, V. Mata-Silva, V. Moreno-González, A. Pagano, E. Trapote, O. Vainio, A. Vehkaoja, K. Yoda, K. Zacarian, A. Friedlaender, "A benchmark for computational analysis of animal behavior, using animal-borne tags," 2023.

    Standardized code to implement, train, and evaluate models can be found at https://github.com/earthspecies/BEBE/.

    Please note the licenses in each dataset folder.

    Zip folders beginning with "formatted": These are the datasets we used to run the experiments reported in the benchmark paper.

    Zip folders beginning with "raw": These are the unprocessed datasets used in BEBE. Code to process these raw datasets into the formatted ones used by BEBE can be found at https://github.com/earthspecies/BEBE-datasets/.

    Zip folders beginning with "experiments": Results of the cross-validation experiments reported in the paper, as well as hyperparameter optimization. Confusion matrices for all experiments can also be found here. Note that dt, rf, and svm refer to the feature set from Nathan et al., 2012.

    Results used in Fig. 4 of arxiv paper (deep neural networks vs. classical models):
    {dataset}_harnet_nogyr, {dataset}_CRNN, {dataset}_CNN, {dataset}_dt, {dataset}_rf, {dataset}_svm, {dataset}_wavelet_dt, {dataset}_wavelet_rf, {dataset}_wavelet_svm

    Results used in Fig. 5D of arxiv paper (full data setting):
    • If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs): {dataset}_harnet_nogyr, {dataset}_harnet_random_nogyr, {dataset}_harnet_unfrozen_nogyr, {dataset}_RNN_nogyr, {dataset}_CRNN_nogyr, {dataset}_rf_nogyr
    • Otherwise: {dataset}_harnet_nogyr, {dataset}_harnet_unfrozen_nogyr, {dataset}_harnet_random_nogyr, {dataset}_RNN_nogyr, {dataset}_CRNN, {dataset}_rf

    Results used in Fig. 5E of arxiv paper (reduced data setting):
    • If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs): {dataset}_harnet_low_data_nogyr, {dataset}_harnet_random_low_data_nogyr, {dataset}_harnet_unfrozen_low_data_nogyr, {dataset}_RNN_low_data_nogyr, {dataset}_wavelet_RNN_low_data_nogyr, {dataset}_CRNN_low_data_nogyr, {dataset}_rf_low_data_nogyr
    • Otherwise: {dataset}_harnet_low_data_nogyr, {dataset}_harnet_random_low_data_nogyr, {dataset}_harnet_unfrozen_low_data_nogyr, {dataset}_RNN_low_data_nogyr, {dataset}_wavelet_RNN_low_data_nogyr, {dataset}_CRNN_low_data, {dataset}_rf_low_data

    CSV files: we also include summaries of the experimental results in experiments_summary.csv, experiments_by_fold_individual.csv, experiments_by_fold_behavior.csv.

    experiments_summary.csv - results averaged over individuals and behavior classes
    • dataset (str): name of dataset
    • experiment (str): name of model with experiment setting
    • fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
    • fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
    • fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
    • f1_mean (float): mean of macro-averaged F1 score, averaged over individuals in test folds
    • f1_std (float): standard deviation of macro-averaged F1 score, computed over individuals in test folds
    • prec_mean, prec_std (float): analogous for precision
    • rec_mean, rec_std (float): analogous for recall

    experiments_by_fold_individual.csv - results per individual in the test folds
    • dataset (str): name of dataset
    • experiment (str): name of model with experiment setting
    • fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
    • fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
    • fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
    • fold (int): test fold index
    • individual (int): individuals are numbered zero-indexed, starting from fold 1
    • f1 (float): macro-averaged f1 score for this individual
    • precision (float): macro-averaged precision for this individual
    • recall (float): macro-averaged recall for this individual

    experiments_by_fold_behavior.csv - results per behavior class, for each test fold
    • dataset (str): name of dataset
    • experiment (str): name of model with experiment setting
    • fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
    • fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
    • fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
    • fold (int): test fold index
    • behavior_class (str): name of behavior class
    • f1 (float): f1 score for this behavior, averaged over individuals in the test fold
    • precision (float): precision for this behavior, averaged over individuals in the test fold
    • recall (float): recall for this behavior, averaged over individuals in the test fold
    • train_ground_truth_label_counts (int): number of timepoints labeled with this behavior class, in the training set
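
    As an example of working with the summary columns listed above, the sketch below picks the highest-scoring Fig. 4 experiment per dataset. It is a minimal illustration under assumptions: the rows and scores are made up, only four of the documented columns are included, and booleans are assumed to be stored as the text "True"/"False".

```python
import csv
import io

# Made-up rows using column names documented for experiments_summary.csv.
SUMMARY = """dataset,experiment,fig4,f1_mean
HAR,CRNN,True,0.80
HAR,rf,True,0.70
HAR,harnet_low_data_nogyr,False,0.90
jeantet_turtles,CNN,True,0.55
jeantet_turtles,svm,True,0.60
"""

def best_fig4_model(summary_file):
    """Return {dataset: (experiment, f1_mean)} over rows flagged fig4."""
    best = {}
    for row in csv.DictReader(summary_file):
        if row["fig4"].strip().lower() != "true":
            continue  # skip experiments not used in figure 4
        score = float(row["f1_mean"])
        current = best.get(row["dataset"])
        if current is None or score > current[1]:
            best[row["dataset"]] = (row["experiment"], score)
    return best

best = best_fig4_model(io.StringIO(SUMMARY))
print(best)
```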

  11. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Available download formats: text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research-related work. The methods used to obtain the results are shown in the methodology. Here, the whole research implementation is done using Python. The steps involved in the research work are as follows:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators that are used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. This dataset was collected (2016-2018) through an interactive online personality test. The personality test was constructed from the IPIP. The personality prediction dataset can be downloaded in zip file format from the link available. The personality prediction file consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output. Also, the dataset has multivariate characteristics. Here, data preprocessing is done to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The dataset's features are numerical. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing datasets by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After splitting, the training data is used to train the Logistic Regression and SVM models, and the test data is then used to measure the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1  I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set and a testing set, with 80% for training and 20% for testing. You train the model using the training set. In this work, we trained on our dataset using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
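
    The 80/20 split can be sketched without any third-party package. This is a standard-library stand-in for illustration only: the description states that the split was done with scikit-learn, and the helper name, the seed, and the toy samples below are assumptions.

```python
import random


def split_80_20(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (features, label) pairs standing in for the questionnaire data.
samples = [([i, i + 1], i % 5) for i in range(100)]
train, test = split_80_20(samples)
print(len(train), len(test))
```

    Every sample ends up in exactly one of the two sets, so the model is never evaluated on rows it was trained on.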

    5. Personality Prediction Output

    After training, the Logistic Regression and SVM models are tested using cohen_kappa_score and the accuracy score.
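
    For readers unfamiliar with the two metrics, here is a small pure-Python sketch of accuracy and Cohen's kappa. It is not the authors' evaluation code (they name the scikit-learn scorers), and the example labels are invented; it only illustrates what the two numbers measure.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if labels were assigned independently
    # with the same marginal frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1


# Invented labels, for illustration only.
y_true = ["serious", "lively", "lively", "responsible", "serious", "lively"]
y_pred = ["serious", "lively", "serious", "responsible", "serious", "lively"]
acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(round(acc, 3), round(kappa, 3))
```

    Kappa is lower than accuracy here because part of the agreement would be expected by chance given how often each label occurs.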

  12. DFT Calculated xyz and log Files as well as csv Files for Machine Learning...

    • data.niaid.nih.gov
    Updated May 3, 2024
    Cite
    Huang, Tianbai; Geitner, Robert; Croy, Alexander; Gräfe, Stefanie (2024). DFT Calculated xyz and log Files as well as csv Files for Machine Learning in Support of "Tailoring Phosphine Ligands for Improved C-H Activation: Insights from Δ-Machine Learning" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10529636
    Explore at:
    Dataset updated
    May 3, 2024
    Dataset provided by
    Friedrich Schiller University Jena
    Friedrich-Schiller-Universität Jena
    Technische Universität Ilmenau
    Authors
    Huang, Tianbai; Geitner, Robert; Croy, Alexander; Gräfe, Stefanie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Transition metal complexes have played crucial roles in various homogeneous catalytic processes due to their exceptional versatility. This adaptability stems not only from the central metal ions but also from the vast array of choices of the ligand spheres, which form an enormously large chemical space. For example, Rh complexes, with a well-designed ligand sphere, are known to be efficient in catalyzing the C-H activation process in alkanes. To investigate the structure-property relation of the Rh complex and identify the optimal ligand that minimizes the calculated reaction energy ΔE of an alkane C-H activation, we have applied a Δ-Machine Learning method trained on various features to study 1,743 pairs of reactants (Rh(PLP)(Cl)(CO)) and intermediates (Rh(PLP)(Cl)(CO)(H)(propyl)). Our findings demonstrate that the models exhibit robust predictive performance when trained on features derived from electron density (R² = 0.816), and SOAPs (R² = 0.819), a set of position-based descriptors. Leveraging the model trained on xTB-SOAPs that only depend on the xTB-equilibrium structures, we propose an efficient and accurate screening procedure to explore the extensive chemical space of bisphosphine ligands. By applying this screening procedure, we identify ten newly selected reactant-intermediate pairs with an average ΔE of 33.2 kJ mol⁻¹, remarkably lower than the average ΔE of the original data set of 68.0 kJ mol⁻¹. This underscores the efficacy of our screening procedure in pinpointing structures with significantly lower energy levels.

    The dataset contains three file types:

    Version 1.0:

    • xyz files of the final optimized Rh-phosphine complexes; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"

    • Gaussian16 log files for the optimization process; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"

    • csv files containing the per molecule features used for training the different machine learning models. The name of the csv files indicates which property was predicted and which model was used

    New in version 1.1 (other data is unchanged):

    • Gaussian16 log files for the ten newly identified bisphosphine ligands; one set for the product material denoted as "LXX_6-times-axial" and one set for the transition state for the C-H activation denoted as "LXX_C-H-activation_TS"
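
    The naming scheme above suggests a simple way to pair each starting material with its intermediate. The sketch below assumes that "XXXX" stands for a per-molecule identifier and that files carry an extension such as .xyz; neither detail is stated in the description, and the example file names are hypothetical.

```python
import re
from collections import defaultdict

# "4-times" marks a starting material and "6-times" an intermediate,
# following the naming described above. The identifier format is an assumption.
NAME = re.compile(r"^molecule-(?P<id>[^_]+)_(?P<kind>4|6)-times$")


def pair_reactants_and_intermediates(file_names):
    """Group file names by molecule id and keep ids that have both roles."""
    found = defaultdict(dict)
    for name in file_names:
        stem = name.rsplit(".", 1)[0] if "." in name else name
        match = NAME.match(stem)
        if match:
            role = "reactant" if match["kind"] == "4" else "intermediate"
            found[match["id"]][role] = name
    return {mol_id: roles for mol_id, roles in found.items() if len(roles) == 2}


# Hypothetical file names for illustration.
names = [
    "molecule-0001_4-times.xyz",
    "molecule-0001_6-times.xyz",
    "molecule-0002_4-times.xyz",  # no matching intermediate in this toy list
    "notes.txt",
]
pairs = pair_reactants_and_intermediates(names)
print(pairs)
```

    Molecules missing either half of the pair are dropped, which makes incomplete downloads easy to spot.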

  13. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
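
    As an illustration of how the usage pairs relate to the per-project and per-model counts listed above, the sketch below derives both from (project, model) rows. The column names `project` and `model` and all sample rows are assumptions; the actual CSV headers are not given in this description.

```python
import csv
import io
from collections import defaultdict

# Hypothetical usage pairs; the real header names may differ.
SAMPLE = """project,model
org/app-a,bert-base-uncased
org/app-a,gpt2
org/app-a,gpt2
org/app-b,gpt2
"""


def count_distinct(csv_text, group_by, counted):
    """Count distinct values of `counted` for each value of `group_by`."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen[row[group_by]].add(row[counted])
    return {key: len(values) for key, values in seen.items()}


models_per_project = count_distinct(SAMPLE, "project", "model")
projects_per_model = count_distinct(SAMPLE, "model", "project")
print(models_per_project)
print(projects_per_model)
```

    Using sets means a model that appears twice for the same project is counted once.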

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect the output to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  14. Table1_Improving prediction of tacrolimus concentration using a combination...

    • frontiersin.figshare.com
    application/csv
    Updated May 9, 2024
    Cite
    Yu-Ping Wang; Xiao-Ling Lu; Kun Shao; Hao-Qiang Shi; Pei-Jun Zhou; Bing Chen (2024). Table1_Improving prediction of tacrolimus concentration using a combination of population pharmacokinetic modeling and machine learning in chinese renal transplant recipients.csv [Dataset]. http://doi.org/10.3389/fphar.2024.1389271.s002
    Explore at:
    Available download formats: application/csv
    Dataset updated
    May 9, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Yu-Ping Wang; Xiao-Ling Lu; Kun Shao; Hao-Qiang Shi; Pei-Jun Zhou; Bing Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aims: The population pharmacokinetic (PPK) model-based machine learning (ML) approach offers a novel perspective on individual concentration prediction. This study aimed to establish a PPK-based ML model for predicting tacrolimus (TAC) concentrations in Chinese renal transplant recipients.

    Methods: Conventional TAC monitoring data from 127 Chinese renal transplant patients were divided into training (80%) and testing (20%) datasets. A PPK model was developed using the training group data. ML models were then established based on individual pharmacokinetic data derived from the PPK basic model. The prediction performances of the PPK-based ML model and Bayesian forecasting approach were compared using data from the test group.

    Results: The final PPK model, incorporating hematocrit and CYP3A5 genotypes as covariates, was successfully established. Individual predictions of TAC using the PPK basic model, postoperative date, CYP3A5 genotype, and hematocrit showed improved rankings in ML model construction. XGBoost, based on the TAC PPK, exhibited the best prediction performance.

    Conclusion: The PPK-based machine learning approach emerges as a superior option for predicting TAC concentrations in Chinese renal transplant recipients.

  15. Dataset to Train Intrusion Detection Systems based on Machine Learning...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Feb 13, 2025
    Cite
    Gutierrez Mlot, Esteban Damian; Saldana, Jose; Rodríguez, Ricardo J.; Kotsiuba, Igor; Hernández Gañán, Carlos (2025). Dataset to Train Intrusion Detection Systems based on Machine Learning Models for Electrical Substations [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13898982
    Explore at:
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Durham University
    CIRCE - Centro Tecnológico
    Delft University of Technology
    CIRCE Centro Tecnológico
    Universidad de Zaragoza
    Authors
    Gutierrez Mlot, Esteban Damian; Saldana, Jose; Rodríguez, Ricardo J.; Kotsiuba, Igor; Hernández Gañán, Carlos
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    DATASET

    This dataset is part of the research work titled "A dataset to train intrusion detection systems based on machine learning models for electrical substations". The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.

    The following sections list the content of the dataset generated.

    Data

    raw
      iec6180
        attack-free-data
          capture61850-attackfree.pcap (from real substation)
          capture61850-attackfree_PTP.pcap
          capture61850-attackfree_normalfault.pcap
        attack-data
          capture61850-floodattack_withfault.pcap
          capture61850-floodattack_withoutfault.pcap
          capture61850-fuzzyattack_withfault.pcap
          capture61850-fuzzyattack_withoutfault.pcap
          capture61850-replay.pcap
          capture61850-ptpattack.pcap
      iec104
        attack-free-data
          capture104-attackfree.pcap (from real substation)
        attack-data
          capture104-dosattack.pcap
          capture104-floodattack.pcap
          capture104-fuzzyattack.pcap
          capture104-iec104starvationattack.pcap
          capture104-mitmattack.pcap
          capture104-ntpddosattack.pcap
          capture104-portscanattack.pcap

    processed
      iec6180
        attack-free-data
          capture61850-attackfree.csv
          capture61850-attackfree_PTP.csv
          capture61850-attackfree_normalfault.csv
        attack-data
          capture61850-floodattack_withfault.csv
          capture61850-floodattack_withoutfault.csv
          capture61850-fuzzyattack_withfault.csv
          capture61850-fuzzyattack_withoutfault.csv
          capture61850-replay.csv
          capture61850-ptpattack.csv
        headers_iec61850[all].txt
      iec104
        attack-free-data
          capture104-attackfree.csv
        attack-data
          capture104-dosattack.csv
          capture104-floodattack.csv
          capture104-fuzzyattack.csv
          capture104-iec104starvationattack.csv
          capture104-mitmattack.csv
          capture104-ntpddosattack.csv
          capture104-portscanattack.csv
        headers_iec104[all].txt

    Description

    file type: either capture61850 or capture104, depending on whether the file contains network captures of the IEC61850 or IEC104 protocol.

    attack: either attackfree or the attack name, appended to the file name.

    function: optional; gives details about the functionality captured (normalfault) or a specific protocol capture (PTP).

    file extension: either PCAP (network capture) or CSV (flow file).
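
    The naming convention can be turned into a small parser. This is only a sketch based on the file names listed in this description; the regular expression and the returned field names are assumptions, and suffixes such as withfault or withoutfault are treated here as the optional "function" part.

```python
import re

# Pattern inferred from the listed file names, e.g.
# capture61850-attackfree_PTP.pcap or capture104-dosattack.csv.
FILE_NAME = re.compile(
    r"^capture(?P<protocol>61850|104)"
    r"-(?P<attack>[A-Za-z0-9]+)"
    r"(?:_(?P<function>[A-Za-z0-9]+))?"
    r"\.(?P<extension>pcap|csv)$"
)


def parse_capture_name(file_name):
    """Split a capture file name into protocol, attack, function, and kind."""
    match = FILE_NAME.match(file_name)
    if match is None:
        return None  # e.g. header files, which follow a different scheme
    return {
        "protocol": "IEC" + match["protocol"],
        # None means the capture is attack free.
        "attack": None if match["attack"] == "attackfree" else match["attack"],
        "function": match["function"],
        "kind": "network capture" if match["extension"] == "pcap" else "flow file",
    }


print(parse_capture_name("capture61850-attackfree_PTP.pcap"))
print(parse_capture_name("capture104-iec104starvationattack.csv"))
```

    Returning None for names that do not match keeps files such as headers_iec104[all].txt out of any per-capture processing.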

    Results

    results
      test1-iec104
        model-test1-iec104.pkl
        test1-iec104.log
      test1-iec61850
        model-test1-iec61850.pkl
        test1-iec61850.log
      test2-iec61850
        model-test2-iec61850.pkl
        test2-iec61850.log

    Description

    The outcomes of different test executions are available as follows:

    test1-iec104: IEC 104 protocol, all attacks and the attack-free scenario

    test1-iec61850: IEC 61850 protocol, fuzzy attack with fault injection and the attack-free scenario

    test2-iec61850: IEC 61850 protocol, fuzzy attack under normal operation and the attack-free scenario

    Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).

    Source Code

    Tools to process network captures from IEC61850 and IEC104 can be found in the accompanying GitHub repository.

  16. Yelp Reviews Sentiment Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). Yelp Reviews Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/yelp-reviews-sentiment-dataset/code
    Explore at:
    Available download formats: zip (169587518 bytes)
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Yelp Reviews Sentiment Dataset

    A Challenge for Natural Language Processing

    By Huggingface Hub [source]

    About this dataset

    The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is well suited to natural language processing tasks such as sentiment analysis.

    How to use the dataset

    This YELP reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.

    To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel). Next, take a look at the first few rows of data to get a feel for what you're working with:

    | text | label |
    |:-----|:------|
    | So there is no way for me to plug it in here in the US unless I go by... | 0 |
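
    Instead of a spreadsheet, the same first look can be taken programmatically. The sketch below parses a two-row in-memory sample with the `text` and `label` columns named in this description; the second review is invented for illustration, and reading the real train.csv would presumably mean passing an open file handle instead.

```python
import csv
import io
from collections import Counter

# First row echoes the truncated example above; the second row is hypothetical.
SAMPLE = '''text,label
"So there is no way for me to plug it in here in the US unless I go by...",0
"Hypothetical second review, added only for illustration.",1
'''


def load_reviews(handle):
    """Return (text, label) pairs, with the label converted to an int."""
    return [(row["text"], int(row["label"])) for row in csv.DictReader(handle)]


reviews = load_reviews(io.StringIO(SAMPLE))
label_counts = Counter(label for _, label in reviews)
print(len(reviews), dict(label_counts))
```

    Checking the label counts early shows whether the positive and negative classes are balanced before any model is trained.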

    Research Ideas

    • This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
    • This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
    • This dataset could be used to build a natural language processing system that generates fake Yelp reviews.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | text | The text of the review. (string) |
    | label | The label of the review. (string) |

    File: test.csv

    | Column name | Description |
    |:------------|:------------|
    | text | The text of the review. (string) |
    | label | The label of the review. (string) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  17. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you can use an environment such as VS Code or Jupyter, together with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
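
    Because the three splits are described as sharing one tabular format, a quick sanity check is to confirm that their headers match before training. The sketch below uses only the standard library; the file names come from the description, while the column names and values are invented placeholders.

```python
import csv
import io


def read_header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def splits_share_columns(files):
    """True when every split has exactly the same columns in the same order."""
    headers = [read_header(text) for text in files.values()]
    return all(header == headers[0] for header in headers)


# Placeholder contents; the real column names are not given in the description.
files = {
    "train_data.csv": "age,blood_pressure,label\n54,130,1\n",
    "validation_data.csv": "age,blood_pressure,label\n61,145,0\n",
    "test_data.csv": "age,blood_pressure,label\n47,120,0\n",
}
consistent = splits_share_columns(files)
print(consistent)
```

    A mismatch here usually points to a preprocessing step that was applied to one split but not the others.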

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  18. Store Sales - T.S Forecasting...Merged Dataset

    • kaggle.com
    zip
    Updated Dec 15, 2021
    Cite
    Shramana Bhattacharya (2021). Store Sales - T.S Forecasting...Merged Dataset [Dataset]. https://www.kaggle.com/shramanabhattacharya/store-sales-ts-forecastingmerged-dataset
    Explore at:
    Available download formats: zip (2847585 bytes)
    Dataset updated
    Dec 15, 2021
    Authors
    Shramana Bhattacharya
    Description

    This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets that were provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. According to my understanding, through the EDA of the merged dataset, we will be able to get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for the further scope of analysis.

    ##### Data Description

    Data Field Information (This is a copy of the description as provided in the actual dataset)

    Train.csv
    - id: store id
    - date: date of the sale
    - store_nbr: identifies the store at which the products are sold.
    - family: identifies the type of product sold.
    - sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
    - onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
    - Store metadata, including city, state, type, and cluster.
    - cluster is a grouping of similar stores.
    - Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
    - dcoilwtico: Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

    Note: There is a transaction column in the training dataset which displays the sales transactions on that particular date.

    Test.csv
    - The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
    - The dates in the test data are for the 15 days after the last date in the training data.

    Note: There is no transaction column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and may use it only for EDA.

    submission.csv
    - A sample submission file in the correct format.
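
    The rule for transferred holidays is easy to get wrong, so here is a small sketch of one way to derive the days that actually behave like days off. The field names `date`, `type`, `description`, and `transferred` are assumed for illustration, and all rows except the Independencia de Guayaquil pair are hypothetical.

```python
def days_off(rows):
    """Return dates that behave like holidays under the rules described above."""
    dates = []
    for row in rows:
        if row["transferred"]:
            continue  # moved to another date: treated like a normal day
        if row["type"] == "Work Day":
            continue  # make-up working day that pays back a Bridge
        dates.append(row["date"])
    return dates


rows = [
    # From the description: officially 2012-10-09, celebrated on 2012-10-12.
    {"date": "2012-10-09", "type": "Holiday",
     "description": "Independencia de Guayaquil", "transferred": True},
    {"date": "2012-10-12", "type": "Transfer",
     "description": "Independencia de Guayaquil", "transferred": False},
    # Hypothetical rows, added only to exercise the Bridge and Work Day rules.
    {"date": "2012-12-24", "type": "Bridge",
     "description": "Hypothetical bridge day", "transferred": False},
    {"date": "2013-01-05", "type": "Work Day",
     "description": "Hypothetical make-up day", "transferred": False},
]
celebrated = days_off(rows)
print(celebrated)
```

    A flag derived this way could serve as a holiday feature during EDA, in place of the raw calendar date of the holiday.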

  19. Datasets used to train and test prediction model to predict scores in terms...

    • data.mendeley.com
    Updated Mar 5, 2025
    Cite
    Jarosław Wątróbski (2025). Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization [Dataset]. http://doi.org/10.17632/6c8fm7s4y2.1
    Explore at:
    Dataset updated
    Mar 5, 2025
    Authors
    Jarosław Wątróbski
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The datasets used in this research work refer to the aims of Sustainable Development Goal 7. They were used to train and test a machine learning model based on an artificial neural network, as well as other machine learning regression models, to predict scores for the realization of SDG 7 aims. The training dataset was created from data for 2013 to 2021 and includes 261 samples. The test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The training and test datasets are available in XLSX and CSV files. A detailed description of the data is available in a PDF file.

  20. Models, data, and scripts associated with “Prediction of Distributed River...

    • osti.gov
    • dataone.org
    • +1more
    Updated Feb 22, 2024
    Cite
    U.S. DOE > Office of Science > Biological and Environmental Research (BER) (2024). Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” [Dataset]. http://doi.org/10.15485/2318723
    Explore at:
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    U.S. DOE > Office of Science > Biological and Environmental Research (BER)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Description

    This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor-intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results for reproducibility of the results in the associated manuscript and for community reuse of the ML models trained in this project.

    One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

    This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma separated value (csv) files in this data package and descriptions for each.

    The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ from branch to branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.

    There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
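The feature permutation importance idea mentioned in the description can be illustrated with a minimal sketch. This is not the code from the sl-archive-whondrs repository: the function names, the toy model, and the synthetic data below are all assumptions made for illustration. The sketch shuffles one input column at a time and records how much the prediction error grows; a larger increase suggests a more important feature.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Return, per feature, the mean increase in error after shuffling it.

    `predict` is any callable mapping one row (a list of floats) to a float.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mean_squared_error(targets, [predict(r) for r in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the target depends on feature 0 only.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]


def toy_model(row):
    """Stand-in for a trained model; it ignores feature 1 entirely."""
    return 3.0 * row[0]


importances = permutation_importance(toy_model, rows, targets)
```

In this toy setup, shuffling feature 0 degrades the predictions while shuffling feature 1 changes nothing, so an iterative selection loop would keep feature 0 and could drop feature 1.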

Aman Chauhan (2022). Titanic Dataset - Machine Learning from Disaster [Dataset]. https://www.kaggle.com/datasets/whenamancodes/titanic-dataset-machine-learning-from-disaster

Titanic Dataset - Machine Learning from Disaster

Predict survival on the Titanic and get familiar with ML basics

Explore at:
Available download formats: zip (34877 bytes)
Dataset updated
Sep 20, 2022
Authors
Aman Chauhan
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
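The rule behind gender_submission.csv (predict survival for all and only female passengers) can be sketched as follows. This is a minimal illustration, not the code that produced the file: the column names `PassengerId`, `Sex`, and `Survived` are assumptions about the CSV headers, and the three sample rows are made up so that the example runs without downloading anything.

```python
import csv
import io

# Hypothetical stand-in for a few rows of test.csv (values are illustrative).
SAMPLE_TEST_CSV = """PassengerId,Sex
892,male
893,female
894,male
"""


def gender_baseline(test_rows):
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"] == "female" else 0,
        }
        for row in test_rows
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_TEST_CSV)))
predictions = gender_baseline(rows)

# Write the predictions in a two-column submission layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["PassengerId", "Survived"])
writer.writeheader()
writer.writerows(predictions)
submission_text = buffer.getvalue()
```

To use this on the real data, replace `io.StringIO(SAMPLE_TEST_CSV)` with an open file handle for test.csv and write `submission_text` to a file.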

Data Dictionary:

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.
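The variable notes above lend themselves to simple engineered features. The sketch below is illustrative only: the helper names `is_estimated_age` and `family_size` are hypothetical, and counting the passenger themselves in the family size is a design assumption, not something the dataset prescribes.

```python
def is_estimated_age(age):
    """Return True if an age looks estimated (form xx.5 and at least 1).

    Ages below 1 are fractional by definition, so they are not flagged.
    Missing ages (None) are not flagged either.
    """
    return age is not None and age >= 1 and age % 1 == 0.5


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children aboard, plus the passenger."""
    return sibsp + parch + 1


# A child travelling only with a nanny has sibsp = 0 and parch = 0.
nanny_child_family = family_size(0, 0)
couple_with_two_children = family_size(1, 2)
```

A flag such as `is_estimated_age` could be added as its own column so that a model can treat estimated ages differently from recorded ones.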

