License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
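Where a quick baseline helps, here is a minimal sketch using the variables above; it assumes the standard Kaggle file names (train.csv, test.csv) and column spellings (Pclass, Sex, SibSp, Parch, Survived, PassengerId).

```python
# A minimal baseline sketch, assuming the standard Kaggle file names and
# column spellings (Pclass, Sex, SibSp, Parch, Survived, PassengerId).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])       # one-hot encodes Sex
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, train["Survived"])

pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": model.predict(X_test)}).to_csv("submission.csv", index=False)
```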
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, the Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page; estimates and statistics for WSAs are available in public_water_use_model.zip.

This page includes the following files:
- PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public-supply groundwater use for 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public-supply surface water use for 2000-2020 by HUC12, in million gallons per day
- STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply total water use from 2000-2020
- STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply groundwater use for 2000-2020
- STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply surface water use for 2000-2020
- public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public-supply water use machine learning model
- version_history_MLmodel.txt - a txt file describing changes in this version

Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
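As a quick sanity check, one way to load the HUC12 estimates with pandas (a sketch; only the HUC12 identifier column is assumed, so inspect the header for the actual layout):

```python
import pandas as pd

# Keep HUC12 codes as strings so leading zeros are not lost; all other
# column names should be checked against the actual file header.
tot = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"HUC12": str})
print(tot.shape)
print(tot.head())
```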
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.
Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.
Files
- train.csv: transactions with the Target column (0 = Clean, 1 = Fraud; ~1% fraud).
- test_share.csv: same structure as train.csv but without the Target column.
- Geo_scores.csv: location-based scores (match on id).
- Lambda_wts.csv: proprietary weights for grouping (match on Group).
- Qset_tats.csv: network turn-around times (match on id).
- instance_scores.csv: vulnerability scores (match on id).

Suggested workflow: merge the feature files (Geo_scores.csv, Lambda_wts.csv, etc.) into train.csv by matching on id or Group, train and validate on train.csv, then predict on test_share.csv or your own external data. Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
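As a sketch of the suggested workflow (file and column names follow the description above, but verify them against the actual headers; the merge keys and class_weight choice are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
geo = pd.read_csv("Geo_scores.csv")
lam = pd.read_csv("Lambda_wts.csv")

# Left-join the extra feature files onto the training data
df = train.merge(geo, on="id", how="left").merge(lam, on="Group", how="left")

X = df.drop(columns=["id", "Group", "Target"], errors="ignore").fillna(0)
y = df["Target"]

# class_weight="balanced" is one simple way to handle the ~1% fraud rate
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
```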
Tags: fraud-detection, classification, imbalanced-data, financial-transactions, machine-learning, python, beginner-friendly
License: CC BY-NC-SA 4.0
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744
Model names differ from those in the paper, but the csv files for each set of experiments relate the name used in the paper to the actual name of the model here:
This data package includes data and scripts from the manuscript “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification”. The study addressed common challenges in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scale meteorological and soil sensor observations were perturbed with noise and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration (ET). This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

This package includes:
(1) Four ipython scripts (.ipynb): “DAE_train.ipynb” trains and evaluates DAE neural networks, “DAE_predict.ipynb” makes predictions from the trained DAE models, “ET_train.ipynb” trains and evaluates ET prediction neural networks, and “ET_predict.ipynb” makes predictions from trained ET models.
(2) One python file (.py): “methods.py” includes all user-defined functions and python code used in the ipython scripts.
(3) A “sub_models” folder that includes five trained DAE neural networks (in pytorch format, .pt), which can be used to ingest input data before it is fed to the downstream ET models in “ET_train.ipynb” or “ET_predict.ipynb”.
(4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in “df_data.csv”, and “df_meta.csv” contains the location and time information for “df_data.csv”. Each row (index) in “df_meta.csv” corresponds to the same row in “df_data.csv”. These data files are formatted to follow the data structure requirements of the ipython scripts and can be used in them directly; they have been shuffled chronologically to train machine learning models.

The meteorological and soil data was collected using point sensors between 2019 and 2023 at (4.a) three shrub-dominated field sites in East River, Colorado (named “ph1”, “ph2” and “sg5” in “df_meta.csv”, where “ph1” and “ph2” were located at PumpHouse Hillslopes, and “sg5” was at Snodgrass Mountain meadow) and (4.b) one outdoor, mesoscale, herbaceous-dominated experiment in Berkeley, California (named “tb” in “df_meta.csv”, short for the Smartsoils Testbed at Lawrence Berkeley National Lab).

- See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions, and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions.
- All ipython scripts and python files are written in and require PYTHON language software.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host can lock access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2,500 hours of 'not infected' traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). The paper also validates the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.
This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each in a separate folder.
The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
Each folder (for example N10S10/) contains:
- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2, or 3).
- N10S10.csv -> All samples used for training each model in this folder, in CSV format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder, in CSV format for use in the BigML application.
- userSamples_test -> All samples used for validating each model in this folder, in CSV format for use in the BigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard scaler from the Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.
In the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3, and NFS traffic traces. It contains more binaries than presented in the article because some of them are not "unseen" binaries (their families are present in the training set).
The files containing samples (NxSy.csv, zeroDays.csv, and userSamples_test.csv) are structured as follows:
- Each line is one sample.
- Each sample has 3*T features plus the label (1 for an 'infected' sample, 0 otherwise).
- The features are separated by ',' because it is a CSV file.
- The last column is the label of the sample.
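A minimal loading sketch under the layout just described (no header row is assumed):

```python
import pandas as pd

# No header row is assumed; the last column is the label.
df = pd.read_csv("N10S10/N10S10.csv", header=None)
X = df.iloc[:, :-1]   # 3*T features per sample
y = df.iloc[:, -1]    # 1 = 'infected', 0 = 'not infected'
print(X.shape, y.value_counts())
```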
Additionally, we have placed two pcap files in the root directory. These are the traces used to compare the two SMB versions.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Context The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
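A sketch of wiring the manifest into a PyTorch Dataset (paths and transforms are placeholders to adapt):

```python
# Sketch of using the manifest files with a PyTorch Dataset; paths and
# transforms are assumptions to adapt to your setup.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Caltech256Manifest(Dataset):
    def __init__(self, csv_path, transform=None):
        self.df = pd.read_csv(csv_path)   # columns: image_path, label
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["image_path"]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img, row["label"]

train_ds = Caltech256Manifest("train.csv")
```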
Data Split The dataset has been partitioned into a standard 80% training and 20% testing split. The split should be assumed to be stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion across the respective sets.
Acknowledgements & Original Source This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
At a time when machine learning and deep learning are booming, it is important to understand that all this knowledge is of little use if we cannot apply it to different areas and benefit humanity.
This dataset will help you put your existing knowledge to great use. Its main purpose is to apply that knowledge to the field of medical science and ease the work of physicians. The dataset has 132 parameters from which 42 different types of diseases can be predicted.
All the best !
The complete dataset consists of two CSV files: one for training and one for testing your model.
Each CSV file has 133 columns: 132 of these columns are symptoms that a person experiences, and the last column is the prognosis.
These symptoms map to the 42 diseases to which you can classify a given set of symptoms.
You are required to train your model on the training data and test it on the testing data.
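A minimal sketch of that workflow (the file names Training.csv/Testing.csv and the prognosis column name are assumptions; check your download):

```python
# Minimal sketch, assuming 132 symptom columns plus a final 'prognosis'
# column in both files; file names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("Training.csv")
test = pd.read_csv("Testing.csv")

X_train, y_train = train.drop(columns=["prognosis"]), train["prognosis"]
X_test, y_test = test.drop(columns=["prognosis"]), test["prognosis"]

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```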
Just make your best effort to make the world a better place by applying all the knowledge you have to different fields.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset will help you put your existing knowledge to great use. It has 132 parameters from which 42 different types of diseases can be predicted. The dataset consists of two CSV files: one for training and the other for testing your model. Each CSV file has 133 columns: 132 are symptoms that a person experiences, and the last column is the prognosis. These symptoms map to the 42 diseases to which you can classify a given set of symptoms. You are required to train your model on the training data and test it on the testing data.
Category: Machine Learning
Tags: medicine, disease, healthcare, ML, machine learning
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and experiment results presented in our arxiv paper:
B. Hoffman, M. Cusimano, V. Baglione, D. Canestrari, D. Chevallier, D. DeSantis, L. Jeantet, M. Ladds, T. Maekawa, V. Mata-Silva, V. Moreno-González, A. Pagano, E. Trapote, O. Vainio, A. Vehkaoja, K. Yoda, K. Zacarian, A. Friedlaender, "A benchmark for computational analysis of animal behavior, using animal-borne tags," 2023.
Standardized code to implement, train, and evaluate models can be found at https://github.com/earthspecies/BEBE/.
Please note the licenses in each dataset folder.
Zip folders beginning with "formatted": These are the datasets we used to run the experiments reported in the benchmark paper.
Zip folders beginning with "raw": These are the unprocessed datasets used in BEBE. Code to process these raw datasets into the formatted ones used by BEBE can be found at https://github.com/earthspecies/BEBE-datasets/.
Zip folders beginning with "experiments": Results of the cross-validation experiments reported in the paper, as well as hyperparameter optimization. Confusion matrices for all experiments can also be found here. Note that dt, rf, and svm refer to the feature set from Nathan et al., 2012.
Results used in Fig. 4 of the arxiv paper (deep neural networks vs. classical models):
- {dataset}_harnet_nogyr
- {dataset}_CRNN
- {dataset}_CNN
- {dataset}_dt
- {dataset}_rf
- {dataset}_svm
- {dataset}_wavelet_dt
- {dataset}_wavelet_rf
- {dataset}_wavelet_svm
Results used in Fig. 5D of the arxiv paper (full data setting). If the dataset contains gyroscope data (HAR, jeantet_turtles, vehkaoja_dogs):
- {dataset}_harnet_nogyr
- {dataset}_harnet_random_nogyr
- {dataset}_harnet_unfrozen_nogyr
- {dataset}_RNN_nogyr
- {dataset}_CRNN_nogyr
- {dataset}_rf_nogyr
Otherwise:
- {dataset}_harnet_nogyr
- {dataset}_harnet_unfrozen_nogyr
- {dataset}_harnet_random_nogyr
- {dataset}_RNN_nogyr
- {dataset}_CRNN
- {dataset}_rf
Results used in Fig. 5E of the arxiv paper (reduced data setting). If the dataset contains gyroscope data (HAR, jeantet_turtles, vehkaoja_dogs):
- {dataset}_harnet_low_data_nogyr
- {dataset}_harnet_random_low_data_nogyr
- {dataset}_harnet_unfrozen_low_data_nogyr
- {dataset}_RNN_low_data_nogyr
- {dataset}_wavelet_RNN_low_data_nogyr
- {dataset}_CRNN_low_data_nogyr
- {dataset}_rf_low_data_nogyr
Otherwise:
- {dataset}_harnet_low_data_nogyr
- {dataset}_harnet_random_low_data_nogyr
- {dataset}_harnet_unfrozen_low_data_nogyr
- {dataset}_RNN_low_data_nogyr
- {dataset}_wavelet_RNN_low_data_nogyr
- {dataset}_CRNN_low_data
- {dataset}_rf_low_data
CSV files: we also include summaries of the experimental results in experiments_summary.csv, experiments_by_fold_individual.csv, and experiments_by_fold_behavior.csv.

experiments_summary.csv - results averaged over individuals and behavior classes:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- f1_mean (float): mean of macro-averaged F1 score, averaged over individuals in test folds
- f1_std (float): standard deviation of macro-averaged F1 score, computed over individuals in test folds
- prec_mean, prec_std (float): analogous for precision
- rec_mean, rec_std (float): analogous for recall

experiments_by_fold_individual.csv - results per individual in the test folds:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- fold (int): test fold index
- individual (int): individuals are numbered zero-indexed, starting from fold 1
- f1 (float): macro-averaged F1 score for this individual
- precision (float): macro-averaged precision for this individual
- recall (float): macro-averaged recall for this individual

experiments_by_fold_behavior.csv - results per behavior class, for each test fold:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- fold (int): test fold index
- behavior_class (str): name of behavior class
- f1 (float): F1 score for this behavior, averaged over individuals in the test fold
- precision (float): precision for this behavior, averaged over individuals in the test fold
- recall (float): recall for this behavior, averaged over individuals in the test fold
- train_ground_truth_label_counts (int): number of timepoints labeled with this behavior class in the training set
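For example, a short sketch pulling the Fig. 4 rows out of the summary file, using the columns documented above:

```python
import pandas as pd

summary = pd.read_csv("experiments_summary.csv")
fig4 = summary[summary["fig4"]]  # rows used in Fig. 4
cols = ["dataset", "experiment", "f1_mean", "f1_std"]
print(fig4.sort_values("f1_mean", ascending=False)[cols])
```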
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it documents the methods used to obtain the results. Here, the entire implementation is done in Python. The steps involved are as follows:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators that are used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file simply by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output; the dataset has multivariate characteristics. Data preprocessing is done to check for inconsistent behaviors or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.
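A sketch of that split step (scikit-learn assumed; the label column name is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")           # file name per the dataset description
X = df.drop(columns=["Personality"])    # label column name is a placeholder
y = df["Personality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```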
3. Feature Extraction
The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1 = Disagree, 3 = Neutral, 5 = Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set and a testing set, with 80% for training and 20% for testing. You train the model using the training set. In this work, we trained on our dataset using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated using cohen_kappa_score and accuracy_score.
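A sketch of the training and evaluation steps described in steps 4 and 5, reusing the split from the earlier sketch:

```python
from sklearn import linear_model, svm
from sklearn.metrics import accuracy_score, cohen_kappa_score

# X_train, X_test, y_train, y_test come from the train_test_split sketch above
for clf in (linear_model.LogisticRegression(max_iter=1000), svm.SVC()):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(type(clf).__name__,
          "accuracy:", accuracy_score(y_test, pred),
          "kappa:", cohen_kappa_score(y_test, pred))
```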
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transition metal complexes have played crucial roles in various homogeneous catalytic processes due to their exceptional versatility. This adaptability stems not only from the central metal ions but also from the vast array of choices of the ligand spheres, which form an enormously large chemical space. For example, Rh complexes, with a well-designed ligand sphere, are known to be efficient in catalyzing the C-H activation process in alkanes. To investigate the structure-property relation of the Rh complex and identify the optimal ligand that minimizes the calculated reaction energy ΔE of an alkane C-H activation, we have applied a Δ-Machine Learning method trained on various features to study 1,743 pairs of reactants (Rh(PLP)(Cl)(CO)) and intermediates (Rh(PLP)(Cl)(CO)(H)(propyl)). Our findings demonstrate that the models exhibit robust predictive performance when trained on features derived from electron density (R2 = 0.816), and SOAPs (R2 = 0.819), a set of position-based descriptors. Leveraging the model trained on xTB-SOAPs that only depend on the xTB-equilibrium structures, we propose an efficient and accurate screening procedure to explore the extensive chemical space of bisphosphine ligands. By applying this screening procedure, we identify ten newly selected reactant-intermediate pairs with an average ΔE of 33.2 kJ mol-1, remarkably lower than the average ΔE of the original data set of 68.0 kJ mol-1. This underscores the efficacy of our screening procedure in pinpointing structures with significantly lower energy levels.
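As a hedged illustration of the screening idea (not the authors' code), a regressor fit on per-molecule feature vectors to predict ΔE; file and column names are hypothetical:

```python
# Illustrative sketch: fit a kernel regressor on per-molecule feature
# vectors (e.g., SOAP descriptors) to predict the reaction energy dE.
# "features.csv" and the "delta_E" column are hypothetical names.
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("features.csv")
X = df.drop(columns=["delta_E"]).values
y = df["delta_E"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X_tr, y_tr)
print("R2:", r2_score(y_te, model.predict(X_te)))
```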
The dataset contains three file types:
Version 1.0:
xyz files of the final optimized Rh-phosphine complexes; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"
Gaussian16 log files for the optimization process; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"
csv files containing the per-molecule features used for training the different machine learning models. The names of the csv files indicate which property was predicted and which model was used
New in version 1.1 (other data is unchanged):
Gaussian16 log files for the ten newly identified bisphosphine ligands; one set for the product material denoted as "LXX_6-times-axial" and one set for the transition state for the C-H activation denoted as "LXX_C-H-activation_TS"
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aims: The population pharmacokinetic (PPK) model-based machine learning (ML) approach offers a novel perspective on individual concentration prediction. This study aimed to establish a PPK-based ML model for predicting tacrolimus (TAC) concentrations in Chinese renal transplant recipients.

Methods: Conventional TAC monitoring data from 127 Chinese renal transplant patients were divided into training (80%) and testing (20%) datasets. A PPK model was developed using the training group data. ML models were then established based on individual pharmacokinetic data derived from the PPK basic model. The prediction performances of the PPK-based ML model and the Bayesian forecasting approach were compared using data from the test group.

Results: The final PPK model, incorporating hematocrit and CYP3A5 genotypes as covariates, was successfully established. Individual predictions of TAC using the PPK basic model, postoperative date, CYP3A5 genotype, and hematocrit showed improved rankings in ML model construction. XGBoost, based on the TAC PPK, exhibited the best prediction performance.

Conclusion: The PPK-based machine learning approach emerges as a superior option for predicting TAC concentrations in Chinese renal transplant recipients.
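As an illustration only (not the study's code), a sketch of an XGBoost regressor on PPK-derived predictors like those named above; file and column names are hypothetical:

```python
# Illustrative sketch of an XGBoost regressor on PPK-derived features;
# "tac_ppk_features.csv" and all column names are hypothetical.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("tac_ppk_features.csv")
X = df[["ppk_pred", "postop_day", "cyp3a5", "hematocrit"]]
y = df["tac_conc"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```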
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
DATASET
This dataset is part of the research work titled "A dataset to train intrusion detection systems based on machine learning models for electrical substations". The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.
The following sections list the content of the dataset generated.
Data
  raw
    iec61850
      attack-free-data
        capture61850-attackfree.pcap (from real substation)
        capture61850-attackfree_PTP.pcap
        capture61850-attackfree_normalfault.pcap
      attack-data
        capture61850-floodattack_withfault.pcap
        capture61850-floodattack_withoutfault.pcap
        capture61850-fuzzyattack_withfault.pcap
        capture61850-fuzzyattack_withoutfault.pcap
        capture61850-replay.pcap
        capture61850-ptpattack.pcap
    iec104
      attack-free-data
        capture104-attackfree.pcap (from real substation)
      attack-data
        capture104-dosattack.pcap
        capture104-floodattack.pcap
        capture104-fuzzyattack.pcap
        capture104-iec104starvationattack.pcap
        capture104-mitmattack.pcap
        capture104-ntpddosattack.pcap
        capture104-portscanattack.pcap
  processed
    iec61850
      attack-free-data
        capture61850-attackfree.csv
        capture61850-attackfree_PTP.csv
        capture61850-attackfree_normalfault.csv
      attack-data
        capture61850-floodattack_withfault.csv
        capture61850-floodattack_withoutfault.csv
        capture61850-fuzzyattack_withfault.csv
        capture61850-fuzzyattack_withoutfault.csv
        capture61850-replay.csv
        capture61850-ptpattack.csv
      headers_iec61850[all].txt
    iec104
      attack-free-data
        capture104-attackfree.csv
      attack-data
        capture104-dosattack.csv
        capture104-floodattack.csv
        capture104-fuzzyattack.csv
        capture104-iec104starvationattack.csv
        capture104-mitmattack.csv
        capture104-ntpddosattack.csv
        capture104-portscanattack.csv
      headers_iec104[all].txt
Description
The file names encode the following:
- file type: capture61850 or capture104, depending on whether the file contains network captures of the IEC 61850 or IEC 104 protocol.
- attack: attack-free (attackfree) or the attack name, appended to the file name.
- function: optional; details about the functionality captured (normalfault) or a specific protocol capture (PTP).
- file extension: PCAP (network capture) or CSV (flow file).
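A sketch of loading one processed flow file with the column names from the accompanying headers file (the header file's exact format is an assumption):

```python
import pandas as pd

# Assumes the headers file lists whitespace-separated column names and the
# CSV itself has no header row; verify both against the actual files.
with open("processed/iec104/headers_iec104[all].txt") as f:
    columns = f.read().split()

df = pd.read_csv("processed/iec104/attack-data/capture104-dosattack.csv",
                 header=None, names=columns)
print(df.head())
```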
Results
results
  test1-iec104
    model-test1-iec104.pkl
    test1-iec104.log
  test1-iec61850
    model-test1-iec61850.pkl
    test1-iec61850.log
  test2-iec61850
    model-test2-iec61850.pkl
    test2-iec61850.log
Description
The outcomes of different test executions are available as follows:
test1-iec104: IEC 104 protocol, for all attacks and the attack-free scenario
test1-iec61850: IEC 61850 protocol, for the fuzzy attack with fault injection and the attack-free scenario
test2-iec61850: IEC 61850 protocol, for the fuzzy attack under normal operation and the attack-free scenario
Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).
Source Code
Tools to process network captures from IEC 61850 and IEC 104 can be found in the GitHub repository.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is well suited to natural language processing tasks such as sentiment analysis.
For more datasets, click here.
This Yelp reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.
To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel). Next, take a look at the first few rows of data to get a feel for what you're working with:
| text | label |
| --- | --- |
| So there is no way for me to plug it in here in the US unless I go by... | 0 |
- This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
- This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
- This dataset could be used to build a natural language processing system that generates fake Yelp reviews.
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
File: test.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
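A minimal classification sketch against train.csv, using the columns documented above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")
X_tr, X_te, y_tr, y_te = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=0)

# TF-IDF unigrams and bigrams feeding a logistic regression baseline
vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
print(classification_report(y_te, clf.predict(vec.transform(X_te))))
```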
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validation, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
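Following those naming conventions, a minimal loading sketch:

```python
import pandas as pd

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")
print(train.shape, val.shape, test.shape)
```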
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used directly in the final prediction. In my understanding, through EDA of the merged dataset we can get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (copied from the description provided with the original dataset)
Train.csv
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster; cluster is a grouping of similar stores.
- Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government; a transferred day is more like a normal day than a holiday. To find the day it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by a type Work Day, a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

Note: There is a transactions column in the training dataset which displays the sales transactions on that particular date.

Test.csv
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

Note: There is no transactions column in the test dataset as there is in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.
submission.csv - A sample submission file in the correct format.
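A small EDA sketch on the merged training file (the file name and the transactions column name follow the description above; adapt as needed):

```python
import pandas as pd

df = pd.read_csv("Train.csv", parse_dates=["date"])       # merged training file
# The transactions column is absent from the test set (per the note above),
# so drop it before modeling and keep it only for EDA.
df = df.drop(columns=["transactions"], errors="ignore")
print(df.groupby("family")["sales"].mean().sort_values(ascending=False).head())
```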
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The datasets used in this research work relate to the aims of Sustainable Development Goal 7 (SDG 7). They were used to train and test a machine learning model based on an artificial neural network, along with other machine learning regression models, for predicting scores for the realization of SDG 7 aims. The train dataset was created from data covering 2013 to 2021 and includes 261 samples; the test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The train and test datasets are available as XLSX and CSV files. A detailed description of the data is available in a PDF file.
This data package is associated with the publication "Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning" submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results for the purposes of reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning, and 2) feature permutation importance to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; "sl-archive-whondrs_flmd.csv") for a list of all files contained in this data package and descriptions of each. Please see the data dictionary (dd; "sl-archive-whondrs_dd.csv") for a list of all column headers contained within the comma separated value (csv) files in this data package and descriptions of each.

The GitHub repository is organized into five top-level directories: (1) "input_data" holds the training data for the ML models; (2) "ml_models" holds machine learning models trained on the data in "input_data"; (3) "scripts" contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) "examples" contains visualizations of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) "output_data" holds the overall results of the ML model on that branch.

Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ branch-to-branch. The "main-*" branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.
There is also one hidden directory, ".github/workflows". It contains information for running the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
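To illustrate the feature permutation importance technique named above (on synthetic stand-in data, not the archived workflow):

```python
# Sketch of feature permutation importance with scikit-learn; synthetic data
# stands in for the WHONDRS features, so this is the technique only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=500)  # 2 informative features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))  # features 0 and 3 should dominate
```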