13 datasets found
  1. Data for: Advances and critical assessment of machine learning techniques...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Mar 3, 2023
    Cite
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. http://doi.org/10.5061/dryad.zgmsbccg7
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Slovak University of Technology in Bratislava
    Comenius University Bratislava
    Authors
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference for machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their docking score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected. The provided in-vivo and in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for training the ML models in the referenced study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets within each of the in-vivo 80-5-15 splits were created to monitor the convergence of the training process. These subsets were constructed so that each subset contains all compounds from the previous one (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).

    Methods: Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].

    Reference: [1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.
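    The nested-subset construction described above is mechanical and can be reproduced in a few lines. The sketch below is a minimal illustration under assumed names: the columns split, partition, and compound_id are hypothetical placeholders, not the actual schema of in-vivo-splits-data.csv.

```python
# Illustrative sketch only: rebuild nested train subsets where each subset keeps
# all compounds of the previous one and grows by one eighth of the 80% train set.
# Column names ("split", "partition", "compound_id") are assumptions.
import numpy as np
import pandas as pd

splits = pd.read_csv("in-vivo-splits-data.csv")
train_I = splits.loc[
    (splits["split"] == "I") & (splits["partition"] == "train"), "compound_id"
].to_numpy()

rng = np.random.default_rng(seed=0)
order = rng.permutation(train_I)      # fix one random ordering of the train set
eighth = len(order) // 8

nested = {f"in_vivo_{10 * k}_I": order[: k * eighth] for k in range(1, 9)}

# Each subset contains the previous one by construction.
assert set(nested["in_vivo_10_I"]).issubset(nested["in_vivo_20_I"])
```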

  2. MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and...

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded indicator for the training (=1) and testing (=2) data. All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository, except for the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis: Analysis1.csv and Analysis2.csv
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alteration match, and column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
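    As a rough illustration of how the traditional subset could be used, the sketch below trains a plain (not cost-sensitive/MetaCost) multiclass XGBoost baseline on Trad_Train.csv and evaluates it on Trad_Test.csv. It assumes the target column is literally named Alteration (integer codes 1-6) and that every remaining column is a numeric feature; adjust to the actual file layout as needed.

```python
# Hedged sketch: multiclass XGBoost baseline on the traditional (non-proxy) dataset.
import pandas as pd
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

train = pd.read_csv("Trad_Train.csv")
test = pd.read_csv("Trad_Test.csv")

# XGBoost expects 0-based class labels, so shift the 1-6 alteration codes.
X_train, y_train = train.drop(columns=["Alteration"]), train["Alteration"] - 1
X_test, y_test = test.drop(columns=["Alteration"]), test["Alteration"] - 1

model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="mlogloss")
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```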
  3. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    Available download formats: zip (1448235 bytes)
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
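    The sketch below illustrates this kind of generation in miniature; the specific risk-factor weighting and the logistic link are assumptions made for illustration, not the authors' actual generator.

```python
# Illustrative synthetic generation of binary risk factors with a correlated label.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 70_000

df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "chest_pain": rng.integers(0, 2, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
})

# More risk factors (and older age) -> higher probability of the high-risk label.
score = df[["chest_pain", "hypertension", "diabetes", "smoker"]].sum(axis=1) \
        + (df["age"] > 60).astype(int)
prob = 1 / (1 + np.exp(-(score - 2.5)))   # logistic link keeps classes roughly balanced
df["risk_label"] = (rng.random(n) < prob).astype(int)

print(df["risk_label"].value_counts(normalize=True))
```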

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  4. Table_1_A model for predicting physical function upon discharge of...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jul 21, 2023
    Cite
    Chen, Chia-Yu; Yang, Chao-Tung; Hsu, Pi-Shan; Lin, Shih-Yi; Chu, Wei-Min; Chen, Pei-Yu; Hao, Man-Ling; Tsan, Yu-Tse; Chen, Hong-Ming; Chan, Wei-Chan (2023). Table_1_A model for predicting physical function upon discharge of hospitalized older adults in Taiwan—a machine learning approach based on both electronic health records and comprehensive geriatric assessment.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001024098
    Explore at:
    Dataset updated
    Jul 21, 2023
    Authors
    Chen, Chia-Yu; Yang, Chao-Tung; Hsu, Pi-Shan; Lin, Shih-Yi; Chu, Wei-Min; Chen, Pei-Yu; Hao, Man-Ling; Tsan, Yu-Tse; Chen, Hong-Ming; Chan, Wei-Chan
    Description

    Background: Predicting physical function upon discharge among hospitalized older adults is important. This study aimed to develop a prediction model of physical function upon discharge using a machine learning algorithm with electronic health records (EHRs) and comprehensive geriatric assessments (CGAs) among hospitalized older adults in Taiwan.

    Methods: Data was retrieved from the clinical database of a tertiary medical center in central Taiwan. Older adults admitted to the acute geriatric unit from January 2012 to December 2018 were included for analysis, while those with missing data were excluded. From the EHR and CGA data, a total of 52 clinical features were used for model building. Three different machine learning algorithms were used: XGBoost, random forest, and logistic regression.

    Results: In total, 1,755 older adults were included in the final analysis, with a mean age of 80.68 years. For linear models of physical function upon discharge, the prediction accuracy was 87% for XGBoost, 85% for random forest, and 32% for logistic regression. For classification models of physical function upon discharge, the accuracy for random forest, logistic regression, and XGBoost was 94, 92, and 92%, respectively. The auROC reached 98% for XGBoost and random forest, while logistic regression had an auROC of 97%. The top 3 features by importance were activity of daily living (ADL) at baseline, ADL during admission, and mini nutritional status (MNA) during admission.

    Conclusion: The results showed that physical function upon discharge among hospitalized older adults can be predicted accurately during admission using a machine learning model with data taken from EHRs and CGAs.

  5. Cryptocurrency extra data - IOTA

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - IOTA [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-iota
    Explore at:
    Available download formats: zip (1196411839 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks; a sketch of one such approximation follows this list. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.
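    For reference, here is a hedged sketch of one common OHLCV-based VWAP approximation (typical price, optionally volume-weighted over a rolling window). This is an illustration only, not necessarily the approximation used to build this dataset, and the file name is assumed.

```python
# Hedged sketch: approximate VWAP from OHLCV candles.
import pandas as pd

df = pd.read_csv("iota.csv")                                 # assumed file name
df = df.sort_values("timestamp").set_index("timestamp")

typical_price = (df["High"] + df["Low"] + df["Close"]) / 3   # per-minute approximation
df["VWAP_approx"] = typical_price

# Volume-weighted variant over a rolling 15-minute window.
num = (typical_price * df["Volume"]).rolling(15).sum()
den = df["Volume"].rolling(15).sum()
df["VWAP_15min"] = num / den
```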

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  6. Cryptocurrency extra data - Maker

    • kaggle.com
    zip
    Updated Jan 20, 2022
    + more versions
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Maker [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-maker
    Explore at:
    Available download formats: zip (1150531041 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  7. Cryptocurrency extra data - TRON

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - TRON [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-tron
    Explore at:
    Available download formats: zip (1253566627 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  8. Cryptocurrency extra data - Cardano

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Cardano [Dataset]. https://www.kaggle.com/datasets/yamqwe/cryptocurrency-extra-data-cardano/code
    Explore at:
    Available download formats: zip (1254179058 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  9. Cryptocurrency extra data - Monero

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Monero [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-monero
    Explore at:
    Available download formats: zip (1204684577 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  10. Cryptocurrency extra data - Ethereum Classic

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Ethereum Classic [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-ethereum-classic
    Explore at:
    Available download formats: zip (1259913408 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  11. Cryptocurrency extra data - Binance Coin

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Binance Coin [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-binance-coin
    Explore at:
    Available download formats: zip (1246039618 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  12. Cryptocurrency extra data - Bitcoin Cash

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Bitcoin Cash [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-bitcoin-cash
    Explore at:
    Available download formats: zip (1253909016 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  13. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging from 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, F1-score.
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:

    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.
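    A minimal unsupervised sketch along these lines is shown below; the CSV file name, the subset of numeric features, and the contamination value are all assumptions for illustration.

```python
# Hedged sketch: Isolation Forest anomaly detection on numeric features,
# ignoring the attack_detected label.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("cybersecurity_intrusion_data.csv")        # assumed file name
features = ["network_packet_size", "login_attempts", "session_duration",
            "failed_logins", "unusual_time_access", "ip_reputation_score"]

iso = IsolationForest(contamination=0.1, random_state=0)    # illustrative settings
df["anomaly"] = iso.fit_predict(df[features])               # -1 = anomaly, 1 = normal

print(df["anomaly"].value_counts())
```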

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
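    A minimal sketch of such a threshold rule, using the example values quoted above (the file name is an assumption):

```python
# Hedged sketch: flag sessions that breach simple rule-based thresholds.
import pandas as pd

df = pd.read_csv("cybersecurity_intrusion_data.csv")        # assumed file name
alerts = df[(df["failed_logins"] > 10) & (df["ip_reputation_score"] > 0.8)]
print(f"{len(alerts)} sessions flagged for review")
```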

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...