13 datasets found
  1. Data for: Advances and critical assessment of machine learning techniques...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Mar 3, 2023
    Cite
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. http://doi.org/10.5061/dryad.zgmsbccg7
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Slovak University of Technology in Bratislava
    Comenius University Bratislava
    Authors
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference for machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their docking score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected. The provided in-vivo and in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for training the ML models in the referenced study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets within each of the in-vivo 80-5-15 splits were created to monitor the convergence of the training process. These subsets were constructed so that each subset contains all compounds from the previous one (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).

    Methods: Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1].

    Reference: [1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.
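    The nested-subset construction described above is mechanical and can be reproduced in a few lines. The sketch below is a minimal illustration under assumed names: the columns split, partition, and compound_id are hypothetical placeholders, not the actual schema of in-vivo-splits-data.csv.

```python
# Illustrative sketch only: rebuild nested train subsets where each subset keeps
# all compounds of the previous one and grows by one eighth of the 80% train set.
# Column names ("split", "partition", "compound_id") are assumptions.
import numpy as np
import pandas as pd

splits = pd.read_csv("in-vivo-splits-data.csv")
train_I = splits.loc[
    (splits["split"] == "I") & (splits["partition"] == "train"), "compound_id"
].to_numpy()

rng = np.random.default_rng(seed=0)
order = rng.permutation(train_I)      # fix one random ordering of the train set
eighth = len(order) // 8

nested = {f"in_vivo_{10 * k}_I": order[: k * eighth] for k in range(1, 9)}

# Each subset contains the previous one by construction.
assert set(nested["in_vivo_10_I"]).issubset(nested["in_vivo_20_I"])
```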

  2. MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and...

    • zenodo.org
    csv, zip
    Updated Jun 18, 2025
    Cite
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta (2025). MetaCost XGBoost Training and Evaluation Dataset with MATLAB Codes and files for generating proxies [Dataset]. http://doi.org/10.5281/zenodo.15666484
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhishek Borah; Xavier Emery; Parag Jyoti Dutta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on an encoded indicator for the training (=1) and testing (=2) data. All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository, except for the source code, which is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.

    Details on the input files for confidence analysis: Analysis1.csv and Analysis2.csv
    These files contain two columns for the test data: column 1 indicates whether the predicted and true alteration match, and column 2 gives the probability of a correct classification, according to bootstrapped samples (Analysis1.csv) or to simulated proxies (Analysis2.csv).
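    As a rough illustration of how the traditional subset could be used, the sketch below trains a plain (not cost-sensitive/MetaCost) multiclass XGBoost baseline on Trad_Train.csv and evaluates it on Trad_Test.csv. It assumes the target column is literally named Alteration (integer codes 1-6) and that every remaining column is a numeric feature; adjust to the actual file layout as needed.

```python
# Hedged sketch: multiclass XGBoost baseline on the traditional (non-proxy) dataset.
import pandas as pd
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

train = pd.read_csv("Trad_Train.csv")
test = pd.read_csv("Trad_Test.csv")

# XGBoost expects 0-based class labels, so shift the 1-6 alteration codes.
X_train, y_train = train.drop(columns=["Alteration"]), train["Alteration"] - 1
X_test, y_test = test.drop(columns=["Alteration"]), test["Alteration"] - 1

model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="mlogloss")
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```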
  3. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    Available download formats: zip (1448235 bytes)
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
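    The sketch below illustrates this kind of generation in miniature; the specific risk-factor weighting and the logistic link are assumptions made for illustration, not the authors' actual generator.

```python
# Illustrative synthetic generation of binary risk factors with a correlated label.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 70_000

df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "chest_pain": rng.integers(0, 2, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
})

# More risk factors (and older age) -> higher probability of the high-risk label.
score = df[["chest_pain", "hypertension", "diabetes", "smoker"]].sum(axis=1) \
        + (df["age"] > 60).astype(int)
prob = 1 / (1 + np.exp(-(score - 2.5)))   # logistic link keeps classes roughly balanced
df["risk_label"] = (rng.random(n) < prob).astype(int)

print(df["risk_label"].value_counts(normalize=True))
```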

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  4. Table_1_A model for predicting physical function upon discharge of...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jul 21, 2023
    Cite
    Chen, Chia-Yu; Yang, Chao-Tung; Hsu, Pi-Shan; Lin, Shih-Yi; Chu, Wei-Min; Chen, Pei-Yu; Hao, Man-Ling; Tsan, Yu-Tse; Chen, Hong-Ming; Chan, Wei-Chan (2023). Table_1_A model for predicting physical function upon discharge of hospitalized older adults in Taiwan—a machine learning approach based on both electronic health records and comprehensive geriatric assessment.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001024098
    Explore at:
    Dataset updated
    Jul 21, 2023
    Authors
    Chen, Chia-Yu; Yang, Chao-Tung; Hsu, Pi-Shan; Lin, Shih-Yi; Chu, Wei-Min; Chen, Pei-Yu; Hao, Man-Ling; Tsan, Yu-Tse; Chen, Hong-Ming; Chan, Wei-Chan
    Description

    Background: Predicting physical function upon discharge among hospitalized older adults is important. This study aimed to develop a prediction model of physical function upon discharge using a machine learning algorithm with electronic health records (EHRs) and comprehensive geriatric assessments (CGAs) among hospitalized older adults in Taiwan.

    Methods: Data was retrieved from the clinical database of a tertiary medical center in central Taiwan. Older adults admitted to the acute geriatric unit from January 2012 to December 2018 were included for analysis, while those with missing data were excluded. From the EHR and CGA data, a total of 52 clinical features were used for model building. Three different machine learning algorithms were used: XGBoost, random forest, and logistic regression.

    Results: In total, 1,755 older adults were included in the final analysis, with a mean age of 80.68 years. For linear models of physical function upon discharge, the prediction accuracy was 87% for XGBoost, 85% for random forest, and 32% for logistic regression. For classification models of physical function upon discharge, the accuracy for random forest, logistic regression, and XGBoost was 94, 92, and 92%, respectively. The auROC reached 98% for XGBoost and random forest, while logistic regression had an auROC of 97%. The top 3 features by importance were activity of daily living (ADL) at baseline, ADL during admission, and mini nutritional status (MNA) during admission.

    Conclusion: The results showed that physical function upon discharge among hospitalized older adults can be predicted accurately during admission using a machine learning model with data taken from EHRs and CGAs.

  5. Cryptocurrency extra data - IOTA

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - IOTA [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-iota
    Explore at:
    Available download formats: zip (1196411839 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks; a sketch of one such approximation follows this list. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.
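    For reference, here is a hedged sketch of one common OHLCV-based VWAP approximation (typical price, optionally volume-weighted over a rolling window). This is an illustration only, not necessarily the approximation used to build this dataset, and the file name is assumed.

```python
# Hedged sketch: approximate VWAP from OHLCV candles.
import pandas as pd

df = pd.read_csv("iota.csv")                                 # assumed file name
df = df.sort_values("timestamp").set_index("timestamp")

typical_price = (df["High"] + df["Low"] + df["Close"]) / 3   # per-minute approximation
df["VWAP_approx"] = typical_price

# Volume-weighted variant over a rolling 15-minute window.
num = (typical_price * df["Volume"]).rolling(15).sum()
den = df["Volume"].rolling(15).sum()
df["VWAP_15min"] = num / den
```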

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  6. Cryptocurrency extra data - Maker

    • kaggle.com
    zip
    Updated Jan 20, 2022
    + more versions
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Maker [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-maker
    Explore at:
    Available download formats: zip (1150531041 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  7. Cryptocurrency extra data - TRON

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - TRON [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-tron
    Explore at:
    Available download formats: zip (1253566627 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  8. Cryptocurrency extra data - Cardano

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Cardano [Dataset]. https://www.kaggle.com/datasets/yamqwe/cryptocurrency-extra-data-cardano/code
    Explore at:
    Available download formats: zip (1254179058 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  9. Cryptocurrency extra data - Monero

    • kaggle.com
    zip
    Updated Jan 20, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Monero [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-monero
    Explore at:
    Available download formats: zip (1204684577 bytes)
    Dataset updated
    Jan 20, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  10. Cryptocurrency extra data - Ethereum Classic

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Ethereum Classic [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-ethereum-classic
    Explore at:
    Available download formats: zip (1259913408 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  11. Cryptocurrency extra data - Binance Coin

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Binance Coin [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-binance-coin
    Explore at:
    Available download formats: zip (1246039618 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  12. Cryptocurrency extra data - Bitcoin Cash

    • kaggle.com
    zip
    Updated Jan 19, 2022
    Cite
    Yam Peleg (2022). Cryptocurrency extra data - Bitcoin Cash [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-bitcoin-cash
    Explore at:
    Available download formats: zip (1253909016 bytes)
    Dataset updated
    Jan 19, 2022
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated; see the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great notebook series on the SIIM-ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: The exact VWAP calculation formula is still unclear. The dataset currently uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks. [Waiting for competition hosts' input]
    • Target Labeling: There are some mismatches with the original target provided by the hosts at some time intervals; at all other intervals it is identical. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  13. Cybersecurity 🪪 Intrusion 🦠 Detection Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Dinesh Naveen Kumar Samudrala (2025). Cybersecurity 🪪 Intrusion 🦠 Detection Dataset [Dataset]. https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dinesh Naveen Kumar Samudrala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.

    1. Understanding the Features

    The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.

    A. Network-Based Features

    These features describe network-level information such as packet size, protocol type, and encryption methods.

    1. network_packet_size (Packet Size in Bytes)

      • Represents the size of network packets, ranging from 64 to 1500 bytes.
      • Packets on the lower end (~64 bytes) may indicate control messages, while larger packets (~1500 bytes) often carry bulk data.
      • Attackers may use abnormally small or large packets for reconnaissance or exploitation attempts.
    2. protocol_type (Communication Protocol)

      • The protocol used in the session: TCP, UDP, or ICMP.
      • TCP (Transmission Control Protocol): Reliable, connection-oriented (common for HTTP, HTTPS, SSH).
      • UDP (User Datagram Protocol): Faster but less reliable (used for VoIP, streaming).
      • ICMP (Internet Control Message Protocol): Used for network diagnostics (ping); often abused in Denial-of-Service (DoS) attacks.
    3. encryption_used (Encryption Protocol)

      • Values: AES, DES, None.
      • AES (Advanced Encryption Standard): Strong encryption, commonly used.
      • DES (Data Encryption Standard): Older encryption, weaker security.
      • None: Indicates unencrypted communication, which can be risky.
      • Attackers might use no encryption to avoid detection or weak encryption to exploit vulnerabilities.

    B. User Behavior-Based Features

    These features track user activities, such as login attempts and session duration.

    1. login_attempts (Number of Logins)

      • High values might indicate brute-force attacks (repeated login attempts).
      • Typical users have 1–3 login attempts, while an attack may have hundreds or thousands.
    2. session_duration (Session Length in Seconds)

      • A very long session might indicate unauthorized access or persistence by an attacker.
      • Attackers may try to stay connected to maintain access.
    3. failed_logins (Failed Login Attempts)

      • High failed login counts indicate credential stuffing or dictionary attacks.
      • Many failed attempts followed by a successful login could suggest an account was compromised.
    4. unusual_time_access (Login Time Anomaly)

      • A binary flag (0 or 1) indicating whether access happened at an unusual time.
      • Attackers often operate outside normal business hours to evade detection.
    5. ip_reputation_score (Trustworthiness of IP Address)

      • A score from 0 to 1, where higher values indicate suspicious activity.
      • IP addresses associated with botnets, spam, or previous attacks tend to have higher scores.
    6. browser_type (User’s Browser)

      • Common browsers: Chrome, Firefox, Edge, Safari.
      • Unknown: Could be an indicator of automated scripts or bots.

    2. Target Variable (attack_detected)

    • Binary classification: 1 means an attack was detected, 0 means normal activity.
    • The dataset is useful for supervised machine learning, where a model learns from labeled attack patterns.

    3. Possible Use Cases

    This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:

    A. Machine Learning-Based Intrusion Detection

    1. Supervised Learning Approaches

      • Classification Models (Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM)
      • Train the model using labeled data (attack_detected as the target).
      • Evaluate using accuracy, precision, recall, F1-score.
    2. Deep Learning Approaches

      • Use Neural Networks (DNN, LSTM, CNN) for pattern recognition.
      • LSTMs work well for time-series-based network traffic analysis.

    B. Anomaly Detection (Unsupervised Learning)

    If attack labels are missing, anomaly detection can be used:

    • Autoencoders: Learn normal traffic and flag anomalies.
    • Isolation Forest: Detects outliers based on feature isolation.
    • One-Class SVM: Learns normal behavior and detects deviations.
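    A minimal unsupervised sketch along these lines is shown below; the CSV file name, the subset of numeric features, and the contamination value are all assumptions for illustration.

```python
# Hedged sketch: Isolation Forest anomaly detection on numeric features,
# ignoring the attack_detected label.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("cybersecurity_intrusion_data.csv")        # assumed file name
features = ["network_packet_size", "login_attempts", "session_duration",
            "failed_logins", "unusual_time_access", "ip_reputation_score"]

iso = IsolationForest(contamination=0.1, random_state=0)    # illustrative settings
df["anomaly"] = iso.fit_predict(df[features])               # -1 = anomaly, 1 = normal

print(df["anomaly"].value_counts())
```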

    C. Rule-Based Detection

    • If certain thresholds are met (e.g., failed_logins > 10 & ip_reputation_score > 0.8), an alert is triggered.
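    A minimal sketch of such a threshold rule, using the example values quoted above (the file name is an assumption):

```python
# Hedged sketch: flag sessions that breach simple rule-based thresholds.
import pandas as pd

df = pd.read_csv("cybersecurity_intrusion_data.csv")        # assumed file name
alerts = df[(df["failed_logins"] > 10) & (df["ip_reputation_score"] > 0.8)]
print(f"{len(alerts)} sessions flagged for review")
```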

    4. Challenges & Considerations

    • Adversarial Attacks: Attackers may modify traffic to evade detection.
    • Concept Drift: Cyber threats...