Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking scores prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set,denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol have been rejected. The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the training process convergence. These subsets were constructed in such a manner, that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V),..., in_vivo_80_(I, II, ... V). Methods Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1]. Reference[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum. Chem. (2023) DOI: 10.1002/qua.27110.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv) without any missing values and includes original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) based on encoded details for the training (=1) and testing data (=2). All supporting files—including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for confidence analysis routines are included in the repository except the source code, which is on GitHub Repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input files alldata.csvused to generate the proxies_alldata.csv, it also contains Analysis1.csv and Analysis2.csvfor performing confidence analysis. To run the executable files in place of the .m scripts in MATLAB, users must install the MATLAB Runtime 2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
Analysis1.csv and Analysis2.csv
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.
The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.
This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.
chest_pain): Presence of chest pain, a common symptom of heart disease.shortness_of_breath): Difficulty breathing, often associated with heart conditions.fatigue): Persistent tiredness without an obvious cause.palpitations): Irregular or rapid heartbeat.dizziness): Episodes of lightheadedness or fainting.swelling): Swelling due to fluid retention, often linked to heart failure.radiating_pain): Radiating pain, a hallmark of angina or heart attacks.cold_sweats): Symptoms commonly associated with acute cardiac events.age): Patient's age in years (continuous variable).hypertension): History of hypertension (Yes/No).cholesterol_high): Elevated cholesterol levels (Yes/No).diabetes): Diagnosis of diabetes (Yes/No).smoker): Whether the patient is a smoker (Yes/No).obesity): Obesity status (Yes/No).family_history): Family history of cardiovascular conditions (Yes/No).risk_label): Binary label indicating the risk of heart disease:
0: Low risk1: High riskThis dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:
- Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
- Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
The design of this dataset was inspired by the following resources:
This dataset can be used for a variety of purposes:
Machine Learning Research:
Healthcare Analytics:
Educational Purposes:
Facebook
TwitterBackgroundPredicting physical function upon discharge among hospitalized older adults is important. This study has aimed to develop a prediction model of physical function upon discharge through use of a machine learning algorithm using electronic health records (EHRs) and comprehensive geriatrics assessments (CGAs) among hospitalized older adults in Taiwan.MethodsData was retrieved from the clinical database of a tertiary medical center in central Taiwan. Older adults admitted to the acute geriatric unit during the period from January 2012 to December 2018 were included for analysis, while those with missing data were excluded. From data of the EHRs and CGAs, a total of 52 clinical features were input for model building. We used 3 different machine learning algorithms, XGBoost, random forest and logistic regression.ResultsIn total, 1,755 older adults were included in final analysis, with a mean age of 80.68 years. For linear models on physical function upon discharge, the accuracy of prediction was 87% for XGBoost, 85% for random forest, and 32% for logistic regression. For classification models on physical function upon discharge, the accuracy for random forest, logistic regression and XGBoost were 94, 92 and 92%, respectively. The auROC reached 98% for XGBoost and random forest, while logistic regression had an auROC of 97%. The top 3 features of importance were activity of daily living (ADL) at baseline, ADL during admission, and mini nutritional status (MNA) during admission.ConclusionThe results showed that physical function upon discharge among hospitalized older adults can be predicted accurately during admission through use of a machine learning model with data taken from EHRs and CGAs.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterThis dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automaticlly collecting market data for G-Research crypto forecasting competition. The data is of the 1-minute resolution, collected for all competition assets and both retrieval and uploading are fully automated. see discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset u units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest.
The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. Purged TimesSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on SIIM ISIC melanoma detection competition here
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still needed to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media" alt="">
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media" alt="">
This data is being collected automatically from the crypto exchange Binance.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. Below, I’ll explain each aspect in detail, including the dataset structure, feature importance, possible analysis approaches, and how it can be used for machine learning.
The dataset consists of network-based and user behavior-based features. Each feature provides valuable information about potential cyber threats.
These features describe network-level information such as packet size, protocol type, and encryption methods.
network_packet_size (Packet Size in Bytes)
protocol_type (Communication Protocol)
encryption_used (Encryption Protocol)
These features track user activities, such as login attempts and session duration.
login_attempts (Number of Logins)
session_duration (Session Length in Seconds)
failed_logins (Failed Login Attempts)
unusual_time_access (Login Time Anomaly)
0 or 1) indicating whether access happened at an unusual time.ip_reputation_score (Trustworthiness of IP Address)
browser_type (User’s Browser)
attack_detected)1 means an attack was detected, 0 means normal activity.This dataset can be used for intrusion detection systems (IDS) and cybersecurity research. Some key applications include:
Supervised Learning Approaches
attack_detected as the target).Deep Learning Approaches
If attack labels are missing, anomaly detection can be used: - Autoencoders: Learn normal traffic and flag anomalies. - Isolation Forest: Detects outliers based on feature isolation. - One-Class SVM: Learns normal behavior and detects deviations.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking scores prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set,denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol have been rejected. The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study. The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the training process convergence. These subsets were constructed in such a manner, that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and was enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V),..., in_vivo_80_(I, II, ... V). Methods Molecular docking calculations and the machine learning approaches are described in the Computational details section of [1]. Reference[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum. Chem. (2023) DOI: 10.1002/qua.27110.