CPSC 2018
The first dataset is a preprocessed version of the CPSC 2018 dataset, which contains 6877 ECG recordings. We preprocessed the dataset by resampling the ECG signals to 250 Hz and equalizing the ECG signal length to 60 seconds, yielding a signal length of T=15,000 data points per recording. For the hyperparameter study, we employed a fixed train-valid-test split with ratio 60-20-20, while for the final evaluations, including the comparison with state-of-the-art methods and the ablation studies, we used a 10-fold cross-validation strategy. The raw CPSC 2018 dataset can be downloaded from the website of the PhysioNet/Computing in Cardiology Challenge 2020. (License: Creative Commons Attribution 4.0 International Public License.)

PTB-XL (Super-Diag.)
The second dataset is a pre-processed version of PTB-XL, a large multi-label dataset of 21,799 clinical 12-lead ECG records of 10 seconds each. PTB-XL contains 71 ECG statements, categorized into 44 diagnostic, 19 form, and 12 rhythm classes. In addition, the diagnostic category can be divided into 24 sub-classes and 5 coarse-grained super-classes. In our pre-processed version, we utilize the super-diagnostic labels for classification and the recommended train-valid-test splits, sampled at 100 Hz. We select only samples with at least one label in the super-diagnostic category, without applying any further preprocessing. The raw PTB-XL dataset can be downloaded from the PhysioNet/PTB-XL website. (License: Creative Commons Attribution 4.0 International Public License.)
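A minimal sketch of the described preprocessing (resample to 250 Hz, fix length to 60 s, i.e., T = 15,000 samples), assuming scipy-based resampling and zero-padding; the function and variable names are illustrative, not the pipeline's actual code:

```python
import numpy as np
from scipy.signal import resample

TARGET_FS = 250          # Hz
TARGET_SECONDS = 60      # s
TARGET_LEN = TARGET_FS * TARGET_SECONDS  # 15,000 samples

def preprocess_ecg(signal: np.ndarray, fs: int) -> np.ndarray:
    """signal: (n_leads, n_samples) raw ECG sampled at fs Hz."""
    # Resample each lead to 250 Hz.
    n_resampled = int(signal.shape[1] * TARGET_FS / fs)
    signal = resample(signal, n_resampled, axis=1)
    # Equalize length: zero-pad short recordings, truncate long ones.
    if signal.shape[1] < TARGET_LEN:
        pad = TARGET_LEN - signal.shape[1]
        signal = np.pad(signal, ((0, 0), (0, pad)))
    return signal[:, :TARGET_LEN]
```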
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Project Documentation: Cucumber Disease Detection
Introduction: As part of the "Cucumber Disease Detection" project, a machine learning model for the automatic detection of diseases in cucumber plants is developed. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. A dataset of pictures of cucumber plants is used to train and test the model.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming, making this a direct real-world application in agriculture.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
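As an illustration of the augmentation step, here is a minimal sketch using torchvision; the specific transforms and parameter values are assumptions, not the project's documented settings:

```python
from torchvision import transforms

# Illustrative augmentation pipeline for 224x224 RGB plant images.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),          # mirror leaves
    transforms.RandomRotation(degrees=15),           # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], # ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```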
Exploratory Data Analysis (EDA)
The dataset was examined using visualizations such as scatter plots and histograms and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
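The following sketch shows what this setup could look like, assuming PyTorch/torchvision and scikit-learn; the 80/20 ratio, the MobileNet backbone choice, and the variable names (image_paths, labels) are illustrative assumptions:

```python
import torch.nn as nn
from torchvision import models
from sklearn.model_selection import train_test_split

# Stratified split of image paths and labels (assumed variables).
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42)

# Pre-trained backbone with a new 2-class head (healthy vs. diseased).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, 2)
```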
Model Development
The CNN architecture is defined by its layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
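A minimal training-loop sketch with early stopping and checkpointing is given below; the optimizer, learning rate, patience, and the train_loader/val_loader/evaluate helpers are assumptions for illustration (weight_decay provides the L2 regularization noted above):

```python
import torch

# model: the MobileNet from the previous sketch (assumed).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()
best_val_loss, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(100):
    model.train()
    for images, targets in train_loader:   # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)  # assumed validation helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping
```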
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.
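These metrics can be computed with scikit-learn, as in the following sketch (y_true and y_pred are assumed arrays of binary ground-truth and predicted labels):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```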
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include model performance and disease-detection precision. A comparison of the models employed shows the benefits and drawbacks of each. Challenges faced throughout the project and the methods used to solve them are also discussed.
Conclusion
Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakage
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106-550) were created from FB15k-237; the wikikg dataset was created from the OGB WikiKG 2 graph. In all datasets, the inference graph extends the training graph with new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task. The wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files.
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is intended for evaluation in an inference-only regime, pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
Average customer churn is 27%. Churn can take place when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
Run library(randomForest). Here we are using the default ntree (500) and the default mtry; for classification this is sqrt(p), where p is the number of independent variables (p/3 is the regression default).
Through the confusion matrix, the accuracy comes to 79.27%, marginally higher than the decision tree's 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".
Plot the model showing which variables reduce the Gini impurity the most and least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
Tune the model: mtry = 2 has the lowest OOB error rate.
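The original analysis uses R's randomForest/tuneRF; an equivalent sketch of tuning mtry via the OOB error in Python/scikit-learn (with X, y as an assumed feature matrix and churn labels) would be:

```python
from sklearn.ensemble import RandomForestClassifier

for max_features in [2, 4, 6, 8]:  # mtry candidates
    rf = RandomForestClassifier(n_estimators=500,
                                max_features=max_features,
                                oob_score=True, random_state=42)
    rf.fit(X, y)
    print(max_features, "OOB error:", 1 - rf.oob_score_)
```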
Use random forest with mtry = 2 and ntree = 200
Through the confusion matrix, the accuracy comes to 79.71%, marginally higher than the default model's 79.27% (ntree = 500, mtry = 4) and the decision tree's 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HME21 is an atomic structure dataset intended for neural network potential development. It was created during the development of PFP, a universal neural network potential for material discovery [1]. Each structure contains multiple elements and was sampled through high-temperature molecular dynamics simulations. There are a total of 37 elements in the HME21 dataset: H, Li, C, N, O, F, Na, Mg, Al, Si, P, S, Cl, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, Ru, Rh, Pd, Ag, In, Sn, Ba, Ir, Pt, Au, and Pb. Energies and forces were calculated by spin-polarized DFT using the PBE exchange-correlation functional as implemented in VASP [2] version 5.4.4. All structures are under periodic boundary conditions. For the details of the DFT calculation conditions and the structure sampling method, please see reference [1]. Please cite reference [1] if you use this dataset.
Files
HME21 consists of three files in extxyz format:
train.xyz: 19956 structures
valid.xyz: 2498 structures
test.xyz: 2495 structures
The structures were randomly split into training, validation, and test sub-datasets at a ratio of 8:1:1, and these are used as the training, validation, and test datasets for the neural network potential benchmark [1]. The target values are energy and atomic forces. The energy is shifted such that the energy of a single atom located in a vacuum becomes zero. Lengths are in angstroms (10^−10 m), and energies are in electronvolts (eV). As supplementary data, vasp_shift_energies.json, which contains the reference energy of a single atom for each element, is also included.
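The dataset does not prescribe a reader, but the extxyz files can be loaded, for example, with ASE; a minimal sketch:

```python
from ase.io import read

structures = read("train.xyz", index=":")  # list of Atoms objects
atoms = structures[0]
energy = atoms.get_potential_energy()      # shifted energy, in eV
forces = atoms.get_forces()                # atomic forces, in eV/Å
print(len(structures), energy, forces.shape)
```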
MIT License: https://opensource.org/licenses/MIT
The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot problem. First, we take the original train/test splits of the F-VQA dataset and combine them, keeping only the triples whose answers appear in the top 500 by occurrence frequency. Next, we randomly divide this set of answers into a new training (a.k.a. seen) split $\mathcal{A}_s$ and a testing (a.k.a. unseen) split $\mathcal{A}_u$ at a ratio of 1:1. In line with the standard F-VQA dataset, the division process is repeated 5 times. Each $(i,q,a)$ triplet in the original F-VQA dataset is assigned to the training set if $a \in \mathcal{A}_s$, and to the testing set otherwise. The overlap of answer instances between the training and testing sets is 2,565 in F-VQA, compared to 0 in ZS-F-VQA.
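A minimal sketch of this answer-based zero-shot split; the variable names and the handling of the top-500 cutoff are our assumptions for illustration:

```python
import random
from collections import Counter

def zero_shot_split(triples, top_k=500, seed=0):
    """triples: list of (image, question, answer) tuples."""
    # Keep only triples whose answer is among the top_k most frequent.
    top_answers = [a for a, _ in
                   Counter(a for _, _, a in triples).most_common(top_k)]
    rng = random.Random(seed)
    rng.shuffle(top_answers)
    seen = set(top_answers[: top_k // 2])          # A_s (seen answers)
    kept = [t for t in triples if t[2] in set(top_answers)]
    train = [t for t in kept if t[2] in seen]       # a in A_s
    test = [t for t in kept if t[2] not in seen]    # a in A_u
    return train, test
```

Running this with different seeds reproduces the "repeated 5 times" protocol, and by construction no answer appears in both splits.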
GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
NaNa (Name to Nationality) is a simple dataset that maps names to nationalities. It contains {train,dev,test}.{src,tgt} files. Each line in the *.src files contains a name, and the corresponding line in the *.tgt files contains its nationality.
I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.
STEP 2. Iterated over all pages and collected the title and the nationality. I regarded the title as a person if the Category section at the bottom of the page included "... births", and identified the person's nationality from the most frequent nationality word in that section.
STEP 3. Randomly split the data into train/dev/test in the ratio of 8:1:1 within each nationality group.
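A minimal sketch of such a per-nationality 8:1:1 split (the data structures are assumed); the resulting per-nationality counts follow in the table below:

```python
import random
from collections import defaultdict

def split_by_nationality(pairs, seed=0):
    """pairs: list of (name, nationality) tuples."""
    groups = defaultdict(list)
    for name, nat in pairs:
        groups[nat].append(name)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for nat, names in groups.items():
        rng.shuffle(names)
        n = len(names)
        n_train, n_dev = int(n * 0.8), int(n * 0.1)
        train += [(x, nat) for x in names[:n_train]]
        dev   += [(x, nat) for x in names[n_train:n_train + n_dev]]
        test  += [(x, nat) for x in names[n_train + n_dev:]]
    return train, dev, test
```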
| Nationality | Train | Dev | Test |
|--|--|--|--|
| Total 1112902 | 890248 | 111286 | 111368 |
| Afghan | 778 | 97 | 98 |
| Albanian | 2193 | 274 | 275 |
| Algerian | 1592 | 199 | 200 |
| American | 241772 | 30221 | 30222 |
| Andorran | 188 | 24 | 24 |
| Angolan | 504 | 63 | 63 |
| Argentine | 8926 | 1116 | 1116 |
| Armenian | 1600 | 200 | 201 |
| Aruban | 93 | 12 | 12 |
| Australian | 40536 | 5067 | 5067 |
| Austrian | 9192 | 1149 | 1149 |
| Azerbaijani | 1331 | 166 | 167 |
| Bahamian | 233 | 29 | 30 |
| Bahraini | 237 | 30 | 30 |
| Bangladeshi | 1636 | 204 | 205 |
| Barbadian | 372 | 47 | 47 |
| Basque | 961 | 120 | 121 |
| Belarusian | 2338 | 292 | 293 |
| Belgian | 7907 | 988 | 989 |
| Belizean | 148 | 19 | 19 |
| Beninese | 199 | 25 | 25 |
| Bermudian | 270 | 34 | 34 |
| Bhutanese | 144 | 18 | 18 |
| Bolivian | 657 | 82 | 83 |
| Bosniak | 81 | 10 | 11 |
| Botswana | 252 | 31 | 32 |
| Brazilian | 11234 | 1404 | 1405 |
| Breton | 118 | 15 | 15 |
| British | 45922 | 5740 | 5741 |
| Bruneian | 115 | 14 | 15 |
| Bulgarian | 3926 | 491 | 491 |
| Burkinabé | 289 | 36 | 37 |
| Burmese | 944 | 118 | 118 |
| Burundian | 140 | 17 | 18 |
| Cambodian | 360 | 45 | 46 |
| Cameroonian | 1028 | 129 | 129 |
| Canadian | 34152 | 4269 | 4270 |
| Catalan | 1717 | 215 | 215 |
| Chadian | 139 | 17 | 18 |
| Chilean | 2838 | 355 | 355 |
| Chinese | 9494 | 1187 | 1187 |
| Colombian | 2620 | 328 | 328 |
| Comorian | 54 | 7 | 7 |
| Congolese | 35 | 4 | 5 |
| Cuban | 1938 | 242 | 243 |
| Cypriot | 1016 | 127 | 128 |
| Czech | 7244 | 906 | 906 |
| Dane | 32 | 4 | 5 |
| Djiboutian | 54 | 7 | 7 |
| Dominican | 1580 | 198 | 198 |
| Dutch | 14916 | 1864 | 1865 |
| Ecuadorian | 874 | 109 | 110 |
| Egyptian | 2776 | 347 | 348 |
| Emirati | 621 | 78 | 78 |
| English | 77159 | 9645 | 9645 |
| Equatoguinean | 193 | 24 | 25 |
| Eritrean | 133 | 17 | 17 |
| Estonian | 2028 | 254 | 254 |
| Ethiopian | 733 | 92 | 92 |
| Faroese | 284 | 35 | 36 |
| Filipino | 3928 | 491 | 491 |
| Finn | 68 | 8 | 9 |
| French | 40841 | 5105 | 5106 |
| Gabonese | 180 | 23 | 23 |
| Gambian | 220 | 28 | 28 |
| Georgian | 262 | 33 | 33 |
| German | 42388 | 5299 | 5299 |
| Ghanaian | 2036 | 255 | 255 |
| Gibraltarian | 98 | 12 | 13 |
| Greek | 5975 | 747 | 747 |
| Grenadian | 139 | 17 | 18 |
| Guatemalan | 563 | 70 | 71 |
| Guinean | 584 | 73 | 74 |
| Guyanese | 358 | 45 | 45 |
| Haitian | 561 | 70 | 71 |
| Honduran | 500 | 63 | 63 |
| Hungarian | 7220 | 903 | 903 |
| I-Kiribati | 40 | 5 | 6 |
| Indian | 22692 | 2836 | 2837 |
| Indonesian | 2820 | 352 | 353 |
| Iranian | 5010 | 626 | 627 |
| Iraqi | 1252 | 157 | 157 |
| Irish | 11844 | 1481 | 1481 |
| Israeli | 5149 | 644 | 644 |
| Italian | 29336 | 3667 | 3668 |
| Jamaican | 1422 | 178 | 178 |
| Japanese | 21216 | 2652 | 2652 |
| Jordanian | 490 | 61 | 62 |
| Kazakh | 24 | 3 | 4 |
| Kenyan | 1609 | 201 | 202 |
| Korean | 7896 | 987 | 988 |
| Kuwaiti | 396 | 50 | 50 |
| Kyrgyz | 16 | 2 | 2 |
| Lao | 26 | 3 | 4 |
| Latvian | 1693 | 212 | 212 |
| Lebanese | 1246 | 156 | 156 |
| Liberian | 294 | 37 | 37 |
| Libyan | 271 | 34 | 34 |
| Lithuanian | 1979 | 247 | 248 |
| Macedonian | 1099 | 137 | 138 |
| Malagasy | 232 | 29 | 29 |
| Malawian | 219 | 27 | 28 |
| Malaysian | 2582 | 323 | 323 |
| Maldivian | 152 | 19 | 20 |
| Malian | 385 | 48 | 49 |
| Maltese | 663 | 83 | 83 |
| Manx | 150 | 19 | 19 |
| Marshallese | 32 | 4 | 4 |
| Mauritanian | 96 | 12 | 12 |
| Mauritian | 263 | 33 | 33 |
| Mexican | 8648 | 1081 | 1081 |
| Moldovan | 1000 | 125 | 125 |
| Mongolian | 504 | 63 | 64 |
| Montenegrin | 955 | 119 | 120 |
| Moroccan | 1457 | 182 | 183 |
| Mozambican | 210 | 26 | 27 |
| Namibian | 588 | 74 | 74 |
| Nauruan | 32 | 4 | 4 |
| Nepalese | 773 | 97 | 97 |
| Nicaraguan | 285 | 36 | 36 |
| Nigerian | 4060 | 507 | 508 |
| Nigerien | 143 | 18 | ... |
MIT License: https://opensource.org/licenses/MIT
This repo contains the data for our paper "SOUL: Towards Sentiment and Opinion Understanding of Language" in EMNLP 2023. Github repo
Statistics
The SOUL dataset comprises 15,028 statements related to 3,638 reviews, resulting in an average of 4.13 statements per review. To create training, development, and test sets, we split the reviews in a ratio of 6:1:3, respectively.
Split
| Split | True | False | Not-given | Total |
|--|--|--|--|--|
| Train | 2,182 | 3,675 | 2,159 | 8,834 |

3,000 8,834… See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/SOUL.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The drug-development process is time-consuming and expensive. In High-Throughput Screening (HTS), batches of compounds are tested against a biological target to measure each compound's ability to bind to it. Targets might be antibodies, for example. If a compound binds to the target, it is active for that target and is known as a hit.
Virtual screening is the computational or in silico screening of biological compounds and complements the HTS process. It is used to aid the selection of compounds for screening in HTS bioassays or for inclusion in a compound-screening library.
Drug discovery is the first stage of the drug-development process and involves finding compounds to test and screen against biological targets. This first stage is known as primary-screening and usually involves the screening of thousands of compounds.
This dataset is a collection of 21 bioassays (screens) that measure the activity of various compounds against different biological targets.
Each bioassay is split into test and train files.
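A hedged sketch for loading one assay's files and checking its class imbalance; the file names and the "Outcome" label column are assumptions about the CSV layout, not documented specifics:

```python
import pandas as pd

train = pd.read_csv("AID362_train.csv")  # assumed file name
test = pd.read_csv("AID362_test.csv")    # assumed file name
counts = train["Outcome"].value_counts() # assumed label column
print(counts)
print("minority class: {:.2%}".format(counts.min() / counts.sum()))
```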
Here are descriptions of some of the assays and their compounds. The source, unfortunately, does not have descriptions for every assay; that is the nature of the beast when finding this kind of data, as was also pointed out in the original study.
AID362 details the results of a primary screening bioassay for formylpeptide receptor ligand binding, from the New Mexico Center for Molecular Discovery. It is a relatively small dataset with 4,279 compounds and a ratio of 1 active to 70 inactive compounds (1.4% minority class). The compounds were selected on the basis of preliminary virtual screening of approximately 480,000 drug-like small molecules from Chemical Diversity Laboratories.
AID604 is a primary screening bioassay for Rho kinase 2 inhibitors from the Scripps Research Institute Molecular Screening Center. The bioassay contains activity information of 59,788 compounds with a ratio of 1 active compound to 281 inactive compounds (1.4%). 57,546 of the compounds have known drug-like properties.
AID456 is a primary screen assay from the Burnham Center for Chemical Genomics for inhibition of TNFa induced VCAM-1 cell surface expression and consists of 9,982 compounds with a ratio of 1 active compound to 368 inactive compounds (0.27% minority). The compounds have been selected for their known drug-like properties and 9,431 meet the Rule of 5 [19].
AID688 is the result of a primary screen for Yeast eIF2B from the Penn Center for Molecular Discovery and contains activity information of 27,198 compounds with a ratio of 1 active compound to 108 inactive compounds (0.91% minority). The screen is a reporter-gene assay and 25,656 of the compounds have known drug-like properties.
AID373 is a primary screen from the Scripps Research Institute Molecular Screening Center for endothelial differentiation, sphingolipid G-protein-coupled receptor, 3. 59,788 compounds were screened with a ratio of 1 active compound to 963 inactive compounds (0.1%). 57,546 of the compounds screened had known drug-like properties.
AID746 is a primary screen from the Scripps Research Institute Molecular Screening Center for Mitogen-activated protein kinase. 59,788 compounds were screened with a ratio of 1 active compound to 162 inactive compounds (0.61%). 57,546 of the compounds screened had known drug-like properties.
AID687 is the result of a primary screen for coagulation factor XI from the Penn Center for Molecular Discovery and contains activity information of 33,067 compounds with a ratio of 1 active compound to 350 inactive compounds (0.28% minority). 30,353 of the compounds screened had known drug-like properties.
AID1608, from the National Institute of Neurological Disorders and Stroke Approved Drug Program, is a different type of screening assay, used to identify compounds that prevent HttQ103-induced cell death. Compounds that prevent the release of a certain chemical into the growth medium are labelled as active, and the remaining compounds are labelled as having inconclusive activity. AID1608 is a small dataset with 1,033 compounds and a ratio of 1 active to 14 inconclusive compounds (6.58% minority class).
The remaining assays, for which the source provides no descriptions, are AID644, AID1284, AID439, and AID721.
Original study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/
Data downloaded from the UCI ML repository:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
...