9 datasets found
  1. Preprocessed CPSC and PTB-XL Data

    • explore.openaire.eu
    • figshare.com
    Updated Jan 1, 2024
    Cite
    Vanessa Borst (2024). Preprocessed CPSC and PTB-XL Data [Dataset]. http://doi.org/10.6084/m9.figshare.25532869.v3
    Authors
    Vanessa Borst
    Description

    CPSC 2018

    The first dataset is a preprocessed version of the CPSC 2018 dataset, which contains 6,877 ECG recordings. We preprocessed the dataset by resampling the ECG signals to 250 Hz and equalizing the ECG signal length to 60 seconds, yielding a signal length of T = 15,000 data points per recording. For the hyperparameter study, we employed a fixed train-valid-test split with a 60-20-20 ratio, while for the final evaluations, including the comparison with state-of-the-art methods and the ablation studies, we used a 10-fold cross-validation strategy. The raw CPSC 2018 dataset can be downloaded from the website of the PhysioNet/Computing in Cardiology Challenge 2020. (License: Creative Commons Attribution 4.0 International Public License.)

    PTB-XL (Super-Diag.)

    The second dataset is a preprocessed version of PTB-XL, a large multi-label dataset of 21,799 clinical 12-lead ECG records of 10 seconds each. PTB-XL contains 71 ECG statements, categorized into 44 diagnostic, 19 form, and 12 rhythmic classes. In addition, the diagnostic category can be divided into 24 sub-classes and 5 coarse-grained super-classes. In our preprocessed version, we use the super-diagnostic labels for classification and the recommended train-valid-test splits, sampled at 100 Hz. We select only samples with at least one label in the super-diagnostic category, without applying any further preprocessing. The raw PTB-XL dataset can be downloaded from the PhysioNet/PTB-XL website. (License: Creative Commons Attribution 4.0 International Public License.)
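
    To make the preprocessing concrete, here is a minimal sketch of the resampling and length-equalization steps described above (an illustration under assumptions, not the authors' code: it assumes raw signals at 500 Hz and uses zero-padding to extend short recordings):

    ```python
    import numpy as np
    from scipy.signal import resample_poly

    TARGET_FS = 250               # Hz, per the description above
    TARGET_LEN = 60 * TARGET_FS   # 60 s * 250 Hz = 15,000 samples per lead

    def preprocess_recording(sig, orig_fs=500):
        """Resample a (leads, samples) ECG array to 250 Hz and fix its length to 60 s."""
        sig = resample_poly(sig, TARGET_FS, orig_fs, axis=-1)  # polyphase resampling
        n = sig.shape[-1]
        if n >= TARGET_LEN:
            return sig[..., :TARGET_LEN]                       # truncate long recordings
        pad = [(0, 0)] * (sig.ndim - 1) + [(0, TARGET_LEN - n)]
        return np.pad(sig, pad)                                # zero-pad short recordings
    ```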

  2. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction

    Title: Cucumber Disease Detection

    Introduction: As part of the "Cucumber Disease Detection" project, we develop a machine learning model for the automatic detection of diseases in cucumber plants. This research matters because it tackles early disease identification in agriculture, which can increase crop yield and cut financial losses. To train and test the model, we use a dataset of images of cucumber plants.

    2. Problem Statement

    Problem Definition: The project uses image-analysis methods to automate the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing

    Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and diseased specimens.

    Data Collection: Images were gathered from agricultural areas using cameras and smartphones.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA)

    The dataset was examined using visualizations such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology

    Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.

    Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development

    The CNN architecture consists of convolutional, pooling, and dense layers with activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used. A sketch of one plausible setup follows.
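
    As an illustration of the transfer-learning setup described above, here is a minimal, hypothetical Keras sketch (it assumes a MobileNetV2 backbone and binary healthy/diseased labels; the actual project references Roboflow and YOLO tooling and may differ):

    ```python
    import tensorflow as tf

    # Hypothetical classifier: MobileNetV2 backbone pre-trained on ImageNet,
    # with a new head for healthy-vs-diseased classification.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False  # freeze the backbone for the first training phase

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.3),                    # regularization, as described
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(diseased)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    ```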

    7. Model Training

    During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation

    Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.

    Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion

    Key project findings include model performance and disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project, along with the methods used to solve them.

    10. Conclusion

    A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.

    11. References

    Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository

    https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  3. InductiveQE Datasets

    • zenodo.org
    zip
    Updated Nov 9, 2022
    Cite
    Mikhail Galkin (2022). InductiveQE Datasets [Dataset]. http://doi.org/10.5281/zenodo.7306046
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mikhail Galkin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    InductiveQE datasets

    UPD 2.0: Regenerated datasets free of potential test set leakages

    UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs

    This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106-550) were created from FB15k-237, and the wikikg dataset was created from the OGB WikiKG 2 graph. In all datasets, the inference graphs extend the training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task. The wikikg split has a fixed 133% ratio.

    Each dataset is a zip archive containing 17 files:

    • train_graph.txt (pt for wikikg) - original training graph
    • val_inference.txt (pt) - inference graph (validation split); new nodes in validation are disjoint from the test inference graph
    • val_predict.txt (pt) - missing edges in the validation inference graph to be predicted.
    • test_inference.txt (pt) - inference graph (test split); new nodes in test are disjoint from the validation inference graph
    • test_predict.txt (pt) - missing edges in the test inference graph to be predicted.
    • train/valid/test_queries.pkl - queries of the respective split, 14 query types for fb-derived datasets, 9 types for Wikikg (EPFO-only)
    • *_answers_easy.pkl - easy answers to respective queries that do not require predicting missing links but only edge traversal
    • *_answers_hard.pkl - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed
    • train_answers_val.pkl - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure the faithfulness of trained models
    • train_answers_test.pkl - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure the faithfulness of trained models
    • og_mappings.pkl - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2
    • stats.txt - a small file with dataset stats
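
    A minimal sketch for loading one split's queries and answers from an unzipped dataset (it assumes only Python's standard pickle module; the internal structure of the pkl objects follows the paper's code, so inspect the loaded objects rather than relying on this layout):

    ```python
    import pickle

    def load_split(split="valid"):
        """Load queries plus easy/hard answers for one split of an unzipped dataset."""
        with open(f"{split}_queries.pkl", "rb") as f:
            queries = pickle.load(f)
        with open(f"{split}_answers_easy.pkl", "rb") as f:
            easy = pickle.load(f)    # answers reachable by edge traversal alone
        with open(f"{split}_answers_hard.pkl", "rb") as f:
            hard = pickle.load(f)    # answers that require predicting missing links
        return queries, easy, hard
    ```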

    Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.

    The Wikikg dataset is supposed to be evaluated in the inference-only regime, with models pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.

    Paper pre-print: https://arxiv.org/abs/2210.08008

    The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE

  4. Customer Churn - Decision Tree & Random Forest

    • kaggle.com
    Updated Jul 6, 2023
    Cite
    vikram amin (2023). Customer Churn - Decision Tree & Random Forest [Dataset]. https://www.kaggle.com/datasets/vikramamin/customer-churn-decision-tree-and-random-forest
    Dataset provided by
    Kaggle
    Authors
    vikram amin
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective: Find out which customers will churn and which will not.
    • Methodology: This is a classification problem. We will use a decision tree and a random forest to predict the outcome.
    • Steps Involved
    1. Read the data
    2. Check for data types
    3. Change character vectors to factor vectors, as this is a classification problem
    4. Drop the variable that is not significant for the analysis. We drop "customerID".
    5. Check for missing values. None are found.
    6. Split the data into train and test sets so we can use the train data for building the model and the test data for prediction. We split in an 80-20 ratio (train/test) using the sample function.
    7. Install and load the libraries (rpart, rpart.plot, rattle, RColorBrewer, caret)
    8. Run a decision tree using the rpart function, with Churn as the dependent variable and the other 19 variables as independent variables

    9. Plot the decision tree


    Average customer churn is 27%. Churn can take place when tenure is >= 7.5 and there is no internet service.

    Tuning the model
    1. Define the search grid using the expand.grid function
    2. Set up the control parameters through 5-fold cross-validation
    3. When we print the model, we get the best CP = 0.01 and an accuracy of 79.00%
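
    For readers working outside R, here is a rough scikit-learn analogue of the rpart + caret workflow above (a sketch, not the author's code: the CSV name, column names, and the use of ccp_alpha as a stand-in for rpart's complexity parameter CP are assumptions):

    ```python
    import pandas as pd
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Assumed file/column names for the Telco churn data.
    df = pd.read_csv("Telco-Customer-Churn.csv").drop(columns=["customerID"])
    X = pd.get_dummies(df.drop(columns=["Churn"]))   # one-hot encode categoricals
    y = df["Churn"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    # 5-fold CV over the pruning strength; sklearn's ccp_alpha plays roughly
    # the role of rpart's CP.
    grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                        {"ccp_alpha": [0.0, 0.005, 0.01, 0.02]}, cv=5)
    grid.fit(X_tr, y_tr)
    print(grid.best_params_, grid.score(X_te, y_te))
    ```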


    1. Predict the model
    2. Find out which variables are most and least significant.

    The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.

    USE RANDOM FOREST

    1. Run library(randomForest). Here we use the default ntree (500) and the default mtry, which for classification is the square root of p, where p is the number of independent variables.

    From the confusion matrix, accuracy comes to 79.27%, marginally higher than the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".

    2. Plot the model to show which variables reduce the Gini impurity the most and the least. Total Charges and tenure reduce the Gini impurity the most, while Phone Service has the least impact.


    1. Predict the model and create a new data frame showing the actual vs. predicted values


    1. Plot the model to find where the OOB (out-of-bag) error stops decreasing or becomes constant. The error stops decreasing between 100 and 200 trees, so we take ntree = 200 when tuning the model.


    Tune the model: mtry = 2 has the lowest OOB error rate.


    Use random forest with mtry = 2 and ntree = 200


    From the confusion matrix, accuracy comes to 79.71%, marginally higher than the default model's 79.27% (ntree = 500, mtry = 4) and the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".

  5. High-temperature multi-element 2021 (HME21) dataset

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    So Takamoto; Chikashi Shinagawa; Daisuke Motoki; Kosuke Nakago; Wenwen Li; Iori Kurata; Taku Watanabe; Yoshihiro Yayama; Hiroki Iriguchi; Yusuke Asano; Tasuku Onodera; Takafumi Ishii; Takao Kudo; Hideki Ono; Ryohto Sawada; Ryuichiro Ishitani; Marc Ong; Taiki Yamaguchi; Toshiki Kataoka; Akihide Hayashi; Nontawat Charoenphakdee; Takeshi Ibuka (2023). High-temperature multi-element 2021 (HME21) dataset [Dataset]. http://doi.org/10.6084/m9.figshare.19658538.v2
    Dataset provided by
    figshare
    Authors
    So Takamoto; Chikashi Shinagawa; Daisuke Motoki; Kosuke Nakago; Wenwen Li; Iori Kurata; Taku Watanabe; Yoshihiro Yayama; Hiroki Iriguchi; Yusuke Asano; Tasuku Onodera; Takafumi Ishii; Takao Kudo; Hideki Ono; Ryohto Sawada; Ryuichiro Ishitani; Marc Ong; Taiki Yamaguchi; Toshiki Kataoka; Akihide Hayashi; Nontawat Charoenphakdee; Takeshi Ibuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HME21 is an atomic structure dataset aimed at neural network potential development. It was created during the development of PFP, a universal neural network potential for material discovery [1]. It contains multiple elements in a single structure and was sampled through a high-temperature molecular dynamics simulation. There are a total of 37 elements in the HME21 dataset: H, Li, C, N, O, F, Na, Mg, Al, Si, P, S, Cl, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, Ru, Rh, Pd, Ag, In, Sn, Ba, Ir, Pt, Au, and Pb. The structures were calculated by spin-polarized DFT using the PBE exchange-correlation functional implemented in VASP [2] version 5.4.4. All structures are under periodic boundary conditions. For the details of the DFT calculation conditions and the structure sampling method, please see reference [1]. Please cite reference [1] if you use this dataset.

    Files

    HME21 consists of three files in extxyz format:

    • train.xyz: 19,956 structures
    • valid.xyz: 2,498 structures
    • test.xyz: 2,495 structures

    The structures were randomly split into training, validation, and test sub-datasets at a ratio of 8:1:1. They are used as the training, validation, and test datasets for the benchmark of neural network potentials [1]. The target values are energy and atomic forces. The energy is shifted such that the energy of a single atom located in a vacuum becomes zero. Lengths are in angstroms (10^−10 m) and energies in electronvolts (eV). As supplementary data, vasp_shift_energies.json, which contains the reference energy of a single atom for each element, is also included.
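
    A minimal sketch for reading the extxyz files with ASE (assuming `pip install ase` and that the files store energies and forces in the standard extxyz fields ASE recognizes):

    ```python
    from ase.io import read

    train = read("train.xyz", index=":")   # index=":" loads every structure
    atoms = train[0]                       # an ase.Atoms object
    energy = atoms.get_potential_energy()  # shifted total energy, in eV
    forces = atoms.get_forces()            # per-atom forces, in eV/Å
    print(len(train), energy, forces.shape)
    ```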

  6. ZS-F-VQA

    • opendatalab.com
    zip (8,251,200 bytes)
    Updated Sep 22, 2022
    Cite
    Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies (2022). ZS-F-VQA [Dataset]. https://opendatalab.com/OpenDataLab/ZS-F-VQA
    Dataset provided by
    Huawei
    Zhejiang University
    University of Oxford
    University of Edinburgh
    Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot setting. First, we take the original train/test splits of the F-VQA dataset and combine them, keeping only the triples whose answers are among the top 500 by occurrence frequency. Next, we randomly divide this set of answers into a new training (a.k.a. seen) split $\mathcal{A}_s$ and a testing (a.k.a. unseen) split $\mathcal{A}_u$ at a ratio of 1:1. With reference to the F-VQA standard dataset, the division process is repeated 5 times. Each $(i,q,a)$ triplet in the original F-VQA dataset is assigned to the training set if $a \in \mathcal{A}_s$; otherwise it is assigned to the testing set. The overlap of answer instances between the training and testing sets is 2,565 in F-VQA, compared to 0 in ZS-F-VQA.
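
    A small sketch of the seen/unseen answer split described above (an illustration under assumptions, not the authors' code; triples are assumed to be (image, question, answer) tuples, and the 5 repetitions correspond to 5 random seeds):

    ```python
    import random
    from collections import Counter

    def zero_shot_split(triples, top_k=500, seed=0):
        """Split (image, question, answer) triples so train/test answers are disjoint."""
        freq = Counter(a for _, _, a in triples)
        top = [a for a, _ in freq.most_common(top_k)]  # answers kept after filtering
        random.Random(seed).shuffle(top)
        seen = set(top[: top_k // 2])                  # A_s: seen / training answers
        unseen = set(top[top_k // 2:])                 # A_u: unseen / testing answers
        train = [t for t in triples if t[2] in seen]
        test = [t for t in triples if t[2] in unseen]
        return train, test
    ```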

  7. NaNa Dataset

    • kaggle.com
    Updated Jun 21, 2020
    Cite
    Kyubyong Park (2020). NaNa Dataset [Dataset]. https://www.kaggle.com/datasets/bryanpark/nana-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kyubyong Park
    License

    GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    NaNa Dataset

    NaNa (Name to Nationality) is a simple dataset that maps names to nationalities. It contains {train,dev,test}.{src,tgt} files. Each line in the *.src files contains a name, and the corresponding line in the *.tgt files contains its associated nationality.

    Construction

    I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.

    • STEP 1. Downloaded and extracted the 20200601 English wiki dump (enwiki-20200601-pages-articles.xml).
    • STEP 2. Iterated over all pages and collected the title and the nationality. I regarded the title as a person if the Category section at the bottom of the page included "... births" (green rectangle), and identified their nationality from the most frequent nationality word in the section (red rectangles).

    • STEP 3. Randomly split the data into train/dev/test in the ratio of 8:1:1 within each nationality group.
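
    A minimal sketch of STEP 3's per-nationality 8:1:1 split (assuming (name, nationality) pairs; integer rounding at group boundaries explains the slightly uneven Dev/Test counts in the table below):

    ```python
    import random
    from collections import defaultdict

    def split_by_nationality(pairs, seed=0):
        """pairs: list of (name, nationality). Returns train/dev/test at 8:1:1
        within each nationality group."""
        groups = defaultdict(list)
        for name, nat in pairs:
            groups[nat].append((name, nat))
        rng = random.Random(seed)
        train, dev, test = [], [], []
        for items in groups.values():
            rng.shuffle(items)
            n_train, n_dev = int(len(items) * 0.8), int(len(items) * 0.1)
            train += items[:n_train]
            dev += items[n_train:n_train + n_dev]
            test += items[n_train + n_dev:]
        return train, dev, test
    ```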

    Stats

    | Nationality | Train | Dev | Test |
    |--|--|--|--|
    | Total (1112902) | 890248 | 111286 | 111368 |
    | Afghan | 778 | 97 | 98 |
    | Albanian | 2193 | 274 | 275 |
    | Algerian | 1592 | 199 | 200 |
    | American | 241772 | 30221 | 30222 |
    | Andorran | 188 | 24 | 24 |
    | Angolan | 504 | 63 | 63 |
    | Argentine | 8926 | 1116 | 1116 |
    | Armenian | 1600 | 200 | 201 |
    | Aruban | 93 | 12 | 12 |
    | Australian | 40536 | 5067 | 5067 |
    | Austrian | 9192 | 1149 | 1149 |
    | Azerbaijani | 1331 | 166 | 167 |
    | Bahamian | 233 | 29 | 30 |
    | Bahraini | 237 | 30 | 30 |
    | Bangladeshi | 1636 | 204 | 205 |
    | Barbadian | 372 | 47 | 47 |
    | Basque | 961 | 120 | 121 |
    | Belarusian | 2338 | 292 | 293 |
    | Belgian | 7907 | 988 | 989 |
    | Belizean | 148 | 19 | 19 |
    | Beninese | 199 | 25 | 25 |
    | Bermudian | 270 | 34 | 34 |
    | Bhutanese | 144 | 18 | 18 |
    | Bolivian | 657 | 82 | 83 |
    | Bosniak | 81 | 10 | 11 |
    | Botswana | 252 | 31 | 32 |
    | Brazilian | 11234 | 1404 | 1405 |
    | Breton | 118 | 15 | 15 |
    | British | 45922 | 5740 | 5741 |
    | Bruneian | 115 | 14 | 15 |
    | Bulgarian | 3926 | 491 | 491 |
    | Burkinabé | 289 | 36 | 37 |
    | Burmese | 944 | 118 | 118 |
    | Burundian | 140 | 17 | 18 |
    | Cambodian | 360 | 45 | 46 |
    | Cameroonian | 1028 | 129 | 129 |
    | Canadian | 34152 | 4269 | 4270 |
    | Catalan | 1717 | 215 | 215 |
    | Chadian | 139 | 17 | 18 |
    | Chilean | 2838 | 355 | 355 |
    | Chinese | 9494 | 1187 | 1187 |
    | Colombian | 2620 | 328 | 328 |
    | Comorian | 54 | 7 | 7 |
    | Congolese | 35 | 4 | 5 |
    | Cuban | 1938 | 242 | 243 |
    | Cypriot | 1016 | 127 | 128 |
    | Czech | 7244 | 906 | 906 |
    | Dane | 32 | 4 | 5 |
    | Djiboutian | 54 | 7 | 7 |
    | Dominican | 1580 | 198 | 198 |
    | Dutch | 14916 | 1864 | 1865 |
    | Ecuadorian | 874 | 109 | 110 |
    | Egyptian | 2776 | 347 | 348 |
    | Emirati | 621 | 78 | 78 |
    | English | 77159 | 9645 | 9645 |
    | Equatoguinean | 193 | 24 | 25 |
    | Eritrean | 133 | 17 | 17 |
    | Estonian | 2028 | 254 | 254 |
    | Ethiopian | 733 | 92 | 92 |
    | Faroese | 284 | 35 | 36 |
    | Filipino | 3928 | 491 | 491 |
    | Finn | 68 | 8 | 9 |
    | French | 40841 | 5105 | 5106 |
    | Gabonese | 180 | 23 | 23 |
    | Gambian | 220 | 28 | 28 |
    | Georgian | 262 | 33 | 33 |
    | German | 42388 | 5299 | 5299 |
    | Ghanaian | 2036 | 255 | 255 |
    | Gibraltarian | 98 | 12 | 13 |
    | Greek | 5975 | 747 | 747 |
    | Grenadian | 139 | 17 | 18 |
    | Guatemalan | 563 | 70 | 71 |
    | Guinean | 584 | 73 | 74 |
    | Guyanese | 358 | 45 | 45 |
    | Haitian | 561 | 70 | 71 |
    | Honduran | 500 | 63 | 63 |
    | Hungarian | 7220 | 903 | 903 |
    | I-Kiribati | 40 | 5 | 6 |
    | Indian | 22692 | 2836 | 2837 |
    | Indonesian | 2820 | 352 | 353 |
    | Iranian | 5010 | 626 | 627 |
    | Iraqi | 1252 | 157 | 157 |
    | Irish | 11844 | 1481 | 1481 |
    | Israeli | 5149 | 644 | 644 |
    | Italian | 29336 | 3667 | 3668 |
    | Jamaican | 1422 | 178 | 178 |
    | Japanese | 21216 | 2652 | 2652 |
    | Jordanian | 490 | 61 | 62 |
    | Kazakh | 24 | 3 | 4 |
    | Kenyan | 1609 | 201 | 202 |
    | Korean | 7896 | 987 | 988 |
    | Kuwaiti | 396 | 50 | 50 |
    | Kyrgyz | 16 | 2 | 2 |
    | Lao | 26 | 3 | 4 |
    | Latvian | 1693 | 212 | 212 |
    | Lebanese | 1246 | 156 | 156 |
    | Liberian | 294 | 37 | 37 |
    | Libyan | 271 | 34 | 34 |
    | Lithuanian | 1979 | 247 | 248 |
    | Macedonian | 1099 | 137 | 138 |
    | Malagasy | 232 | 29 | 29 |
    | Malawian | 219 | 27 | 28 |
    | Malaysian | 2582 | 323 | 323 |
    | Maldivian | 152 | 19 | 20 |
    | Malian | 385 | 48 | 49 |
    | Maltese | 663 | 83 | 83 |
    | Manx | 150 | 19 | 19 |
    | Marshallese | 32 | 4 | 4 |
    | Mauritanian | 96 | 12 | 12 |
    | Mauritian | 263 | 33 | 33 |
    | Mexican | 8648 | 1081 | 1081 |
    | Moldovan | 1000 | 125 | 125 |
    | Mongolian | 504 | 63 | 64 |
    | Montenegrin | 955 | 119 | 120 |
    | Moroccan | 1457 | 182 | 183 |
    | Mozambican | 210 | 26 | 27 |
    | Namibian | 588 | 74 | 74 |
    | Nauruan | 32 | 4 | 4 |
    | Nepalese | 773 | 97 | 97 |
    | Nicaraguan | 285 | 36 | 36 |
    | Nigerian | 4060 | 507 | 508 |
    | Nigerien | 143 | 18 | … |

  8. SOUL

    • huggingface.co
    Updated Jan 24, 2024
    Cite
    Language Technology Lab at Alibaba DAMO Academy (2024). SOUL [Dataset]. https://huggingface.co/datasets/DAMO-NLP-SG/SOUL
    Dataset authored and provided by
    Language Technology Lab at Alibaba DAMO Academy
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repo contains the data for our paper "SOUL: Towards Sentiment and Opinion Understanding of Language" in EMNLP 2023. Github repo

      Statistics
    

    The SOUL dataset comprises 15,028 statements related to 3,638 reviews, resulting in an average of 4.13 statements per review. To create training, development, and test sets, we split the reviews in a ratio of 6:1:3, respectively.

    | Split | Reviews | True | False | Not-given | Total statements |
    |--|--|--|--|--|--|
    | Train | 2,182 | 3,675 | 2,159 | 3,000 | 8,834 |
    | … | | | | | |

    See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/SOUL.
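
    The dataset can presumably be loaded straight from the Hub with the `datasets` library (split and column names are assumptions; print the returned object to see the actual structure):

    ```python
    from datasets import load_dataset

    ds = load_dataset("DAMO-NLP-SG/SOUL")  # repo ID from the citation above
    print(ds)                              # shows the available splits and columns
    ```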

  9. Bioassay Datasets

    • kaggle.com
    zip (50,341,627 bytes)
    Updated Sep 7, 2017
    Cite
    UCI Machine Learning (2017). Bioassay Datasets [Dataset]. https://www.kaggle.com/uciml/bioassay-datasets
    Dataset authored and provided by
    UCI Machine Learning
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The drug-development process is time-consuming and expensive. In High-Throughput Screening (HTS), batches of compounds are tested against a biological target to measure each compound's ability to bind to it. Targets might, for example, be antibodies. If a compound binds to the target, it is active for that target and is known as a hit.

    Virtual screening is the computational or in silico screening of biological compounds and complements the HTS process. It is used to aid the selection of compounds for screening in HTS bioassays or for inclusion in a compound-screening library.

    Drug discovery is the first stage of the drug-development process and involves finding compounds to test and screen against biological targets. This first stage is known as primary-screening and usually involves the screening of thousands of compounds.

    This dataset is a collection of 21 bioassays (screens) that measure the activity of various compounds against different biological targets.

    Content

    Each bioassay is split into test and train files.

    Here are descriptions of some of the assays. The source, unfortunately, does not have descriptions for every assay; that is the nature of finding this kind of data, and it was also pointed out in the original study.
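
    For orientation, a minimal sketch of reading one assay's train file and checking its class imbalance (the file name AID362red_train.csv and the Outcome label column are assumptions about the Kaggle layout; adjust to the actual archive contents):

    ```python
    import pandas as pd

    # Assumed per-assay file and label column.
    train = pd.read_csv("AID362red_train.csv")
    counts = train["Outcome"].value_counts()   # e.g. Active vs. Inactive
    print(counts)
    print("minority class: {:.2%}".format(counts.min() / counts.sum()))
    ```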

    Primary screens

    • AID362 details the results of a primary screening bioassay for Formylpeptide Receptor Ligand Binding from the New Mexico Center for Molecular Discovery. It is a relatively small dataset with 4,279 compounds and a ratio of 1 active to 70 inactive compounds (1.4% minority class). The compounds were selected on the basis of preliminary virtual screening of approximately 480,000 drug-like small molecules from Chemical Diversity Laboratories.

    • AID604 is a primary screening bioassay for Rho kinase 2 inhibitors from the Scripps Research Institute Molecular Screening Center. The bioassay contains activity information for 59,788 compounds with a ratio of 1 active compound to 281 inactive compounds (0.35% minority class). 57,546 of the compounds have known drug-like properties.

    • AID456 is a primary screen assay from the Burnham Center for Chemical Genomics for inhibition of TNFa induced VCAM-1 cell surface expression and consists of 9,982 compounds with a ratio of 1 active compound to 368 inactive compounds (0.27% minority). The compounds have been selected for their known drug-like properties and 9,431 meet the Rule of 5 [19].

    • AID688 is the result of a primary screen for Yeast eIF2B from the Penn Center for Molecular Discovery and contains activity information of 27,198 compounds with a ratio of 1 active compound to 108 inactive compounds (0.91% minority). The screen is a reporter-gene assay and 25,656 of the compounds have known drug-like properties.

    • AID373 is a primary screen from the Scripps Research Institute Molecular Screening Center for endothelial differentiation, sphingolipid G-protein-coupled receptor, 3. 59,788 compounds were screened with a ratio of 1 active compound to 963 inactive compounds (0.1%). 57,546 of the compounds screened had known drug-like properties.

    • AID746 is a primary screen from the Scripps Research Institute Molecular Screening Center for Mitogen-activated protein kinase. 59,788 compounds were screened with a ratio of 1 active compound to 162 inactive compounds (0.61%). 57,546 of the compounds screened had known drug-like properties.

    • AID687 is the result of a primary screen for coagulation factor XI from the Penn Center for Molecular Discovery and contains activity information of 33,067 compounds with a ratio of 1 active compound to 350 inactive compounds (0.28% minority). 30,353 of the compounds screened had known drug-like properties.

    Primary and Confirmatory

    • AID604 (primary) with AID644 (confirmatory)
    • AID746 (primary) with AID1284 (confirmatory)
    • AID373 (primary) with AID439 (confirmatory)
    • AID746 (primary) with AID721 (confirmatory)

    Confirmatory

    • AID1608 is a different type of screening assay, from the National Institute of Neurological Disorders and Stroke Approved Drug Program, that was used to identify compounds that prevent HttQ103-induced cell death. The compounds that prevent a release of a certain chemical into the growth medium are labelled as active, and the remaining compounds are labelled as having inconclusive activity. AID1608 is a small dataset with 1,033 compounds and a ratio of 1 active to 14 inconclusive compounds (6.58% minority class).

    • AID644

    • AID1284

    • AID439

    • AID721

    • AID1608


    Acknowledgements

    Original study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/

    Data downloaded from the UCI ML repository:

    Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

    ...

  10. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
