CPSC 2018
The first dataset is a preprocessed version of the CPSC 2018 dataset, which contains 6877 ECG recordings. We preprocessed the dataset by resampling the ECG signals to 250 Hz and equalizing the ECG signal length to 60 seconds, yielding a signal length of T=15,000 data points per recording. For the hyperparameter study, we employed a fixed train-valid-test split with ratio 60-20-20, while for the final evaluations, including the comparison with state-of-the-art methods and the ablation studies, we used a 10-fold cross-validation strategy. The raw CPSC 2018 dataset can be downloaded from the website of the PhysioNet/Computing in Cardiology Challenge 2020. (License: Creative Commons Attribution 4.0 International Public License.)

PTB-XL (Super-Diag.)
The second dataset is a pre-processed version of PTB-XL, a large multi-label dataset of 21,799 clinical 12-lead ECG records of 10 seconds each. PTB-XL contains 71 ECG statements, categorized into 44 diagnostic, 19 form, and 12 rhythm classes. In addition, the diagnostic category can be divided into 24 sub-classes and 5 coarse-grained super-classes. In our pre-processed version, we utilize the super-diagnostic labels for classification and the recommended train-valid-test splits, sampled at 100 Hz. We select only samples with at least one label in the super-diagnostic category, without applying any further preprocessing. The raw PTB-XL dataset can be downloaded from the PhysioNet/PTB-XL website. (License: Creative Commons Attribution 4.0 International Public License.)
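A minimal sketch of the described preprocessing (resample to 250 Hz, fix length to 60 s, i.e., T = 15,000 samples), assuming scipy-based resampling and zero-padding; the function and variable names are illustrative, not the pipeline's actual code:

```python
import numpy as np
from scipy.signal import resample

TARGET_FS = 250          # Hz
TARGET_SECONDS = 60      # s
TARGET_LEN = TARGET_FS * TARGET_SECONDS  # 15,000 samples

def preprocess_ecg(signal: np.ndarray, fs: int) -> np.ndarray:
    """signal: (n_leads, n_samples) raw ECG sampled at fs Hz."""
    # Resample each lead to 250 Hz.
    n_resampled = int(signal.shape[1] * TARGET_FS / fs)
    signal = resample(signal, n_resampled, axis=1)
    # Equalize length: zero-pad short recordings, truncate long ones.
    if signal.shape[1] < TARGET_LEN:
        pad = TARGET_LEN - signal.shape[1]
        signal = np.pad(signal, ((0, 0), (0, pad)))
    return signal[:, :TARGET_LEN]
```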
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Project Documentation: Cucumber Disease Detection
Introduction: As part of the "Cucumber Disease Detection" project, a machine learning model for the automatic detection of diseases in cucumber plants is developed. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. A dataset of pictures of cucumber plants is used to train and test the model.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming, making this a direct real-world application in agriculture.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
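As an illustration of the augmentation step, here is a minimal sketch using torchvision; the specific transforms and parameter values are assumptions, not the project's documented settings:

```python
from torchvision import transforms

# Illustrative augmentation pipeline for 224x224 RGB plant images.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),          # mirror leaves
    transforms.RandomRotation(degrees=15),           # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], # ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```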
Exploratory Data Analysis (EDA)
The dataset was examined using visualizations such as scatter plots and histograms and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
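The following sketch shows what this setup could look like, assuming PyTorch/torchvision and scikit-learn; the 80/20 ratio, the MobileNet backbone choice, and the variable names (image_paths, labels) are illustrative assumptions:

```python
import torch.nn as nn
from torchvision import models
from sklearn.model_selection import train_test_split

# Stratified split of image paths and labels (assumed variables).
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42)

# Pre-trained backbone with a new 2-class head (healthy vs. diseased).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, 2)
```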
Model Development
The CNN architecture is defined by its layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
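A minimal training-loop sketch with early stopping and checkpointing is given below; the optimizer, learning rate, patience, and the train_loader/val_loader/evaluate helpers are assumptions for illustration (weight_decay provides the L2 regularization noted above):

```python
import torch

# model: the MobileNet from the previous sketch (assumed).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()
best_val_loss, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(100):
    model.train()
    for images, targets in train_loader:   # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)  # assumed validation helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping
```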
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.
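These metrics can be computed with scikit-learn, as in the following sketch (y_true and y_pred are assumed arrays of binary ground-truth and predicted labels):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```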
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include model performance and disease-detection precision. A comparison of the models employed shows the benefits and drawbacks of each. Challenges faced throughout the project and the methods used to solve them are also discussed.
Conclusion
Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakage
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106-550) were created from FB15k-237; the wikikg dataset was created from the OGB WikiKG 2 graph. In all datasets, the inference graph extends the training graph with new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task. The wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files.
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is intended for evaluation in an inference-only regime, pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
Average customer churn is 27%. Churn can take place when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
Run library(randomForest). Here we are using the default ntree (500) and the default mtry; for classification this is sqrt(p), where p is the number of independent variables (p/3 is the regression default).
Through the confusion matrix, the accuracy comes to 79.27%, marginally higher than the decision tree's 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".
Plot the model showing which variables reduce the Gini impurity the most and least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
Tune the model: mtry = 2 has the lowest OOB error rate.
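The original analysis uses R's randomForest/tuneRF; an equivalent sketch of tuning mtry via the OOB error in Python/scikit-learn (with X, y as an assumed feature matrix and churn labels) would be:

```python
from sklearn.ensemble import RandomForestClassifier

for max_features in [2, 4, 6, 8]:  # mtry candidates
    rf = RandomForestClassifier(n_estimators=500,
                                max_features=max_features,
                                oob_score=True, random_state=42)
    rf.fit(X, y)
    print(max_features, "OOB error:", 1 - rf.oob_score_)
```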
Use random forest with mtry = 2 and ntree = 200
Through the confusion matrix, the accuracy comes to 79.71%, marginally higher than the default model's 79.27% (ntree = 500, mtry = 4) and the decision tree's 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HME21 is an atomic structure dataset intended for neural network potential development. It was created during the development of PFP, a universal neural network potential for material discovery [1]. Each structure contains multiple elements and was sampled through high-temperature molecular dynamics simulations. There are a total of 37 elements in the HME21 dataset: H, Li, C, N, O, F, Na, Mg, Al, Si, P, S, Cl, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, Ru, Rh, Pd, Ag, In, Sn, Ba, Ir, Pt, Au, and Pb. Energies and forces were calculated by spin-polarized DFT using the PBE exchange-correlation functional as implemented in VASP [2] version 5.4.4. All structures are under periodic boundary conditions. For the details of the DFT calculation conditions and the structure sampling method, please see reference [1]. Please cite reference [1] if you use this dataset.
Files
HME21 consists of three files in extxyz format:
train.xyz: 19956 structures
valid.xyz: 2498 structures
test.xyz: 2495 structures
The structures were randomly split into training, validation, and test sub-datasets at a ratio of 8:1:1, and these are used as the training, validation, and test datasets for the neural network potential benchmark [1]. The target values are energy and atomic forces. The energy is shifted such that the energy of a single atom located in a vacuum becomes zero. Lengths are in angstroms (10^−10 m), and energies are in electronvolts (eV). As supplementary data, vasp_shift_energies.json, which contains the reference energy of a single atom for each element, is also included.
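The dataset does not prescribe a reader, but the extxyz files can be loaded, for example, with ASE; a minimal sketch:

```python
from ase.io import read

structures = read("train.xyz", index=":")  # list of Atoms objects
atoms = structures[0]
energy = atoms.get_potential_energy()      # shifted energy, in eV
forces = atoms.get_forces()                # atomic forces, in eV/Å
print(len(structures), energy, forces.shape)
```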
MIT License: https://opensource.org/licenses/MIT
The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot problem. First, we take the original train/test splits of the F-VQA dataset and combine them, keeping only the triples whose answers appear in the top 500 by occurrence frequency. Next, we randomly divide this set of answers into a new training (a.k.a. seen) split $\mathcal{A}_s$ and a testing (a.k.a. unseen) split $\mathcal{A}_u$ at a ratio of 1:1. In line with the standard F-VQA dataset, the division process is repeated 5 times. Each $(i,q,a)$ triplet in the original F-VQA dataset is assigned to the training set if $a \in \mathcal{A}_s$, and to the testing set otherwise. The overlap of answer instances between the training and testing sets is 2,565 in F-VQA, compared to 0 in ZS-F-VQA.
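A minimal sketch of this answer-based zero-shot split; the variable names and the handling of the top-500 cutoff are our assumptions for illustration:

```python
import random
from collections import Counter

def zero_shot_split(triples, top_k=500, seed=0):
    """triples: list of (image, question, answer) tuples."""
    # Keep only triples whose answer is among the top_k most frequent.
    top_answers = [a for a, _ in
                   Counter(a for _, _, a in triples).most_common(top_k)]
    rng = random.Random(seed)
    rng.shuffle(top_answers)
    seen = set(top_answers[: top_k // 2])          # A_s (seen answers)
    kept = [t for t in triples if t[2] in set(top_answers)]
    train = [t for t in kept if t[2] in seen]       # a in A_s
    test = [t for t in kept if t[2] not in seen]    # a in A_u
    return train, test
```

Running this with different seeds reproduces the "repeated 5 times" protocol, and by construction no answer appears in both splits.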
GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
NaNa (Name to Nationality) is a simple dataset that maps names to nationalities. It contains {train,dev,test}.{src,tgt} files. Each line in the *.src files contains a name, and the corresponding line in the *.tgt files contains its nationality.
I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.
STEP 2. Iterated over all pages and collected the title and the nationality. I regarded the title as a person if the Category section at the bottom of the page included "... births", and identified the person's nationality from the most frequent nationality word in that section.
STEP 3. Randomly split the data into train/dev/test in the ratio of 8:1:1 within each nationality group.
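A minimal sketch of such a per-nationality 8:1:1 split (the data structures are assumed); the resulting per-nationality counts follow in the table below:

```python
import random
from collections import defaultdict

def split_by_nationality(pairs, seed=0):
    """pairs: list of (name, nationality) tuples."""
    groups = defaultdict(list)
    for name, nat in pairs:
        groups[nat].append(name)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for nat, names in groups.items():
        rng.shuffle(names)
        n = len(names)
        n_train, n_dev = int(n * 0.8), int(n * 0.1)
        train += [(x, nat) for x in names[:n_train]]
        dev   += [(x, nat) for x in names[n_train:n_train + n_dev]]
        test  += [(x, nat) for x in names[n_train + n_dev:]]
    return train, dev, test
```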
| Nationality | Train | Dev | Test |
|--|--|--|--|
| Total 1112902 | 890248 | 111286 | 111368 |
| Afghan | 778 | 97 | 98 |
| Albanian | 2193 | 274 | 275 |
| Algerian | 1592 | 199 | 200 |
| American | 241772 | 30221 | 30222 |
| Andorran | 188 | 24 | 24 |
| Angolan | 504 | 63 | 63 |
| Argentine | 8926 | 1116 | 1116 |
| Armenian | 1600 | 200 | 201 |
| Aruban | 93 | 12 | 12 |
| Australian | 40536 | 5067 | 5067 |
| Austrian | 9192 | 1149 | 1149 |
| Azerbaijani | 1331 | 166 | 167 |
| Bahamian | 233 | 29 | 30 |
| Bahraini | 237 | 30 | 30 |
| Bangladeshi | 1636 | 204 | 205 |
| Barbadian | 372 | 47 | 47 |
| Basque | 961 | 120 | 121 |
| Belarusian | 2338 | 292 | 293 |
| Belgian | 7907 | 988 | 989 |
| Belizean | 148 | 19 | 19 |
| Beninese | 199 | 25 | 25 |
| Bermudian | 270 | 34 | 34 |
| Bhutanese | 144 | 18 | 18 |
| Bolivian | 657 | 82 | 83 |
| Bosniak | 81 | 10 | 11 |
| Botswana | 252 | 31 | 32 |
| Brazilian | 11234 | 1404 | 1405 |
| Breton | 118 | 15 | 15 |
| British | 45922 | 5740 | 5741 |
| Bruneian | 115 | 14 | 15 |
| Bulgarian | 3926 | 491 | 491 |
| Burkinabé | 289 | 36 | 37 |
| Burmese | 944 | 118 | 118 |
| Burundian | 140 | 17 | 18 |
| Cambodian | 360 | 45 | 46 |
| Cameroonian | 1028 | 129 | 129 |
| Canadian | 34152 | 4269 | 4270 |
| Catalan | 1717 | 215 | 215 |
| Chadian | 139 | 17 | 18 |
| Chilean | 2838 | 355 | 355 |
| Chinese | 9494 | 1187 | 1187 |
| Colombian | 2620 | 328 | 328 |
| Comorian | 54 | 7 | 7 |
| Congolese | 35 | 4 | 5 |
| Cuban | 1938 | 242 | 243 |
| Cypriot | 1016 | 127 | 128 |
| Czech | 7244 | 906 | 906 |
| Dane | 32 | 4 | 5 |
| Djiboutian | 54 | 7 | 7 |
| Dominican | 1580 | 198 | 198 |
| Dutch | 14916 | 1864 | 1865 |
| Ecuadorian | 874 | 109 | 110 |
| Egyptian | 2776 | 347 | 348 |
| Emirati | 621 | 78 | 78 |
| English | 77159 | 9645 | 9645 |
| Equatoguinean | 193 | 24 | 25 |
| Eritrean | 133 | 17 | 17 |
| Estonian | 2028 | 254 | 254 |
| Ethiopian | 733 | 92 | 92 |
| Faroese | 284 | 35 | 36 |
| Filipino | 3928 | 491 | 491 |
| Finn | 68 | 8 | 9 |
| French | 40841 | 5105 | 5106 |
| Gabonese | 180 | 23 | 23 |
| Gambian | 220 | 28 | 28 |
| Georgian | 262 | 33 | 33 |
| German | 42388 | 5299 | 5299 |
| Ghanaian | 2036 | 255 | 255 |
| Gibraltarian | 98 | 12 | 13 |
| Greek | 5975 | 747 | 747 |
| Grenadian | 139 | 17 | 18 |
| Guatemalan | 563 | 70 | 71 |
| Guinean | 584 | 73 | 74 |
| Guyanese | 358 | 45 | 45 |
| Haitian | 561 | 70 | 71 |
| Honduran | 500 | 63 | 63 |
| Hungarian | 7220 | 903 | 903 |
| I-Kiribati | 40 | 5 | 6 |
| Indian | 22692 | 2836 | 2837 |
| Indonesian | 2820 | 352 | 353 |
| Iranian | 5010 | 626 | 627 |
| Iraqi | 1252 | 157 | 157 |
| Irish | 11844 | 1481 | 1481 |
| Israeli | 5149 | 644 | 644 |
| Italian | 29336 | 3667 | 3668 |
| Jamaican | 1422 | 178 | 178 |
| Japanese | 21216 | 2652 | 2652 |
| Jordanian | 490 | 61 | 62 |
| Kazakh | 24 | 3 | 4 |
| Kenyan | 1609 | 201 | 202 |
| Korean | 7896 | 987 | 988 |
| Kuwaiti | 396 | 50 | 50 |
| Kyrgyz | 16 | 2 | 2 |
| Lao | 26 | 3 | 4 |
| Latvian | 1693 | 212 | 212 |
| Lebanese | 1246 | 156 | 156 |
| Liberian | 294 | 37 | 37 |
| Libyan | 271 | 34 | 34 |
| Lithuanian | 1979 | 247 | 248 |
| Macedonian | 1099 | 137 | 138 |
| Malagasy | 232 | 29 | 29 |
| Malawian | 219 | 27 | 28 |
| Malaysian | 2582 | 323 | 323 |
| Maldivian | 152 | 19 | 20 |
| Malian | 385 | 48 | 49 |
| Maltese | 663 | 83 | 83 |
| Manx | 150 | 19 | 19 |
| Marshallese | 32 | 4 | 4 |
| Mauritanian | 96 | 12 | 12 |
| Mauritian | 263 | 33 | 33 |
| Mexican | 8648 | 1081 | 1081 |
| Moldovan | 1000 | 125 | 125 |
| Mongolian | 504 | 63 | 64 |
| Montenegrin | 955 | 119 | 120 |
| Moroccan | 1457 | 182 | 183 |
| Mozambican | 210 | 26 | 27 |
| Namibian | 588 | 74 | 74 |
| Nauruan | 32 | 4 | 4 |
| Nepalese | 773 | 97 | 97 |
| Nicaraguan | 285 | 36 | 36 |
| Nigerian | 4060 | 507 | 508 |
| Nigerien | 143 | 18 | ... |
MIT License: https://opensource.org/licenses/MIT
This repo contains the data for our paper "SOUL: Towards Sentiment and Opinion Understanding of Language" in EMNLP 2023. Github repo
Statistics
The SOUL dataset comprises 15,028 statements related to 3,638 reviews, resulting in an average of 4.13 statements per review. To create training, development, and test sets, we split the reviews in a ratio of 6:1:3, respectively.
Split
| Split | True | False | Not-given | Total |
|--|--|--|--|--|
| Train | 2,182 | 3,675 | 2,159 | 8,834 |

3,000 8,834… See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/SOUL.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The drug-development process is time-consuming and expensive. In High-Throughput Screening (HTS), batches of compounds are tested against a biological target to measure each compound's ability to bind to it. Targets might be antibodies, for example. If a compound binds to the target, it is active for that target and is known as a hit.
Virtual screening is the computational or in silico screening of biological compounds and complements the HTS process. It is used to aid the selection of compounds for screening in HTS bioassays or for inclusion in a compound-screening library.
Drug discovery is the first stage of the drug-development process and involves finding compounds to test and screen against biological targets. This first stage is known as primary-screening and usually involves the screening of thousands of compounds.
This dataset is a collection of 21 bioassays (screens) that measure the activity of various compounds against different biological targets.
Each bioassay is split into test and train files.
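A hedged sketch for loading one assay's files and checking its class imbalance; the file names and the "Outcome" label column are assumptions about the CSV layout, not documented specifics:

```python
import pandas as pd

train = pd.read_csv("AID362_train.csv")  # assumed file name
test = pd.read_csv("AID362_test.csv")    # assumed file name
counts = train["Outcome"].value_counts() # assumed label column
print(counts)
print("minority class: {:.2%}".format(counts.min() / counts.sum()))
```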
Here are descriptions of some of the assays and their compounds. The source, unfortunately, does not have descriptions for every assay; that is the nature of the beast when finding this kind of data, as was also pointed out in the original study.
AID362 details the results of a primary screening bioassay for formylpeptide receptor ligand binding, from the New Mexico Center for Molecular Discovery. It is a relatively small dataset with 4,279 compounds and a ratio of 1 active to 70 inactive compounds (1.4% minority class). The compounds were selected on the basis of preliminary virtual screening of approximately 480,000 drug-like small molecules from Chemical Diversity Laboratories.
AID604 is a primary screening bioassay for Rho kinase 2 inhibitors from the Scripps Research Institute Molecular Screening Center. The bioassay contains activity information of 59,788 compounds with a ratio of 1 active compound to 281 inactive compounds (1.4%). 57,546 of the compounds have known drug-like properties.
AID456 is a primary screen assay from the Burnham Center for Chemical Genomics for inhibition of TNFa induced VCAM-1 cell surface expression and consists of 9,982 compounds with a ratio of 1 active compound to 368 inactive compounds (0.27% minority). The compounds have been selected for their known drug-like properties and 9,431 meet the Rule of 5 [19].
AID688 is the result of a primary screen for Yeast eIF2B from the Penn Center for Molecular Discovery and contains activity information of 27,198 compounds with a ratio of 1 active compound to 108 inactive compounds (0.91% minority). The screen is a reporter-gene assay and 25,656 of the compounds have known drug-like properties.
AID373 is a primary screen from the Scripps Research Institute Molecular Screening Center for endothelial differentiation, sphingolipid G-protein-coupled receptor, 3. 59,788 compounds were screened with a ratio of 1 active compound to 963 inactive compounds (0.1%). 57,546 of the compounds screened had known drug-like properties.
AID746 is a primary screen from the Scripps Research Institute Molecular Screening Center for Mitogen-activated protein kinase. 59,788 compounds were screened with a ratio of 1 active compound to 162 inactive compounds (0.61%). 57,546 of the compounds screened had known drug-like properties.
AID687 is the result of a primary screen for coagulation factor XI from the Penn Center for Molecular Discovery and contains activity information of 33,067 compounds with a ratio of 1 active compound to 350 inactive compounds (0.28% minority). 30,353 of the compounds screened had known drug-like properties.
AID1608, from the National Institute of Neurological Disorders and Stroke Approved Drug Program, is a different type of screening assay, used to identify compounds that prevent HttQ103-induced cell death. Compounds that prevent the release of a certain chemical into the growth medium are labelled as active, and the remaining compounds are labelled as having inconclusive activity. AID1608 is a small dataset with 1,033 compounds and a ratio of 1 active to 14 inconclusive compounds (6.58% minority class).
The remaining assays, for which the source provides no descriptions, are AID644, AID1284, AID439, and AID721.
Original study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/
Data downloaded from the UCI ML repository:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
...