21 datasets found
  1. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
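
    A minimal indexing sketch (not part of the release), assuming the layout described above: one folder per four-letter species code (plus a noise class), each containing .wav recordings. The root directory name is hypothetical.

        import os
        from collections import Counter

        ROOT = "nabat_training_v1"  # hypothetical extraction directory

        # Count recordings per species folder; folder names are the
        # four-letter species codes described in the release.
        counts = Counter()
        for code in sorted(os.listdir(ROOT)):
            folder = os.path.join(ROOT, code)
            if os.path.isdir(folder):
                counts[code] = sum(1 for f in os.listdir(folder)
                                   if f.lower().endswith(".wav"))

        for code, n in counts.most_common():
            print(f"{code}: {n} recordings")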

  2. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for model training, validation, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need a Python environment such as VS Code or Jupyter Notebook, with tools like the following (a minimal loading sketch follows this list):

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
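
    A minimal loading sketch, assuming the folder and file names described above and a hypothetical label column named "target":

        import pandas as pd

        # Folder/file names follow the conventions described above.
        train = pd.read_csv("Training Data/train_data.csv")
        val = pd.read_csv("Validation Data/validation_data.csv")
        test = pd.read_csv("Test Data/test_data.csv")

        # "target" is a placeholder; substitute the actual label column.
        X_train, y_train = train.drop(columns=["target"]), train["target"]
        print(train.shape, val.shape, test.shape)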

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  3. Materials Science Named Entity Recognition: train/development/test sets

    • figshare.com
    txt
    Updated Jun 4, 2019
    Cite
    Leigh Weston (2019). Materials Science Named Entity Recognition: train/development/test sets [Dataset]. http://doi.org/10.6084/m9.figshare.8184428.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 4, 2019
    Dataset provided by
    figshare
    Authors
    Leigh Weston
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training, development and test sets for supervised named entity recognition for materials science. The data is labelled using the IOB annotation scheme. There exist 7 entity tags: material (MAT), sample descriptor (DSC), symmetry/phase label (SPL), property (PRO), application (APL), synthesis method (SMT), and characterization method (CMT), along with the outside tag (O). The data consists of 800 hand-labelled materials science abstracts. The data has an 80-10-10 split, giving 640 abstracts in the training set, 80 in the development set, and 80 in the test set.
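
    A minimal parsing sketch, assuming the common IOB layout of one token and its tag per line, with blank lines separating abstracts (the exact column layout of this release may differ, and the file name below is hypothetical):

        def read_iob(path):
            # Parse a whitespace-delimited IOB file into (tokens, tags) pairs,
            # assuming one token per line and blank lines between sequences.
            sequences, tokens, tags = [], [], []
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        if tokens:
                            sequences.append((tokens, tags))
                            tokens, tags = [], []
                        continue
                    parts = line.split()
                    tokens.append(parts[0])
                    tags.append(parts[-1])
            if tokens:
                sequences.append((tokens, tags))
            return sequences

        train = read_iob("train.txt")  # hypothetical file name
        print(len(train), "sequences; first tags:", train[0][1][:10])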

  4. Natural Language Inference Evaluation Dataset

    • opendatabay.com
    .undefined
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Natural Language Inference Evaluation Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/abcd24c8-a1a1-4724-83b2-ea07314b8d13
    Explore at:
    Available download formats: .undefined
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The HellaSwag dataset is a highly valuable resource for assessing a machine's sentence completion abilities based on commonsense natural language inference (NLI). It was initially introduced in a paper published at ACL2019. This dataset enables researchers and machine learning practitioners to train, validate, and evaluate models designed to understand and predict plausible sentence completions using common sense knowledge. It is useful for understanding the limitations of current NLI systems and for developing algorithms that reason with common sense.

    Columns

    The dataset includes several key columns:

    • ind: The index of the data point. (Integer)
    • activity_label: The label indicating the activity or event described in the sentence. (String)
    • ctx_a: The first context sentence, providing background information. (String)
    • ctx_b: The second context sentence, providing further background information. (String)
    • endings: A list of possible sentence completions for the given context. (List of Strings)
    • split: The dataset split, such as 'train', 'dev', or 'test'. (String)
    • split_type: The type of split used for dividing the dataset, like 'random' or 'balanced'. (String)
    • source_id: An identifier for the source.
    • label: A label associated with the data point.

    Distribution

    The dataset is typically provided in CSV format and consists of three primary files: train.csv, validation.csv, and test.csv. The train.csv file facilitates the learning process for machine learning models, validation.csv is used to validate model performance, and test.csv enables thorough evaluation of models in completing sentences with common sense. While exact total row counts for the entire dataset are not specified in the provided information, insights into unique values for fields such as activity_label (9965 unique values), source_id (8173 unique values), and split_type (e.g., 'indomain' and 'zeroshot' each accounting for 50%) are available.
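
    A minimal inspection sketch, assuming the train.csv described above and that the endings column may be stored as a serialized list:

        import ast
        import pandas as pd

        train = pd.read_csv("train.csv")

        # Columns documented above: ind, activity_label, ctx_a, ctx_b,
        # endings, split, split_type, source_id, label.
        print(train["activity_label"].nunique())
        print(train["split_type"].value_counts(normalize=True))

        # If 'endings' is stored as a string like "['a', 'b', ...]",
        # parse it back into a Python list first.
        endings = ast.literal_eval(train.loc[0, "endings"])
        print(len(endings), "candidate endings")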

    Usage

    This dataset is ideal for a variety of applications and use cases:

    • Language Modelling: Training language models to better understand common sense knowledge and improve sentence completion tasks.
    • Common Sense Reasoning: Developing and studying algorithms that can reason and make inferences based on common sense.
    • Machine Performance Evaluation: Assessing the effectiveness of machine learning models in generating appropriate sentence endings given specific contexts and activity labels.
    • Natural Language Inference (NLI): Benchmarking and improving NLI systems by evaluating their ability to predict plausible sentence completions.

    Coverage

    The dataset has a global region scope. It was listed on 17/06/2025. Specific time ranges for the data collection itself or detailed demographic scopes are not provided. The dataset includes various splits (train, dev, test) and split types (random, balanced) to ensure diversity for generalisation testing and fairness evaluation during model development.

    License

    CC0

    Who Can Use It

    The HellaSwag dataset is intended for researchers and machine learning practitioners, who can utilise it to:

    • Train, validate, and evaluate machine learning models for tasks requiring common sense knowledge.
    • Develop and refine algorithms for common sense reasoning.
    • Benchmark and assess the performance and limitations of current natural language inference systems.

    Dataset Name Suggestions

    • HellaSwag: Commonsense NLI
    • Commonsense Sentence Completion Data
    • Natural Language Inference Evaluation Dataset
    • AI Common Sense Benchmark

    Attributes

    Original Data Source: HellaSwag: Commonsense NLI

  5. MedalCare-XL

    • zenodo.org
    • paperswithcode.com
    zip
    Updated Aug 28, 2023
    Cite
    Karli Gillette*; Matthias A.F. Gsell*; Claudia Nagel*; Jule Bender; Benjamin Winkler; Steven E. Williams; Markus Bär; Tobias Schäffter; Olaf Dössel; Gernot Plank; Axel Loewe (2023). MedalCare-XL [Dataset]. http://doi.org/10.5281/zenodo.7293655
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Karli Gillette*; Matthias A.F. Gsell*; Claudia Nagel*; Jule Bender; Benjamin Winkler; Steven E. Williams; Markus Bär; Tobias Schäffter; Olaf Dössel; Gernot Plank; Axel Loewe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess precisely known ground truth labels of the underlying disease (model parameterization) and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECG signals were used to enrich sparse clinical data for machine learning or even replace them completely during training leading to good performance on real-world clinical test data.

    We thus generated a large synthetic database comprising a total of 16,900 12-lead ECGs based on multi-scale electrophysiological simulations, equally distributed over 1 normal healthy control class and 7 pathology classes. The pathological case of myocardial infarction had 6 sub-classes. A comparison of extracted timing and amplitude features between the virtual cohort and a large publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The novel dataset of simulated ECG signals is split into training, validation and test data folds for the development of novel machine learning algorithms and their objective assessment.

    This folder WP2_largeDataset_Noise contains the 12-lead ECGs of 10 seconds length. Each ECG is stored in a separate CSV file with one row per lead (lead order: I, II, III, aVR, aVL, aVF, V1-V6) and one sample per column (sampling rate: 500Hz). Data are split by pathologies (avblock = AV block, lbbb = left bundle branch block, rbbb = right bundle branch block, sinus = normal sinus rhythm, lae = left atrial enlargement, fam = fibrotic atrial cardiomyopathy, iab = interatrial conduction block, mi = myocardial infarction). MI data are further split into subclasses depending on the occlusion site (LAD, LCX, RCA) and transmurality (0.3 or 1.0). Each pathology subclass contains training, validation and testing data (~ 70/15/15 split). Training, validation and testing datasets were defined according to the model with which QRST complexes were simulated, i.e., ECGs calculated with the same anatomical model but different electrophysiological parameters are only present in one of the test, validation and training datasets but never in multiple. Each subfolder also contains a "siginfo.csv" file specifying the respective simulation run for the P wave and the QRST segment that was used to synthesize the 10 second ECG segment. Each signal is available in three variations:
    • run_*_raw.csv contains the synthesized ECG without added noise and without filtering
    • run_*_noise.csv contains the synthesized ECG (unfiltered) with superimposed noise
    • run_*_filtered.csv contains the filtered synthesized ECG (filter settings: highpass cutoff frequency 0.5 Hz, lowpass cutoff frequency 150 Hz, Butterworth filters of order 3)
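
    A minimal reading sketch for one of these files, assuming a hypothetical path of the run_*_filtered.csv form (one row per lead, one sample per column, 500 Hz sampling rate, 10 s length, as described above):

        import numpy as np

        LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
                 "V1", "V2", "V3", "V4", "V5", "V6"]
        FS = 500  # sampling rate in Hz, per the description

        # Hypothetical file path within the release.
        ecg = np.loadtxt("sinus/run_001_filtered.csv", delimiter=",")
        assert ecg.shape == (12, 10 * FS)  # 12 leads x 10 s of samples

        lead_ii = ecg[LEADS.index("II")]
        t = np.arange(lead_ii.size) / FS  # time axis in seconds
        print(t[-1], lead_ii.min(), lead_ii.max())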

    The folder WP2_largeDataset_ParameterFiles contains the parameter files used to simulate the 12-lead ECGs. Parameters are split for atrial and ventricular simulations, which were run independently from one another.
    See Gillette*, Gsell*, Nagel* et al. "MedalCare-XL: 16,900 healthy and pathological electrocardiograms obtained through multi-scale electrophysiological models" for a description of the model parameters.

  6. Student Exam Performance

    • test.dbrepo.tuwien.ac.at
    Updated Jun 15, 2025
    Cite
    Azan (2025). Student Exam Performance [Dataset]. http://doi.org/10.82556/egd2-q295
    Explore at:
    Dataset updated
    Jun 15, 2025
    Authors
    Azan
    Time period covered
    2025
    Description

    This dataset is part of a student exam performance prediction use case and contains structured data such as study hours, attendance percentage, previous exam scores, and final exam scores. The data has been split into subsets (training, validation, and test) for use in a machine learning workflow. Each subset is used for a specific phase of model development: training, tuning, and evaluation. The dataset supports a regression-based model and follows FAIR data principles.

  7. aze_carpet

    • huggingface.co
    Updated May 17, 2025
    Cite
    Emil Niyazov (2025). aze_carpet [Dataset]. https://huggingface.co/datasets/eniyazov/aze_carpet
    Explore at:
    Dataset updated
    May 17, 2025
    Authors
    Emil Niyazov
    Description

    Azerbaijani Carpet Dataset The dataset was collected by ADA University students. Dataset Description This dataset contains images of seven regions of Azerbaijani carpet categories. It is organised into standard training, validation, and test splits to facilitate machine learning model development and evaluation. Features 7 Region Categories: Baku, Karabakh, Shirvan, Guba, Gazakh, Ganja, Shirvan 196 Total Images: Properly split for machine learning tasks Categories Brief descriptions of the… See the full description on the dataset page: https://huggingface.co/datasets/eniyazov/aze_carpet.

  8. Data from: UPAR Dataset

    • paperswithcode.com
    Updated Feb 17, 2025
    Cite
    Andreas Specker; Mickael Cormier; Jürgen Beyerer (2025). UPAR Dataset [Dataset]. https://paperswithcode.com/dataset/upar
    Explore at:
    Dataset updated
    Feb 17, 2025
    Authors
    Andreas Specker; Mickael Cormier; Jürgen Beyerer
    Description

    The Task: The challenge will use an extension of the UPAR Dataset [1], which consists of images of pedestrians annotated for 40 binary attributes. For deployment and long-term use of machine-learning algorithms in a surveillance context, the algorithms must be robust to domain gaps that occur when the environment changes. This challenge aims to spotlight the problem of domain gaps in a real-world surveillance context and highlight the challenges and limitations of existing methods to provide a direction for future research.

    The Dataset: We will use an extension of the UPAR dataset [1]. The challenge dataset consists of the harmonization of three public datasets (PA100K [2], PETA [3], and Market1501-Attributes [4]) and a private test set. 40 binary attributes have been unified between those for which we provide additional annotations. This dataset enables the investigation of PAR methods' generalization ability under different attribute distributions, viewpoints, varying illumination, and low resolution.

    The Tracks: This challenge is split into two tracks associated with semantic pedestrian attributes, such as gender or clothing information: Pedestrian Attribute Recognition (PAR) and attribute-based person retrieval. Both tracks build on the same data sources but will have different evaluation criteria. There are three different dataset splits for both tracks that use different training domains. Each track evaluates how robust a given method is to domain shifts by training on limited data from a specific limited domain and evaluating using data from unseen domains.

    Track 1: Pedestrian Attribute Recognition: The task is to train an attribute classifier that accurately predicts persons’ semantic attributes, such as age or clothing information, under domain shifts.

    Track 2: Attribute-based Person Retrieval: Attribute-based person retrieval aims to find persons in a huge database of images, called the gallery, that match a specific attribute description. The goal of this track is to develop an approach that takes binary attribute queries and gallery images as input and ranks the images according to their similarity to the query.

    The Phases: Each track will be composed of two phases, i.e., the development and test phases. During the development phase, public training data will be released, and participants must submit their predictions concerning a validation set. At the test (final) phase, participants will need to submit their results for the test data, which will be released just a few days before the end of the challenge. As we progress into the test phase, validation annotations will become available together with the test images for the final submission. At the end of the challenge, participants will be ranked using the public test data and additional data that is kept private. It is important to note that this competition involves submitting results and code. Therefore, participants will be required to share their code and trained models after the end of the challenge (with detailed instructions) so that the organizers can reproduce the results submitted at the test phase in a code verification stage. Verified code will be applied to a private test dataset for final ranking. The organizers will evaluate the top submissions on the public leaderboard on the private test set to determine the 3 top winners of the challenge. At the end of the challenge, top-ranked methods that pass the code verification stage will be considered valid submissions and compete for any prize that may be offered.

  9. Data from: Common Phone: A Multilingual Dataset for Robust Acoustic...

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 17, 2022
    Cite
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave (2022). Common Phone: A Multilingual Dataset for Robust Acoustic Modelling [Dataset]. http://doi.org/10.5281/zenodo.5846137
    Explore at:
    Dataset updated
    Jan 17, 2022
    Authors
    Philipp Klumpp; Tomás Arias-Vergara; Paula Andrea Pérez-Toro; Elmar Nöth; Juan Rafael Orozco-Arroyave
    Description

    Release Date: 17.01.22

    Welcome to Common Phone 1.0

    Legal Information

    Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license. As with Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.

    About Common Phone

    This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions, improving the generalization and availability of ML in real-world speech applications. The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:

    Language   Speakers (train/dev/test)   Hours (train/dev/test)
    English    4716 / 771 / 774            14.1 / 2.3 / 2.3
    French     796 / 138 / 135             13.6 / 2.3 / 2.2
    German     1176 / 202 / 206            14.5 / 2.5 / 2.6
    Italian    1031 / 176 / 178            14.6 / 2.5 / 2.5
    Spanish    508 / 88 / 91               16.5 / 3.0 / 3.1
    Russian    190 / 34 / 36               12.7 / 2.6 / 2.8
    Total      8417 / 1409 / 1420          85.8 / 15.2 / 15.5

    The presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation among splits was realized by using only those speakers who had provided age and gender information, which can only be supplied as a registered user on the website. When logged in, the session ID of contributed recordings is always linked to the user, so recordings can be linked to individual speakers; this would not be possible for unregistered users, whose session ID changes if they contribute more than once. During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one per language). In Common Phone, a speaker will only appear in one language.

    The dataset is structured as six top-level directories, one for each language. Each language folder contains:

    • [train|dev|test].csv files listing audio files, the respective speaker ID and a plain-text transcript.
    • meta.csv, providing speaker information: age group, gender, language, accent (if available) and which of the three splits this speaker was assigned to. File names match the corresponding audio file names except for their extension.
    • /grids/, containing phonetic transcription for every audio file in Praat TextGrid format.
    • /mp3/, containing audio files in mp3, identical to those of Common Voice; sampling rates have been preserved and may vary between files.
    • /wav/, containing raw audio files at 16 bits/sample, 16 kHz, single channel. They were created from the original mp3 audio and are provided for convenience; keep in mind that their source had undergone MP3 compression.

    Where does the phonetic annotation come from? Phonetic annotation was computed via BAS Web Services, using the regular pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.

    Why Common Phone?

    • Large number of speakers and varying acoustic conditions to improve robustness of ML models
    • Time-aligned IPA phonetic transcription for every audio sample
    • Gender-balanced and age-group-matched (equal number of female/male speakers in every age group)
    • Support for six different languages to leverage multi-lingual approaches
    • Original MP3 files plus standard WAVE files

    Is there any publication available? Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. A pre-print entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling” is available at https://arxiv.org/abs/2201.05912.
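
    A minimal loading sketch, assuming a hypothetical language-folder name and file-list column name; per the description, the /wav/ files are uniformly 16 kHz, 16-bit, single channel:

        import pandas as pd
        from scipy.io import wavfile

        LANG = "english"  # hypothetical folder name; one folder per language

        train = pd.read_csv(f"{LANG}/train.csv")
        meta = pd.read_csv(f"{LANG}/meta.csv")

        # Column name "file" is an assumption; the CSVs list audio files,
        # speaker IDs and plain-text transcripts.
        first = train.iloc[0]["file"]
        rate, samples = wavfile.read(f"{LANG}/wav/{first}.wav")
        print(rate, samples.shape)  # expect 16000 Hz, mono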

  10. azerbaijani-cuisine

    • huggingface.co
    Updated May 9, 2025
    Cite
    ARMammadli (2025). azerbaijani-cuisine [Dataset]. https://huggingface.co/datasets/ARMammadli/azerbaijani-cuisine
    Explore at:
    Dataset updated
    May 9, 2025
    Authors
    ARMammadli
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Azerbaijani Cuisine Dataset

    A curated image dataset of traditional Azerbaijani dishes for computer vision and image classification tasks.

      Dataset Description
    

    This dataset contains images of five traditional Azerbaijani dish categories. It is organized into standard training, validation, and test splits to facilitate machine learning model development and evaluation.

      Features
    

    5 Food Categories: Dolma, Kebabs, Pakhlava, Plov, and Soups 324 Total Images: Properly… See the full description on the dataset page: https://huggingface.co/datasets/ARMammadli/azerbaijani-cuisine.

  11. Europe PMC Full Text Corpus

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre (2023). Europe PMC Full Text Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.22848380.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with 3 core entity types: Gene/Protein, Disease and Organism.

    Corpus Directory Structure

    annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.

    hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    GROUP0/: contains raw manual annotations made by curator GROUP0.
    GROUP1/: contains raw manual annotations made by curator GROUP1.
    GROUP2/: contains raw manual annotations made by curator GROUP2.

    IOB/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in Inside–Outside–Beginning tagging format.
    dev/: contains IOB format annotations of 45 articles, intended to be used as the dev set in machine learning tasks.
    test/: contains IOB format annotations of 45 articles, intended to be used as the test set in machine learning tasks.
    train/: contains IOB format annotations of 210 articles, intended to be used as the training set in machine learning tasks.

    JSON/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, in JSON format.
    README.md: a detailed description of all the annotation formats.

    articles/: contains the full-text articles annotated in Europe PMC corpus.

    Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    XML/: contains XML articles directly fetched using the Europe PMC Article Restful API.
    README.md: a detailed description of the sentencising and fetching of XML articles.

    docs/: contains related documents that were used for generating the corpus.

    Annotation guideline.pdf: annotation guideline provided to curators to assist the manual annotation.
    demo to molecular conenctions.pdf: annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
    Training set development.pdf: initial document that details the paper selection procedures.

    pilot/: contains annotations and articles that were used in a pilot study.

    annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    articles/: contains the full-text articles annotated in the pilot study.

     Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
     XML/: contains XML articles directly fetched using Europe PMC Article Restful API.
    

    README.md: a detailed description of the sentencising and fetching of XML articles.

    src/: source codes for cleaning annotations and generating IOB files

    metrics/ner_metrics.py: Python script containing SemEval evaluation metrics.
    annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
    generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format.
    generate_json_dataset.py: Python script used to extract annotations to JSON format.
    hypothesis.py: Python script used to fetch raw Hypothes.is annotations.

    License

    CCBY

    Feedback

    For any comment, question, and suggestion, please contact us through helpdesk@europepmc.org or Europe PMC contact page.

  12. ATLAS Top Tagging Open Data Set

    • opendata.cern.ch
    Updated 2022
    + more versions
    Cite
    ATLAS collaboration (2022). ATLAS Top Tagging Open Data Set [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.FG5F.96GA
    Explore at:
    Dataset updated
    2022
    Dataset provided by
    CERN Open Data Portal
    Authors
    ATLAS collaboration
    Description

    Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available data set for the development of Machine Learning (ML) based boosted top tagging algorithms. The data are split into two orthogonal sets, named train and test, stored in the HDF5 file format and containing 42 million and 2.5 million jets respectively. Both sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons). For each jet, the data set contains:

    • The four vectors of constituent particles
    • 15 high level summary quantities evaluated on the jet
    • The four vector of the whole jet
    • A training weight
    • A signal (1) vs background (0) label.

    There is one rule for using this data set: the contribution to a loss function from any jet should always be weighted by the training weight. Apart from this, a model should separate the signal jets from background by whatever means necessary.
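
    A minimal sketch of honoring that rule with a per-jet weighted loss. The HDF5 key and file names below are assumptions for illustration (inspect the actual structure with f.keys()):

        import h5py
        import numpy as np

        with h5py.File("train.h5", "r") as f:  # hypothetical file name
            feats = f["features"][:100000]   # e.g. the 15 high-level quantities
            labels = f["labels"][:100000]    # 1 = top signal, 0 = background
            weights = f["weights"][:100000]  # per-jet training weight

        def weighted_bce(p, y, w):
            # Binary cross-entropy in which every jet's contribution is
            # scaled by its training weight, as the data set requires.
            eps = 1e-7
            p = np.clip(p, eps, 1 - eps)
            return -np.mean(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))

        print(weighted_bce(np.full(labels.shape, 0.5), labels, weights))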

    Updated on July 26th 2024. This dataset has been superseded by a new dataset which also includes systematic uncertainties. Please use the new dataset instead of this one.

  13. Data from: Machine Learning Accelerates the Discovery of Design Rules and...

    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Aditya Nandy; Jiazhou Zhu; Jon Paul Janet; Chenru Duan; Rachel B. Getman; Heather J. Kulik (2023). Machine Learning Accelerates the Discovery of Design Rules and Exceptions in Stable Metal–Oxo Intermediate Formation [Dataset]. http://doi.org/10.1021/acscatal.9b02165.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Aditya Nandy; Jiazhou Zhu; Jon Paul Janet; Chenru Duan; Rachel B. Getman; Heather J. Kulik
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metal–oxo moieties are important catalytic intermediates in the selective partial oxidation of hydrocarbons and in water splitting. Stable metal–oxo species have reactive properties that vary depending on the spin state of the metal, complicating the development of structure–property relationships. To overcome these challenges, we train machine-learning (ML) models capable of predicting metal–oxo formation energies across a range of first-row metals, oxidation states, and spin states. Using connectivity-only features tailored for inorganic chemistry as inputs to kernel ridge regression or artificial neural network (ANN) ML models, we achieve good mean absolute errors (4–5 kcal/mol) on set-aside test data across a range of ligand orientations. Analysis of feature importance for oxo formation energy prediction reveals the dominance of nonlocal, electronic ligand properties in contrast to other transition metal complex properties (e.g., spin-state or ionization potential). We enumerate the theoretical catalyst space with an ANN, revealing expected trends in oxo formation energetics, such as destabilization of the metal–oxo species with increasing d-filling, as well as exceptions, such as weak correlations with indicators of oxidative stability of the metal in the resting state or unexpected spin-state dependence in reactivity. We carry out uncertainty-aware evolutionary optimization using the ANN to explore a >37 000 candidate catalyst space. New metal and oxidation state combinations are uncovered and validated with density functional theory (DFT), including counterintuitive oxo formation energies for oxidatively stable complexes. This approach doubles the density of confirmed DFT leads in originally sparsely populated regions of property space, highlighting the potential of ML-model-driven discovery to uncover catalyst design rules and exceptions.
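
    A toy sketch of the kernel-ridge-regression step on synthetic stand-in data; the paper's connectivity-only descriptors and DFT formation energies are not reproduced here, so the features and targets below are placeholders:

        import numpy as np
        from sklearn.kernel_ridge import KernelRidge
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 30))  # stand-in connectivity descriptors
        y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)  # stand-in energies

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1e-2).fit(X_tr, y_tr)
        print("set-aside test MAE:", np.abs(model.predict(X_te) - y_te).mean())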

  14. EyePACS-light (v2) Dataset

    • paperswithcode.com
    Updated Dec 19, 2023
    + more versions
    Cite
    (2023). EyePACS-light (v2) Dataset [Dataset]. https://paperswithcode.com/dataset/eyepacs-light-v2
    Explore at:
    Dataset updated
    Dec 19, 2023
    Description

    This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. This dataset is split into training, validation, and test folders, which contain 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG).
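
    A minimal loading sketch, assuming the split/class folder layout described above (directory names are placeholders):

        import tensorflow as tf

        # One directory per split, each with one subfolder per class
        # (RG and NRG); the paths here are assumptions.
        train_ds = tf.keras.utils.image_dataset_from_directory(
            "eyepacs_light_v2/training", image_size=(512, 512), batch_size=32)
        val_ds = tf.keras.utils.image_dataset_from_directory(
            "eyepacs_light_v2/validation", image_size=(512, 512), batch_size=32)

        print(train_ds.class_names)  # expected: ['NRG', 'RG']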

    Improvements from v1:

    • Increased the image dimensions from 256x256 pixels to 512x512 pixels
    • Swapped the image file format from JPG to PNG
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set
    • Readjusted the train/val/test split
    • Improved sampling from the source dataset

    Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is its accessibility. It requires a long download and a large amount of storage space, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these often contain a large amount of black background, and the dimensions are too large for machine learning inputs. The proposed dataset addresses the aforementioned concerns through image sampling and image standardization, which balance and reduce the dataset size respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed.

    [1] EyePACS-AIROGS; https://airogs.grand-challenge.org/data-and-challenge/

  15. DataSheet4_Application of machine learning to predict unbound drug...

    • frontiersin.figshare.com
    pdf
    Updated Apr 4, 2024
    + more versions
    Cite
    J. Francisco Morales; M. Esperanza Ruiz; Robert E. Stratford; Alan Talevi (2024). DataSheet4_Application of machine learning to predict unbound drug bioavailability in the brain.PDF [Dataset]. http://doi.org/10.3389/fddsv.2024.1360732.s004
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Frontiers
    Authors
    J. Francisco Morales; M. Esperanza Ruiz; Robert E. Stratford; Alan Talevi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: Optimizing brain bioavailability is highly relevant for the development of drugs targeting the central nervous system. Several pharmacokinetic parameters have been used for measuring drug bioavailability in the brain. The most biorelevant among them is possibly the unbound brain-to-plasma partition coefficient, Kpuu,brain,ss, which relates unbound brain and plasma drug concentrations under steady-state conditions. In this study, we developed new in silico models to predict Kpuu,brain,ss.

    Methods: A manually curated 157-compound dataset was compiled from literature and split into training and test sets using a clustering approach. Additional models were trained with a refined dataset generated by removing known P-gp and/or Breast Cancer Resistance Protein substrates from the original dataset. Different supervised machine learning algorithms have been tested, including Support Vector Machine, Gradient Boosting Machine, k-nearest neighbors, classificatory Partial Least Squares, Random Forest, Extreme Gradient Boosting, Deep Learning and Linear Discriminant Analysis. Good practices of predictive Quantitative Structure-Activity Relationships modeling were followed for the development of the models.

    Results: The best performance on the complete dataset was achieved by Extreme Gradient Boosting, with an accuracy on the test set of 85.1%. A similar estimation of accuracy was observed in a prospective validation experiment, using a small sample of compounds and comparing predicted unbound brain bioavailability with observed experimental data.

    Conclusion: New in silico models were developed to predict the Kpuu,brain,ss of drug candidates. The dataset used in this study is publicly disclosed, so that the models may be reproduced, refined, or expanded, as a useful tool to assist drug discovery processes.

  16. Suspicious Activity Detection Dataset

    • paperswithcode.com
    Updated Mar 29, 2025
    + more versions
    Cite
    (2025). Suspicious Activity Detection Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/suspicious-activity-detection-dataset
    Explore at:
    Dataset updated
    Mar 29, 2025
    Description

    Description:


    This dataset has been meticulously curated to facilitate the development and training of machine learning models for suspicious activity detection, with a primary focus on shoplifting. The dataset is organized into two distinct categories: 'Suspicious' and 'Normal' activities. These classifications are intended to help models differentiate between typical behaviors and actions that may warrant further investigation in a retail setting.


    Structure and Organization

    The dataset is structured into three main directories (train, test, and validation), each containing a balanced distribution of images from both categories. This structured approach ensures that the model is trained effectively, evaluated comprehensively, and validated on a diverse set of scenarios.

    Train Folder: Contains a substantial number of images representing both suspicious and normal activities. This folder serves as the primary dataset for training the model, allowing it to learn and generalize patterns from a wide variety of scenarios.

    Test Folder: Designed for evaluating the model's performance post-training, this folder contains a separate set of labeled images. The test data allows for unbiased performance evaluation, ensuring that the model can generalize well to unseen situations.

    Validation Folder: This additional split is used during the model training process to tune hyperparameters and prevent overfitting by testing the model's accuracy on a smaller, separate dataset before final testing.

    Labels and Annotations

    Each image is accompanied by a corresponding label that indicates whether the activity is 'Suspicious' or 'Normal.' The dataset is fully labeled, making it ideal for supervised learning tasks. Additionally, the labels provide contextual information such as the type of activity or the environment in which it occurred, further enriching the dataset for nuanced model training.

    Use Cases and Applications

    This dataset is particularly valuable for AI applications in the retail industry, where detecting potential shoplifting or suspicious behaviors is crucial for loss prevention. The dataset can be used to train models for:

    Real-Time Surveillance Systems: Integrate AI-driven models into surveillance cameras to detect and alert security personnel to potential threats.

    Retail Analytics: Use the dataset to identify patterns in customer behavior, helping retailers optimize their store layouts or refine security measures.

    Anomaly Detection: Extend the dataset's application beyond shoplifting to other suspicious activities, such as unauthorized access or vandalism in different environments.

    Key Features

    High-Quality Image Data: Each image is captured in various retail environments, providing a broad spectrum of lighting conditions, angles, and occlusions to challenge model performance.

    Detailed Annotations: Beyond simple categorization, each image includes metadata that offers deeper insights, such as activity type, timestamp, and environmental conditions.

    Scalable and Versatile: The dataset's comprehensive structure and annotations make it versatile for use in not only retail but also other security-critical environments like airports or stadiums.

    Conclusion

    This dataset offers a robust foundation for developing advanced machine learning models tailored for real-time activity detection, providing critical tools for retail security, surveillance systems, and anomaly detection applications. With its rich variety of labeled data and organized structure, the Suspicious Activity Detection Dataset serves as a valuable resource for any AI project focused on enhancing safety and security through visual recognition.

    This dataset is sourced from Kaggle.

  17. Health Care Analytics

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abishek Sudarshan
    Description

    Context

    Part of Janatahack Hackathon in Analytics Vidhya

    Content

    The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

    MedCamp organizes health camps in several cities, targeting working people with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between registration and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.

    One of the huge costs in arranging these camps is the amount of inventory that needs to be carried. Carrying more inventory than required incurs unnecessarily high costs; carrying less than required for conducting the medical checks means people end up having a bad experience.

    The Process:

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
    

    Other things to note:

    Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
    

    Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    You need to predict the chances (probability) of having a favourable outcome.
    

    Train / Test split:

    Camps started on or before 31st March 2006 are considered in Train
    Test data is for all camps conducted on or after 1st April 2006.
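
    A minimal sketch of that date-based split, using a hypothetical frame of camps with a start-date column:

        import pandas as pd

        camps = pd.DataFrame({
            "camp_id": [1, 2, 3],
            "start_date": pd.to_datetime(["2005-11-02", "2006-03-31", "2006-04-01"]),
        })

        cutoff = pd.Timestamp("2006-03-31")
        train = camps[camps["start_date"] <= cutoff]  # on or before 31st March 2006
        test = camps[camps["start_date"] > cutoff]    # on or after 1st April 2006
        print(len(train), len(test))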
    

    Acknowledgements

    Credits to AV

    Inspiration

    To share with the data science community to jump start their journey in Healthcare Analytics

  18. Bioassay Datasets

    • kaggle.com
    zip
    Updated Sep 7, 2017
    Cite
    UCI Machine Learning (2017). Bioassay Datasets [Dataset]. https://www.kaggle.com/uciml/bioassay-datasets
    Explore at:
    Available download formats: zip (50341627 bytes)
    Dataset updated
    Sep 7, 2017
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The drug-development process is time-consuming and expensive. In High-Throughput Screening (HTS), batches of compounds are tested against a biological target to test the compound's ability to bind to the target. Targets might be antibodies for example. If the compound binds to the target then it is active for that target and known as a hit.

    Virtual screening is the computational or in silico screening of biological compounds and complements the HTS process. It is used to aid the selection of compounds for screening in HTS bioassays or for inclusion in a compound-screening library.

    Drug discovery is the first stage of the drug-development process and involves finding compounds to test and screen against biological targets. This first stage is known as primary-screening and usually involves the screening of thousands of compounds.

    This dataset is a collection of 21 bioassays (screens) that measure the activity of various compounds against different biological targets.

    Content

    Each bioassay is split into test and train files.
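
    A minimal loading sketch; the file and label-column names below are placeholders for one assay's train/test pair:

        import pandas as pd

        train = pd.read_csv("AID362_train.csv")  # placeholder file names
        test = pd.read_csv("AID362_test.csv")

        # These screens are heavily imbalanced (e.g. ~1.4% actives for
        # AID362), so inspect the class ratio before picking a metric.
        label_col = train.columns[-1]  # assumes the label is the last column
        print(train[label_col].value_counts(normalize=True))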

    Here are descriptions of some of the assays. The source, unfortunately, does not have descriptions for every assay; that is the nature of the beast when finding this kind of data, and it was also pointed out in the original study.

    Primary screens

    • AID362 details the results of a primary screening bioassay for Formylpeptide Receptor Ligand Binding from the New Mexico Center for Molecular Discovery. It is a relatively small dataset with 4279 compounds and with a ratio of 1 active to 70 inactive compounds (1.4% minority class). The compounds were selected on the basis of preliminary virtual screening of approximately 480,000 drug-like small molecules from Chemical Diversity Laboratories.

    • AID604 is a primary screening bioassay for Rho kinase 2 inhibitors from the Scripps Research Institute Molecular Screening Center. The bioassay contains activity information of 59,788 compounds with a ratio of 1 active compound to 281 inactive compounds (1.4%). 57,546 of the compounds have known drug-like properties.

    • AID456 is a primary screen assay from the Burnham Center for Chemical Genomics for inhibition of TNFa induced VCAM-1 cell surface expression and consists of 9,982 compounds with a ratio of 1 active compound to 368 inactive compounds (0.27% minority). The compounds have been selected for their known drug-like properties and 9,431 meet the Rule of 5 [19].

    • AID688 is the result of a primary screen for Yeast eIF2B from the Penn Center for Molecular Discovery and contains activity information of 27,198 compounds with a ratio of 1 active compound to 108 inactive compounds (0.91% minority). The screen is a reporter-gene assay and 25,656 of the compounds have known drug-like properties.

    • AID373 is a primary screen from the Scripps Research Institute Molecular Screening Center for endothelial differentiation, sphingolipid G-protein-coupled receptor, 3. 59,788 compounds were screened with a ratio of 1 active compound to 963 inactive compounds (0.1%). 57,546 of the compounds screened had known drug-like properties.

    • AID746 is a primary screen from the Scripps Research Institute Molecular Screening Center for Mitogen-activated protein kinase. 59,788 compounds were screened with a ratio of 1 active compound to 162 inactive compounds (0.61%). 57,546 of the compounds screened had known drug-like properties.

    • AID687 is the result of a primary screen for coagulation factor XI from the Penn Center for Molecular Discovery and contains activity information of 33,067 compounds with a ratio of 1 active compound to 350 inactive compounds (0.28% minority). 30,353 of the compounds screened had known drug-like properties.

    Primary and Confirmatory

    • AID604 (primary) with AID644 (confirmatory)
    • AID746 (primary) with AID1284 (confirmatory)
    • AID373 (primary) with AID439 (confirmatory)
    • AID746 (primary) with AID721 (confirmatory)

    Confirmatory

    • AID1608 is a different type of screening assay that was used to identify compounds that prevent HttQ103-induced cell death. National Institute of Neurological Disorders and Stroke Approved Drug Program. The compounds that prevent a release of a certain chemical into the growth medium are labelled as active and the remaining compounds are labelled as having inconclusive activity. AID1608 is a small dataset with 1,033 compounds and a ratio of 1 active to 14 inconclusive compounds (6.58% minority class).

    • AID644

    • AID1284

    • AID439

    • AID721

    • AID1608


    Acknowledgements

    Original study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/

    Data downloaded form UCI ML repository:

    Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

    ...

  19. Table_6_Prediction model of preeclampsia using machine learning based...

    • frontiersin.figshare.com
    docx
    Updated Jun 11, 2024
    + more versions
    Cite
    Taishun Li; Mingyang Xu; Yuan Wang; Ya Wang; Huirong Tang; Honglei Duan; Guangfeng Zhao; Mingming Zheng; Yali Hu (2024). Table_6_Prediction model of preeclampsia using machine learning based methods: a population based cohort study in China.docx [Dataset]. http://doi.org/10.3389/fendo.2024.1345573.s006
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Frontiers
    Authors
    Taishun Li; Mingyang Xu; Yuan Wang; Ya Wang; Huirong Tang; Honglei Duan; Guangfeng Zhao; Mingming Zheng; Yali Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Preeclampsia is a disease with an unknown pathogenesis and one of the leading causes of maternal and perinatal morbidity. At present, early identification of high-risk groups and timely intervention with aspirin is an effective method of preventing preeclampsia. This study aims to develop a robust and effective preeclampsia prediction model using machine learning algorithms based on maternal characteristics and biophysical and biochemical markers at 11–13+6 weeks' gestation, providing an effective tool for early screening and prediction of preeclampsia.

    Methods: This study included 5,116 singleton pregnant women who underwent screening for preeclampsia (PE) and fetal aneuploidy in a prospective longitudinal cohort study in China. Maternal characteristics (such as maternal age, height, and pre-pregnancy weight), past medical history, mean arterial pressure, uterine artery pulsatility index, pregnancy-associated plasma protein A (PAPP-A), and placental growth factor (PLGF) were collected as covariates for the prediction model. Five classification algorithms, Logistic Regression, Extra Trees Classifier, Voting Classifier, Gaussian Process Classifier, and Stacking Classifier, were applied for model development, and five-fold cross-validation with an 8:2 train-test split was applied for model validation (a sketch of this scheme follows the description).

    Results: We ultimately included 49 cases of preterm preeclampsia and 161 cases of term preeclampsia from the 4,644 pregnant women in the final analysis. With all covariates included, the Voting Classifier showed better AUC and detection rate at 10% false-positive rate (FPR) than the other algorithms for predicting preterm preeclampsia (AUC = 0.884, DR at 10% FPR = 0.625), while its performance was similar to the other algorithms for all-PE and term-PE prediction. In the prediction of all preeclampsia the contribution of PLGF was higher than that of PAPP-A (11.9% vs. 8.7%), whereas the opposite held for preterm preeclampsia (7.2% vs. 16.5%). Performance for preeclampsia and preterm preeclampsia using machine learning algorithms was similar to that achieved by the Fetal Medicine Foundation competing-risks model under the same predictive factors (AUCs of 0.797 and 0.856 for PE and preterm PE, respectively).

    Conclusions: Our models provide an accessible tool for large-scale population screening and prediction of preeclampsia, helping reduce the disease burden and improve maternal and fetal outcomes.
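
    A minimal sketch of the validation scheme described above, assuming scikit-learn. The dataframe, its column names, and the two-model ensemble are hypothetical stand-ins for illustration, not the authors' actual pipeline:

        import pandas as pd
        from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score, train_test_split

        # Hypothetical preprocessed cohort; column names are illustrative only.
        df = pd.read_csv("cohort.csv")
        X = df[["maternal_age", "mean_arterial_pressure",
                "uterine_artery_pi", "papp_a", "plgf"]]
        y = df["preterm_pe"]

        # 8:2 train-test split, stratified because preterm PE is rare (49/4644).
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)

        # Soft voting averages the member models' predicted probabilities.
        clf = VotingClassifier(
            estimators=[("lr", LogisticRegression(max_iter=1000)),
                        ("et", ExtraTreesClassifier(n_estimators=300))],
            voting="soft")

        # Five-fold cross-validated AUC on the training portion.
        print(cross_val_score(clf, X_train, y_train, cv=5,
                              scoring="roc_auc").mean())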

  20. Elasticity tensors of 10276 crystals from DFT computations

    • zenodo.org
    • data.niaid.nih.gov
    json
    Updated Feb 21, 2024
    Cite
    Mingjian Wen; Matthew K. Horton; Jason M. Munro; Patrick Huck; Kristin A. Persson (2024). Elasticity tensors of 10276 crystals from DFT computations [Dataset]. http://doi.org/10.5281/zenodo.8190849
    Available download formats: json
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mingjian Wen; Matthew K. Horton; Jason M. Munro; Patrick Huck; Kristin A. Persson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paper introducing this dataset

    Wen, M., Horton, M., Munro, J., Huck, P., & Persson, K. (2024). An equivariant graph neural network for the elasticity tensors of all seven crystal systems. Digital Discovery. DOI: https://doi.org/10.1039/D3DD00233K

    This dataset consists of three data files in JSON format. Each file is explained below.

    crystal_elasticity_tensor.json

    DFT-computed elastic tensors of 10276 crystals used for developing the MatTen model. Each record carries the fields below (a minimal loading sketch follows the list).

    structure: crystal structure of the material
    formula_pretty: chemical formula
    crystal_system: crystal system
    elastic_tensor: full fourth-rank elastic tensor
    elastic_tensor_voigt: 6x6 Voigt matrix of the elastic tensor
    split: split of the data into train, validation, and test subsets for model development
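
    A minimal loading sketch, assuming the file deserializes to a list of records with the keys above (the exact top-level layout is an assumption) and that moduli are in GPa, as is conventional for Materials Project elasticity data:

        import json
        import numpy as np

        with open("crystal_elasticity_tensor.json") as f:
            records = json.load(f)  # assumed: list of dicts with the keys above

        for rec in records[:3]:
            voigt = np.asarray(rec["elastic_tensor_voigt"])  # 6x6 Voigt matrix
            # Voigt-average bulk modulus from the upper-left 3x3 block:
            # K_V = (C11 + C22 + C33 + 2*(C12 + C13 + C23)) / 9.
            k_voigt = voigt[:3, :3].sum() / 9.0
            print(rec["formula_pretty"], rec["crystal_system"],
                  rec["split"], round(k_voigt, 1))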


    max_directional_E.json

    New crystals with large maximum directional Young's modulus.

    material_id: Materials Project identifier
    formula_pretty: chemical formula

    structure_original: crystal structure from the Materials Project database
    elastic_tensor_matten_original: MatTen predicted elastic tensor using `structure_original`
    max_directional_E_matten_original: MatTen predicted maximum directional Young's modulus using `structure_original`

    structure: further DFT-optimized structure with a tighter criterion
    elastic_tensor: DFT elastic tensor corresponding to `structure`
    max_directional_E: DFT maximum directional Young's modulus using `structure`
    elastic_tensor_matten: MatTen predicted elastic tensor using `structure`
    max_directional_E_matten: MatTen predicted maximum directional Young's modulus using `structure`


    elemental_cubic_metal_max_E_along_100_direction.json

    New crystals whose maximum directional Young's modulus lies along the [100] direction.

    material_id: Materials Project identifier
    formula_pretty: chemical formula

    structure_original: crystal structure from the Materials Project database
    elastic_tensor_matten: MatTen predicted elastic tensor using `structure_original`
    Delta_S_matten: value of $S_{1111} - S_{1122} - 2S_{2323}$ using `elastic_tensor_matten`

    structure: further DFT-optimized structure with a tighter criterion
    elastic_tensor: DFT elastic tensor corresponding to `structure`
    Delta_S: value of $S_{1111} - S_{1122} - 2S_{2323}$ using `elastic_tensor` (see the sketch after this list)
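
    A minimal sketch of how Delta_S can be evaluated from a stored elastic tensor, assuming a 6x6 Voigt stiffness matrix in the standard ordering (11, 22, 33, 23, 13, 12). For a cubic crystal, a negative Delta_S indicates that the maximum directional Young's modulus lies along [100]; a positive value indicates it lies along [111]:

        import numpy as np

        def delta_s(voigt_stiffness):
            # Invert the 6x6 Voigt stiffness to get the compliance matrix.
            s = np.linalg.inv(np.asarray(voigt_stiffness, dtype=float))
            # Tensor components from Voigt compliance: S_1111 = s[0,0],
            # S_1122 = s[0,1]; shear entries carry a factor of 4, so
            # S_2323 = s[3,3] / 4.
            return s[0, 0] - s[0, 1] - 2 * (s[3, 3] / 4.0)

        # Example: copper-like cubic stiffness (GPa, illustrative values).
        C = np.zeros((6, 6))
        C[:3, :3] = 121.0                    # C12 off-diagonal block
        np.fill_diagonal(C[:3, :3], 168.0)   # C11 on the diagonal
        C[3, 3] = C[4, 4] = C[5, 5] = 75.0   # C44 shear entries
        # Positive (~0.015 1/GPa): Cu is stiffest along [111], not [100].
        print(delta_s(C))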
