100+ datasets found
  1. Data from: Machine Learning Models Identify New Inhibitors for Human OATP1B1...

    • figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Thomas R. Lane; Fabio Urbina; Xiaohong Zhang; Margret Fye; Jacob Gerlach; Stephen H. Wright; Sean Ekins (2023). Machine Learning Models Identify New Inhibitors for Human OATP1B1 [Dataset]. http://doi.org/10.1021/acs.molpharmaceut.2c00662.s002
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Thomas R. Lane; Fabio Urbina; Xiaohong Zhang; Margret Fye; Jacob Gerlach; Stephen H. Wright; Sean Ekins
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The uptake transporter OATP1B1 (SLCO1B1) is largely localized to the sinusoidal membrane of hepatocytes and is a known victim of unwanted drug–drug interactions. Computational models are useful for identifying potential substrates and/or inhibitors of clinically relevant transporters. Our goal was to generate OATP1B1 in vitro inhibition data for [3H] estrone-3-sulfate (E3S) transport in CHO cells and use it to build machine learning models, comparing eight different classification algorithms (deep learning, AdaBoosted decision trees, Bernoulli naïve Bayes, k-nearest neighbors (knn), random forest, support vector classifier (SVC), logistic regression (lreg), and XGBoost (xgb)) using ECFP6 fingerprints with 5-fold, nested cross-validation. In addition, we compared models using 3D pharmacophores, simple chemical descriptors alone or combined with ECFP6, as well as ECFP4 and ECFP8 fingerprints. Several machine learning algorithms (SVC, lreg, xgb, and knn) had excellent nested cross-validation statistics, particularly for accuracy, AUC, and specificity. An external test set containing 207 unique compounds not in the training set demonstrated that, at every threshold, SVC outperformed the other algorithms based on a rank-normalized score. A prospective validation set was chosen using prediction scores from the SVC models with ECFP fingerprints and tested in vitro; 15 of 19 compounds (84% accuracy) predicted as active (≥20% inhibition) showed inhibition. Of these compounds, six (abamectin, asiaticoside, berbamine, doramectin, mobocertinib, and umbralisib) appear to be novel inhibitors of OATP1B1 not previously reported. These validated machine learning models can now be used to make predictions for drug–drug interactions for human OATP1B1 alongside other machine learning models for important drug transporters in our MegaTrans software.
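
    As an illustration only (not the authors' implementation), the sketch below shows the kind of nested 5-fold cross-validation of an SVC on ECFP6 fingerprints described above, using RDKit and scikit-learn; the csv file name, column names, and hyperparameter grid are assumptions.

    # Sketch: nested 5-fold cross-validation of an SVC on ECFP6 fingerprints.
    # Assumes a CSV with "smiles" and "active" columns; adjust to the actual data.
    import numpy as np
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    def ecfp6(smiles, n_bits=2048):
        """ECFP6 corresponds to a Morgan fingerprint with radius 3."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)
        return np.array([int(b) for b in fp.ToBitString()], dtype=np.int8)

    df = pd.read_csv("oatp1b1_inhibition.csv")              # hypothetical file name
    X = np.vstack([ecfp6(s) for s in df["smiles"]])
    y = df["active"].values                                 # 1 = inhibitor, 0 = non-inhibitor

    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                        scoring="roc_auc", cv=inner)
    scores = cross_val_score(grid, X, y, scoring="roc_auc", cv=outer)
    print("nested CV AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))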

  2. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • data.nist.gov
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  3. Data from: Simulated Design–Build–Test–Learn Cycles for Consistent...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Aug 24, 2023
    Cite
    Paul van Lent; Joep Schmitz; Thomas Abeel (2023). Simulated Design–Build–Test–Learn Cycles for Consistent Comparison of Machine Learning Methods in Metabolic Engineering [Dataset]. http://doi.org/10.1021/acssynbio.3c00186.s001
    Available download formats: xlsx
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    ACS Publications
    Authors
    Paul van Lent; Joep Schmitz; Thomas Abeel
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Combinatorial pathway optimization is an important tool in metabolic flux optimization. Simultaneous optimization of a large number of pathway genes often leads to combinatorial explosions. Strain optimization is therefore often performed using iterative design–build–test–learn (DBTL) cycles. The aim of these cycles is to develop a product strain iteratively, every time incorporating learning from the previous cycle. Machine learning methods provide a potentially powerful tool to learn from data and propose new designs for the next DBTL cycle. However, due to the lack of a framework for consistently testing the performance of machine learning methods over multiple DBTL cycles, evaluating the effectiveness of these methods remains a challenge. In this work, we propose a mechanistic kinetic model-based framework to test and optimize machine learning for iterative combinatorial pathway optimization. Using this framework, we show that gradient boosting and random forest models outperform the other tested methods in the low-data regime. We demonstrate that these methods are robust to training set biases and experimental noise. Finally, we introduce an algorithm for recommending new designs using machine learning model predictions. We show that when the number of strains to be built is limited, starting with a large initial DBTL cycle is favorable over building the same number of strains for every cycle.
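
    As a rough sketch of the kind of low-data model comparison described above (not the authors' simulation framework), the snippet below cross-validates gradient boosting and random forest regressors on a small synthetic design matrix; the data generation and model settings are placeholders.

    # Sketch: gradient boosting vs. random forest in a low-data regime,
    # as one might do inside a simulated DBTL cycle. Data here is synthetic.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(48, 6))     # e.g. 6 pathway genes, 4 expression levels each
    y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=48)   # placeholder "production" response

    for name, model in [("gradient boosting", GradientBoostingRegressor(random_state=0)),
                        ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {r2.mean():.2f}")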

  4. DCASE 2024 Challenge Task 2 Additional Training Dataset

    • data.niaid.nih.gov
    Updated May 15, 2024
    + more versions
    Cite
    Takashi, Endo (2024). DCASE 2024 Challenge Task 2 Additional Training Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11183283
    Dataset updated
    May 15, 2024
    Dataset provided by
    Augusti, Filippo
    Yohei, Kawaguchi
    Keisuke, Imoto
    Tomoya, Nishida
    Albertini, Davide
    Kota, Dohi
    Noboru, Harada
    Sannino, Roberto
    Daisuke, Niizumi
    Takashi, Endo
    Pradolini, Simone
    Harsh, Purohit
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is the "additional training dataset" for the DCASE 2024 Challenge Task 2.

    The data consists of the normal/anomalous operating sounds of nine types of real/toy machines. Each recording is a single-channel audio that includes both a machine's operating sound and environmental noise. The duration of recordings varies from 6 to 10 seconds. The following nine types of real/toy machines are used in this task:

    3DPrinter

    AirCompressor

    BrushlessMotor

    HairDryer

    HoveringDrone

    RoboticArm

    Scanner

    ToothBrush

    ToyCircuit

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up from DCASE 2020 Task 2 to DCASE 2023 Task 2. The task this year is to develop an ASD system that meets the following five requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)

    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

    2. Detect anomalies regardless of domain shifts (domain generalization task)

    In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2 and DCASE 2023 Task 2.

    3. Train a model for a completely new machine type

    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement is the same as in DCASE 2023 Task 2.

    4. Train a model using a limited number of machines from its machine type

    While sounds from multiple machines of the same machine type can be used to enhance the detection performance, it is often the case that only a limited number of machines are available for a machine type. In such a case, the system should be able to train models using a few machines from a machine type. This requirement is the same as in DCASE 2023 Task 2.

    5. Train a model both with and without attribute information

    While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.

    The last requirement is newly introduced in DCASE 2024 Task 2.

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    "Machine type" indicates the type of machine, which in the additional training dataset is one of nine: 3D-printer, air compressor, brushless motor, hair dryer, hovering drone, robotic arm, document scanner (scanner), toothbrush, and Toy circuit.

    A section is defined as a subset of the dataset for calculating performance metrics.

    The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

    Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.

    Dataset

    This dataset consists of nine machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on another separate Zenodo page as an "evaluation dataset" for the DCASE 2024 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training clip include machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by the respective file names. Note that for machine types whose attribute information is hidden, the attributes in the file names are labeled only as "noAttributes". Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, the names of parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
    

    For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
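
    A minimal parsing sketch for these attribute csv files, assuming rows follow the [filename, d1p, d1v, d2p, d2v, ...] layout above (blank columns for machine types with hidden attributes); the path is a placeholder.

    # Sketch: read an attribute csv into {filename: {parameter: value}}.
    import csv

    def load_attributes(path):
        attrs = {}
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                filename, rest = row[0], row[1:]
                # pair up (d1p, d1v), (d2p, d2v), ...; skip blank columns (hidden attributes)
                attrs[filename] = {rest[i]: rest[i + 1]
                                   for i in range(0, len(rest) - 1, 2) if rest[i]}
        return attrs

    attrs = load_attributes("3DPrinter/attributes_00.csv")   # placeholder relative path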

    Recording procedure

    Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

    Directory structure

    • /eval_data

      • /raw
        • /3DPrinter
          • /train (only normal clips)
            • /section_00_source_train_normal_0001_.wav
            • ...
            • /section_00_source_train_normal_0990_.wav
            • /section_00_target_train_normal_0001_.wav
            • ...
            • /section_00_target_train_normal_0010_.wav
          • attributes_00.csv (attribute csv for section 00)
        • /AirCompressor (The other machine types have the same directory structure as 3DPrinter.)
        • /BrushlessMotor
        • /HairDryer
        • /HoveringDrone
        • /RoboticArm
        • /Scanner
        • /ToothBrush
        • /ToyCircuit

    Baseline system

    The baseline system is available on the GitHub repository. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd., NTT Corporation and STMicroelectronics and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    Contact

    If there is any problem, please contact us:

    Tomoya Nishida, tomoya.nishida.ax@hitachi.com

    Keisuke Imoto, keisuke.imoto@ieee.org

    Noboru Harada, noboru@ieee.org

    Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp

    Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com

  5. PAN22 Authorship Analysis: Style Change Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 6, 2023
    + more versions
    Cite
    Tschuggnall, Michael (2023). PAN22 Authorship Analysis: Style Change Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6334244
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Tschuggnall, Michael
    Zangerle, Eva
    Stein, Benno
    Potthast, Martin
    Mayerl, Maximilian
    Description

    This is the dataset for the Style Change Detection task of PAN 2022.

    Task

    The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: If multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question is among the most difficult and most interesting challenges in author identification: style change detection is the only means to detect plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support.

    Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), determining the actual number of authors within a document (2019), detecting whether there was a style change between two consecutive paragraphs (2020, 2021), and locating the actual style changes (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to entice novices and experts alike:

    Given a document, we ask participants to solve the following three tasks:

    [Task1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors’ texts on the paragraph-level),

    [Task2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document)

    [Task3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.

    All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.

    Data

    To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3).

    Each dataset is split into three parts:

    training set: Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.

    validation set: Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.

    test set: Contains 15% of the whole dataset, no ground truth data is given. This set is used for evaluation (see later).

    You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.

    Input Format

    The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively.

    For each problem instance X (i.e., each input document), two files are provided:

    problem-X.txt contains the actual text, where paragraphs are denoted by for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by ).

    truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format. An example file is listed in the following (note that we list keys for the three tasks here):

    { "authors": NUMBER_OF_AUTHORS, "site": SOURCE_SITE, "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3, "paragraph-authors": RESULT_ARRAY_TASK2 }

    The result for task 1 (key "changes") is represented as an array, holding a binary for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured similarly to the results array for task 1; however, for task 3, the changes array holds a binary for each pair of consecutive sentences, and there may be multiple style changes in the document.

    An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):

    { "changes": [0,0,1,...], "paragraph-authors": [1,1,1,2,...] }
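
    The two arrays are linked: a change is recorded exactly where consecutive units are written by different authors. A small sketch checking that consistency for a truth file that contains both keys (the problem id is an example):

    # Sketch: check that "changes" agrees with "paragraph-authors" in a truth file.
    import json

    with open("truth-problem-1.json") as f:      # example problem id
        truth = json.load(f)

    authors = truth["paragraph-authors"]
    derived = [int(a != b) for a, b in zip(authors, authors[1:])]
    assert derived == truth["changes"], "changes array disagrees with paragraph-authors"
    print("authors:", truth["authors"], "| changes:", truth["changes"])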

    Output Format

    To evaluate the solutions for the tasks, the results have to be stored in a single file for each of the input documents and each of the datasets. Please note that we require a solution file to be generated for each input problem for each dataset. The data structure during the evaluation phase will be similar to that in the training phase, with the exception that the ground truth files are missing.

    For each given problem problem-X.txt, your software should output the missing solution file solution-problem-X.json, containing a JSON object holding the solution to the respective task. The solution for tasks 1 and 3 is an array containing a binary value for each pair of consecutive paragraphs (task 1) or sentences (task 3). For task 2, the solution is an array containing the order of authors contained in the document (as in the truth files).

    An example solution file for tasks 1 and 3 is featured in the following (note again that for task 1, changes are captured on the paragraph level, whereas for task 3, changes are captured on the sentence level):

    { "changes": [0,0,1,0,0,...] }

    For task 2, the solution file looks as follows:

    { "paragraph-authors": [1,1,2,2,3,2,...] }

  6. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  7. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have just performed a train, test, and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
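
    A minimal loading sketch using the Hugging Face datasets library; the split names ("train", "validation", "test") are assumed from the dataset name.

    # Sketch: load the pre-split Alpaca dataset from the Hugging Face Hub.
    from datasets import load_dataset

    ds = load_dataset("disham993/alpaca-train-validation-test-split")
    print(ds)                    # expected splits: train / validation / test (assumed)
    print(ds["train"][0])        # one instruction/demonstration record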

  8. An Efficient Method for Predicting Soil Thickness in Large-scale Area Based...

    • figshare.com
    xlsx
    Updated Jun 17, 2020
    Cite
    Wei Wang (2020). An Efficient Method for Predicting Soil Thickness in Large-scale Area Based on Cluster Sampling [Dataset]. http://doi.org/10.6084/m9.figshare.12496841.v1
    Available download formats: xlsx
    Dataset updated
    Jun 17, 2020
    Dataset provided by
    figshare
    Authors
    Wei Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    After fast mean shift (FMS) clustering, the whole research area was divided into 10 subareas, and new samples that characterize the geographical features of each subarea were collected through field investigations. Because of our limited human and material resources, it is difficult to conduct a large amount of sampling in each subarea, so a reasonable field sampling strategy was needed to make the most of the limited resources. For the first two large subareas, we collected 70 field samples each, labeled as the first and second sample sets, which were used to build their own GWR models for extended prediction of unobserved points in each area, i.e., local extension prediction. The remaining 8 small subareas received moderate numbers of samples according to their size: if a subarea contains more than 5,000 raster points, 16 samples were collected from it; otherwise, 12 samples were taken. In this way, a total of 112 samples were put together as the third sample set, and the third GWR model was constructed to achieve the global extension prediction of the 8 subareas. In addition, the three sample sets were each divided into a training set and a test set. For the first two sample sets, the ratio of training to test samples is 5:2, i.e., the training set contains 50 samples and the test set has 20 samples. Because the third sample set is composed of samples from 8 subareas, we divided the samples of each subarea into training and test sets at a ratio of 3:1. In other words, the number of training samples from the third to tenth subareas is 12, 9, 9, 12, 9, 12, 12, and 9, respectively (84 training samples in total), and the number of test samples from the eight subareas is 4, 3, 3, 4, 3, 4, 4, and 3, respectively (a total of 28 samples).

  9. Calabi-Yau: CICY-4 folds

    • kaggle.com
    Updated Dec 4, 2024
    Cite
    lorrespz (2024). Calabi-Yau: CICY-4 folds [Dataset]. https://www.kaggle.com/datasets/lorresprz/calabi-yau-cicy-4-folds
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    lorrespz
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains the complete intersection Calabi-Yau four-folds (CICY4) configuration matrices and their four Hodge numbers, designed for the problem of machine learning the Hodge numbers using the configuration matrices as inputs to a neural network model.

    The original data for CICY4 is from the paper "Topological Invariants and Fibration Structure of Complete Intersection Calabi-Yau Four-Folds" (arXiv:1405.2073) and can be downloaded in either text or Mathematica format from: https://www-thphys.physics.ox.ac.uk/projects/CalabiYau/Cicy4folds/index.html

    The full CICY4 data included with this dataset in npy format (conf.npy, hodge.npy, direct.npy) is created by running the script 'create_data.py' from https://github.com/robin-schneider/cicy-fourfolds. Given this full data, the following two additional datasets at 72% and 80% training ratios were created.

    At 72% data split:

    • The train dataset consists of the files (conf_Xtrain.npy, hodge_ytrain.npy)
    • The validation dataset consists of the files (conf_Xvalid.npy, hodge_yvalid.npy)
    • The test dataset consists of the files (conf_Xtest.npy, hodge_ytest.npy)

    At 80% data split, the 3 datasets are:

    • (conf_Xtrain_80.npy, hodge_ytrain_80.npy)
    • (conf_Xvalid.npy, hodge_yvalid.npy)
    • (conf_Xtest_80.npy, hodge_ytest_80.npy)

    The new train and test sets were formed from the old ones: the old test set is divided into 2 parts with the ratio (0.6, 0.4). The 0.6-partition becomes the new test set, and the 0.4-partition is merged with the old train set to form the new train set.
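
    A hedged numpy sketch of the 80% re-split described above: the old test arrays are split 0.6/0.4 and the 0.4 part is merged into the old training set. The random seed is an assumption, so this reproduces the procedure rather than the exact published files.

    # Sketch: rebuild the 80%-ratio split from the 72%-ratio arrays.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X_train, y_train = np.load("conf_Xtrain.npy"), np.load("hodge_ytrain.npy")
    X_test, y_test = np.load("conf_Xtest.npy"), np.load("hodge_ytest.npy")

    # Old test set -> 60% becomes the new test set, 40% is merged into training.
    X_test_new, X_merge, y_test_new, y_merge = train_test_split(
        X_test, y_test, test_size=0.4, random_state=0)

    X_train_80 = np.concatenate([X_train, X_merge])
    y_train_80 = np.concatenate([y_train, y_merge])
    np.save("conf_Xtrain_80.npy", X_train_80)
    np.save("hodge_ytrain_80.npy", y_train_80)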

    Trained neural network models and their training/validation losses:

    • 12 models were trained on the 72% dataset and their checkpoints are stored in the folder 'trained_models'. The 12 csv files containing the train+validation losses of these models are stored in the folder 'train-validation-losses'.
    • At 80% data split, the top 3 performing models trained on the 72% dataset were retrained and their checkpoints are stored in 'trained_models_80pc_split', together with the 3 csv files containing the loss values during the training phase.

    Inference notebook: The inference notebook using this dataset is https://www.kaggle.com/code/lorresprz/cicy4-training-results-inference-all-models

    Publication: This dataset was created for the work: Deep Learning Calabi-Yau four folds with hybrid and recurrent neural network architectures, https://arxiv.org/abs/2405.17406

  10. DCASE 2023 Challenge Task 2 Development Dataset

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated May 2, 2023
    + more versions
    Cite
    Kota Dohi (2023). DCASE 2023 Challenge Task 2 Development Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7687463
    Dataset updated
    May 2, 2023
    Dataset provided by
    Noboru
    Harsh
    Keisuke
    Takashi
    Daisuke
    Tomoya
    Kota Dohi
    Yohei
    Yuma
    Description

    This dataset is the "development dataset" for the DCASE 2023 Challenge Task 2 "First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring".

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

    ToyCar

    ToyTrain

    Fan

    Gearbox

    Bearing

    Slide rail

    Valve

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up from DCASE 2020 Task 2 to DCASE 2022 Task 2. The task this year is to develop an ASD system that meets the following four requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)

    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

    2. Detect anomalies regardless of domain shifts (domain generalization task)

    In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard-to-notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2.

    3. Train a model for a completely new machine type

    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning.

    4. Train a model using only one machine from its machine type

    While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that sound data from only one machine are available for a machine type. In such a case, the system should be able to train models using only one machine from a machine type.

    The last two requirements are newly introduced in DCASE 2023 Task 2 as the "first-shot problem".

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    "Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.

    A section is defined as a subset of the dataset for calculating performance metrics.

    The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

    Attributes are parameters that define states of machines or types of noise.

    Dataset

    This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
    

    Recording procedure

    Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

    Directory structure

    • /dev_data

      • /raw
        • /fan
          • /train (only normal clips)
            • /section_00_source_train_normal_0000_.wav
            • ...
            • /section_00_source_train_normal_0989_.wav
            • /section_00_target_train_normal_0000_.wav
            • ...
            • /section_00_target_train_normal_0009_.wav
          • /test
            • /section_00_source_test_normal_0000_.wav
            • ...
            • /section_00_source_test_normal_0049_.wav
            • /section_00_source_test_anomaly_0000_.wav
            • ...
            • /section_00_source_test_anomaly_0049_.wav
            • /section_00_target_test_normal_0000_.wav
            • ...
            • /section_00_target_test_normal_0049_.wav
            • /section_00_target_test_anomaly_0000_.wav
            • ...
            • /section_00_target_test_anomaly_0049_.wav
          • attributes_00.csv (attribute csv for section 00)
      • /gearbox (The other machine types have the same directory structure as fan.)
      • /bearing
      • /slider (slider means "slide rail")
      • /ToyCar
      • /ToyTrain
      • /valve
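
    Given the file-naming convention visible above (section_<idx>_<domain>_<split>_<label>_<id>_.wav), a small hedged sketch for indexing the development data; the root path is a placeholder.

    # Sketch: walk the development data and parse machine type, section, domain,
    # split and label from each wav path, following the naming convention above.
    from pathlib import Path

    records = []
    for wav in Path("dev_data/raw").rglob("*.wav"):      # placeholder root directory
        parts = wav.stem.split("_")                      # ["section", "00", "source", "train", "normal", "0000", ""]
        records.append({
            "path": str(wav),
            "machine_type": wav.parts[-3],               # e.g. "fan"
            "section": parts[1],
            "domain": parts[2],                          # source / target
            "split": parts[3],                           # train / test
            "label": parts[4],                           # normal / anomaly
        })
    print(len(records), "clips indexed")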

    Baseline system

    The baseline system is available on the GitHub repository dcase2023_task2_baseline_ae. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    If you use this dataset, please cite all the following papers. We will publish a paper on the description of the DCASE 2023 Task 2, so please make sure to cite that paper, too.

    Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda. First-shot anomaly detection for machine condition monitoring: A domain generalization baseline. In arXiv e-prints: 2303.00455, 2023. [URL]

    Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 31-35. Nancy, France, November 2022. [URL]

    Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5. Barcelona, Spain, November 2021. [URL]

    Contact

    If there is any problem, please contact us:

    Kota Dohi, kota.dohi.gr@hitachi.com

    Keisuke Imoto, keisuke.imoto@ieee.org

    Noboru Harada, noboru@ieee.org

    Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp

    Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com

  11. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  12. IMDB Movie Reviews (Binary Sentiment)

    • opendatabay.com
    .csv
    Updated Jun 18, 2025
    + more versions
    Cite
    Datasimple (2025). IMDB Movie Reviews (Binary Sentiment) [Dataset]. https://www.opendatabay.com/data/ai-ml/c48f7110-3d06-45be-9cae-aa8799720eec
    Available download formats: .csv
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    Source Huggingface Hub: link

    About this dataset

    This is a large dataset for binary sentiment classification, containing substantially more data than previous benchmark datasets. Provided are 25,000 highly polar movie reviews for training and 25,000 for testing. There is also additional unlabeled data available for use. The data fields are consistent among all splits of the dataset.

    How to use the dataset

    In order to use this dataset, you will need to first download the IMDB Large Movie Review Dataset. Once you have downloaded the dataset, you can either use it in its original form or split it into training and testing sets. To split the dataset, you will need to create a new file called unsupervised.csv and copy the text column from train.csv into it. You can then split unsupervised.csv into two files: train_unsupervised.csv and test_unsupervised.csv.

    Once you have either the original dataset or the training and testing sets, you can begin using them for binary sentiment classification. In order to do this, you will need to use a machine learning algorithm that is capable of performing binary classification, such as logistic regression or support vector machines. Once you have trained your model on the training set, you can then evaluate its performance on the test set by predicting the labels of the reviews in test_unsupervised.csv
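
    As a hedged sketch of the workflow described above, the snippet below trains a TF-IDF plus logistic-regression baseline with scikit-learn; the csv file names and the "text"/"label" column names are assumptions about this particular export.

    # Sketch: TF-IDF + logistic regression baseline for binary sentiment classification.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    train = pd.read_csv("train.csv")      # assumed columns: "text", "label" (0 = negative, 1 = positive)
    test = pd.read_csv("test.csv")

    vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    X_train = vec.fit_transform(train["text"])
    X_test = vec.transform(test["text"])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train["label"])
    print("test accuracy:", accuracy_score(test["label"], clf.predict(X_test)))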

    Research Ideas

    • This dataset can be used to train a binary sentiment classification model.
    • This dataset can be used to train a model to classify movie reviews into positive and negative sentiment categories.
    • This dataset can be used to build a large movie review database for research purposes.

    License

    CC0

    Original Data Source: IMDB Movie Reviews (Binary Sentiment)

  13. Data from: Candidate predictors

    • plos.figshare.com
    xls
    Updated May 6, 2025
    Cite
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid (2025). Candidate predictors [Dataset]. http://doi.org/10.1371/journal.pdig.0000820.t002
    Available download formats: xls
    Dataset updated
    May 6, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many comparisons of statistical regression and machine learning algorithms to build clinical predictive models use inadequate methods to build regression models and do not have proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh, using data from the NIRUDAK study derivation and validation cohorts. Proportional Odds Logistic Regression (POLR), penalized ordinal regression (RIDGE), classification trees (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set, and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with a particularly large reduction for RF and RIDGE. POLR had the best performance in the test dataset and was also most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, need to be prospectively validated on external test sets if possible, because internal validation may give an overly optimistic picture of model performance. Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
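
    The ordinal discrimination indices mentioned above are pairwise concordance measures. As an illustration only (not the study's exact implementation), a generalized concordance index can be computed by counting, among all pairs with different outcome levels, how often the more severe outcome received the higher risk score:

    # Sketch: pairwise concordance for an ordinal outcome
    # (0 = no, 1 = some, 2 = severe dehydration) given continuous risk scores.
    import itertools
    import numpy as np

    def ordinal_concordance(y, score):
        """Fraction of pairs with different outcome levels in which the higher
        outcome also received the higher score (ties in score count 0.5)."""
        concordant, total = 0.0, 0
        for i, j in itertools.combinations(range(len(y)), 2):
            if y[i] == y[j]:
                continue
            total += 1
            hi, lo = (i, j) if y[i] > y[j] else (j, i)
            if score[hi] > score[lo]:
                concordant += 1.0
            elif score[hi] == score[lo]:
                concordant += 0.5
        return concordant / total

    y = np.array([0, 2, 1, 2, 0, 1])                       # placeholder outcomes
    score = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.5])       # placeholder model scores
    print(round(ordinal_concordance(y, score), 3))         # 1.0 when all pairs are ranked correctly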

  14. SQuAD-it (Italian SQuAD)

    • kaggle.com
    Updated Dec 2, 2022
    Cite
    The Devastator (2022). SQuAD-it (Italian SQuAD) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-the-secrets-of-italian-qa-with-squad-it/code
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SQuAD-it (Italian SQuAD)

    Semi-automatic translation of the SQuAD dataset into Italian

    By Huggingface Hub [source]

    About this dataset

    SQuAD-it is the perfect resource for Italian language learners and Natural Language Processing (NLP) experts alike! This dataset includes a collection of semi-automatically translated question-answer pairs from the SQuAD dataset, giving you an expansive knowledge base in your chosen language. With this robust set of Italian text made available through the SQuAD-it dataset, you can access both a training set (in train.csv) and a testing set (in test.csv) to evaluate and power up your answers! Unlock the wealth of insight that lies within this insightful collection today and boost your NLP experience with SQuAD-it: Italian QA at Your Fingertips!


    How to use the dataset

    This guide will help you get the most out of the SQuAD-it dataset. The SQuAD-it dataset is derived from the popular English language SQuAD (Stanford Question Answering Dataset) and it consists of more than 160,000 open question-answer pairs in Italian. This dataset can be used to power up Natural Language Processing models with a focused Italian language knowledge base.

    • Get Familiar with the Data: First, get familiar with the data and its format by understanding what columns are included in the train/test CSV files and their purpose. Each row contains a context and an answer related to it, which gives you insight into how NLP models can be trained on this data set.

    • Consider Your Use Case: Think about how your project will benefit from using this dataset – is it to build an Italian language Q&A system, or just extract useful facts from them? Being clear on your goals will help guide which features you’ll need to focus on while exploiting this data set for maximum benefit.

    • Brainstorm Possible Models: There are many different NLP approaches that could be used with this dataset, such as text summarization techniques or information extraction based on supervised machine learning (e.g., rule-based parsing, statistical models for text categorization/transformation, syntactic parsing, or utterance detection). Establishing what kind of model you want to create before beginning any experimentation may save time down the line when tuning parameters for performance or making improvements based on earlier design decisions.

    • Create Meaningful Features: Carefully design features that capture the meaning behind each question-answer pair, so that the salient points of discussion can later be used as inputs to machine learning models without introducing bias. Features should describe the meaning on both sides without assigning preconceived notions of correct answers; ultimately, reasonable feature representations should provide meaningful signals beyond simple word counts, such as word embeddings or topic models.

    • Develop Your Model Pipelines & Systems: Use libraries such as scikit-learn, TensorFlow, Keras, PyTorch, and other popular tools as needed for your chosen use case, rather than blindly picking one over another; select deliberately and know why each library was chosen, including the consequences of being unprepared at later development stages (e.g., forgetting about regularization hyperparameters). A minimal loading sketch follows this list.
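
    As referenced in the last step above, a minimal loading sketch with pandas might look like the following; the file paths and column names are assumptions to verify against the downloaded CSV headers.

    import pandas as pd

    # Load the SQuAD-it splits (paths are placeholders for the downloaded files).
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")

    print(train_df.shape, test_df.shape)   # number of question-answer pairs per split
    print(train_df.columns.tolist())       # check the real column names first

    # Peek at one example, if the assumed columns are present.
    sample = train_df.iloc[0]
    for col in ("context", "question", "answer"):
        if col in train_df.columns:
            print(f"{col}: {str(sample[col])[:100]}")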

    Research Ideas

    • Developing conversational AI systems that are capable of understanding questions posed in Italian and providing relevant answers to those questions.
    • Developing Machine Learning models that can identify complex topics discussed in Italian text corpora, so as to facilitate more efficient searching for content such as news articles or blog posts in the Italian language.
    • Creating automatic summarization algorithms using the question-answer pairs from the SQuAD-it dataset, which can then be used to generate overviews of lengthy texts written in Italian with minimal human assistance.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See O...

  15. Codecfake dataset - development set

    • zenodo.org
    bin, zip
    Updated May 16, 2024
    + more versions
    Cite
    Yuankun Xie; Yuankun Xie (2024). Codecfake dataset - development set [Dataset]. http://doi.org/10.5281/zenodo.11169872
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    May 16, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yuankun Xie; Yuankun Xie
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset is the development set of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".

    Abstract

    With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALM directly utilizes neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minima. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
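
    Since results are reported as Equal Error Rate (EER), the following is a minimal sketch of how EER is commonly computed from detector scores; numpy and scikit-learn are assumed and are not part of this dataset release.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        # EER is the operating point where the false-positive rate equals
        # the false-negative rate.
        fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FPR ~= FNR
        return float((fpr[idx] + fnr[idx]) / 2.0)  # average the two rates there

    # Toy usage: 1 = fake (target), 0 = real; scores are detector outputs for "fake".
    labels = np.array([1, 1, 0, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2])
    print(f"EER = {equal_error_rate(labels, scores):.3%}")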

    Codecfake Dataset

    Due to platform restrictions on the size of zenodo repositories, we have divided the Codecfake dataset into various subsets as shown in the table below:

    Codecfake dataset | description | link
    training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z06 | https://zenodo.org/records/11171708
    training set (part 2 of 3) | train_split.z07 - train_split.z14 | https://zenodo.org/records/11171720
    training set (part 3 of 3) | train_split.z15 - train_split.z19 | https://zenodo.org/records/11171724
    development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/11169872
    test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/11169781
    test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029

    Countermeasure

    The source code of the countermeasure and pre-trained model are available on GitHub https://github.com/xieyuankun/Codecfake.

    The Codecfake dataset and pre-trained model are licensed with CC BY-NC-ND 4.0 license.

  16. Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    Cite
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

    This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

    GitHub: github.com/ChristianBirchler/sdc-scissor

    This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.

    Usage

    Demo

    YouTube Link

    Installation

    The tool can either be run with Docker or locally using Poetry.

    When running the simulations, a working installation of BeamNG.research is required. Additionally, the simulation itself cannot run in a Docker container; it must run locally.

    To install the application use one of the following approaches:

    • Docker: docker build --tag sdc-scissor .
    • Poetry: poetry install

    Using the Tool

    The tool can be used with the following two commands:

    • Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
    • Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

    There are multiple commands available. To simplify the documentation, only the commands and their options are described.

    • Generation of tests:
      • generate-tests --out-path /path/to/store/tests
    • Automated labeling of Tests:
      • label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
      • Note: This only works locally with BeamNG.research installed
    • Model evaluation:
      • evaluate-models --dataset /path/to/train/set --save
    • Split train and test data:
      • split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
    • Test outcome prediction:
      • predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
    • Evaluation based on random strategy:
      • evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

    The possible parameters are always documented with --help.
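
    To illustrate what the predict-tests step does conceptually, the sketch below loads a persisted scikit-learn classifier from a .joblib file and applies it to a feature table; the file name and feature columns are placeholders rather than part of SDC-Scissor itself.

    import joblib
    import pandas as pd

    # Load a previously trained classifier (path is a placeholder).
    clf = joblib.load("model.joblib")

    # Hypothetical road-scenario features computed beforehand; the real tool
    # derives its own features from the test scenarios.
    scenarios = pd.DataFrame(
        {"num_turns": [3, 7], "road_length": [420.0, 910.5], "max_curvature": [0.02, 0.11]}
    )

    # Predict whether each scenario is likely to pass or fail.
    print(clf.predict(scenarios))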

    Linting

    The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

    poetry run flake8 .
    poetry run pylint **/*.py

    License

    The software we developed is distributed under GNU GPL license. See the LICENSE.md file.

    Contacts

    Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch

    Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch

    Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch

    Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

    Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch

    References

    • Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

    If you use this tool in your research, please cite the following papers:

    @INPROCEEDINGS{Birchler2022,
     author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
     booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
     title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
     year={2022},
    }
  17. Tree Point Classification - New Zealand

    • hub.arcgis.com
    Updated Jul 25, 2022
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://hub.arcgis.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Explore at:
    Dataset updated
    Jul 25, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. The model is optimized to work with New Zealand aerial LiDAR data. Classifying point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas of New Zealand. The training, testing, and validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.

    Licensing requirements
    ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

    Using the model
    The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

    Input
    The model is trained with classified LiDAR that follows the LINZ base specification, and the input data should be similar to this specification. Note: the model depends on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. The model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y, and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne LiDAR datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see the Training data section below).

    Output
    The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):
    0 - Background
    5 - Trees / High-vegetation

    Applicable geographies
    The model is expected to work well in New Zealand and has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.
    Training dataset - Wellington City
    Testing dataset - Tawa City
    Validation/Evaluation dataset - Christchurch City

    Model architecture
    This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.

    Accuracy metrics
    The table below summarizes the accuracy of the predictions on the validation dataset.
    Class | Precision | Recall | F1-score
    Never Classified | 0.991200 | 0.975404 | 0.983239
    High Vegetation | 0.933569 | 0.975559 | 0.954102

    Training data
    This model was trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}; this ratio was chosen based on analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:
    X, Y, and Z linear unit | Meter
    Z range | -121.69 m to 26.84 m
    Number of Returns | 1 to 5
    Intensity | 16 to 65520
    Point spacing | 0.2 ± 0.1
    Scan angle | -15 to +15
    Maximum points per block | 8192
    Block size | 20 meters
    Class structure | [0, 5]

    Sample results
    Model used to classify the Christchurch city dataset with 5 pts/m density. The model's performance is directly proportional to the dataset point density and to noise-excluded point clouds. To learn how to use this model, see this story.
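
    If fine-tuning is required, a rough sketch with the ArcGIS API for Python could look like the following; the paths, batch size, and training settings are illustrative assumptions and should be checked against the current arcgis.learn documentation.

    # Sketch only: fine-tuning the pre-trained tree-classification model on
    # newly exported point-cloud training blocks. All paths are placeholders.
    from arcgis.learn import prepare_data, PointCNN

    data = prepare_data(r"C:\data\exported_blocks", dataset_type="PointCloud", batch_size=2)

    # Load the downloaded pre-trained package and attach the new data.
    model = PointCNN.from_model(r"C:\models\TreePointClassification_NZ.dlpk", data)

    model.fit(epochs=10, lr=0.001)                 # brief fine-tuning pass
    model.save("TreePointClassification_finetuned")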

  18. Solar flare forecasting based on magnetogram sequences learning with MViT...

    • redu.unicamp.br
    • data.niaid.nih.gov
    Updated Jul 15, 2024
    Cite
    Repositório de Dados de Pesquisa da Unicamp (2024). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.25824/redu/IH0AH0
    Explore at:
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Repositório de Dados de Pesquisa da Unicamp
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
    Description

    Source code and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples with 16 sequenced 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1.

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples to apply 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we apply oversampling only on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoints and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the root directory of the project, there are two folders:

    • magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. Note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
    • Seq_Magnetogram: contains the references to source images with the corresponding labels in the next 24 h and 48 h, in the M24 and M48 sub-folders, respectively.

    M24/M48: both present the following sub-folder structure: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in the root: inst_packages.sh, which installs the packages and dependencies needed to run the models, and download_MViTS.py, which downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", which verifies header info and checks the number of samples per label in the text files. All text files with the prefix "Seq16" inside the Seqs16 folder were created by the "criaseqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files, in which each file contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...), and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of the trained models.

    Naming pattern for the files:

    • magnetogram_jpg follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s...to.", where hmi is the instrument that captured the image, sharp_720s is the database source of SDO/HMI, followed by the identification of the SHARP region (which can contain one or more solar ARs classified by NOAA), the date-time the instrument captured the image in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds), and the date-times when the sequence starts and ends, in the same format.
    • Reference text files in M24 and M48 or inside SF_MViT... folders follow the format "flare_Mclass_.txt", where the prefix is Seq16 if the file refers to a sequence, or empty if it refers directly to images; the horizon is "24h" or "48h"; the split is "TrainVal" or "Test" (referring to the Train/Val split); and a "_over" suffix after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Model training codes: "SF_MViT_M+_", where the variant is empty, "oT" (over Train), "oTV" (over Train and Val), or "oTV_Test" (over Train, Val and Test); the horizon is "24h" or "48h"; the split is "oneSplit" for a specific split or "allSplits" if all splits are run; and the GPU setting is empty by default for 1 GPU or "2gpu" for 2-GPU systems.
    • Job submission files: "jobMViT_", where the suffix points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
    • Temporary inputs: "Seq16_flare_Mclass_.txt", where the split is train or val, and a "_over" suffix after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Outputs: "saida_MViT_Adam_10-7", where the suffix k0 to k4 identifies the corresponding split, or is empty if the output covers all splits.
    • Error files: "err_MViT_Adam_10-7", with the same k0 to k4 convention as the outputs.
    • Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", recording the epoch number of the checkpoint, the corresponding validation loss, and the split index (0 to 4).
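
    As a rough illustration of the hybrid split described above (not code from the released project), the following sketch reproduces the 40,000/10,000 5-fold partition of the first 50,000 samples and the chronological hold-out test set using scikit-learn; the total sample count is taken from the description.

    import numpy as np
    from sklearn.model_selection import KFold

    # Chronologically ordered sample indices: 50,000 "known" + 9,834 held-out test.
    indices = np.arange(59_834)
    known = indices[:50_000]    # first 50,000: used for 5-fold cross-validation
    test = indices[50_000:]     # last 9,834: chronological test set (unknown data)

    kfold = KFold(n_splits=5, shuffle=False)  # contiguous folds preserve time order
    for fold, (train_idx, val_idx) in enumerate(kfold.split(known)):
        # 40,000 training and 10,000 validation samples per fold
        print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, test={len(test)}")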

  19. Student Performance Data Set (competition form)

    • kaggle.com
    zip
    Updated Dec 8, 2021
    Cite
    kukuroo3 (2021). Student Performance Data Set (competition form) [Dataset]. https://www.kaggle.com/kukuroo3/student-performance-data-set-competition-form
    Explore at:
    zip(24175 bytes)Available download formats
    Dataset updated
    Dec 8, 2021
    Authors
    kukuroo3
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was taken from the linked source, then preprocessed and separated into competition format. The labels for the test data are provided in the form of a function.

  20. Complete Blood Count (CBC)

    • kaggle.com
    Updated Aug 1, 2024
    Cite
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include: - Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights. - Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders. - Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.
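
    To make the usage notes above concrete, here is a minimal PyTorch/torchvision sketch; the folder layout, transform values, and library choice are assumptions rather than part of this dataset.

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Light augmentation + normalization for training images (values are illustrative).
    train_tf = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(10),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    eval_tf = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    # Assumed layout: cbc/{train,valid,test}/<class>/*.jpg (classification-style folders).
    train_loader = DataLoader(datasets.ImageFolder("cbc/train", train_tf), batch_size=16, shuffle=True)
    test_loader = DataLoader(datasets.ImageFolder("cbc/test", eval_tf), batch_size=16)

    # After training a model and collecting y_true / y_pred on test_loader, report
    # accuracy, precision, recall, and F1, e.g. with
    # sklearn.metrics.classification_report(y_true, y_pred).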

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.
