Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The uptake transporter OATP1B1 (SLCO1B1) is largely localized to the sinusoidal membrane of hepatocytes and is a known victim of unwanted drug–drug interactions. Computational models are useful for identifying potential substrates and/or inhibitors of clinically relevant transporters. Our goal was to generate OATP1B1 in vitro inhibition data for [3H] estrone-3-sulfate (E3S) transport in CHO cells and use it to build machine learning models to facilitate a comparison of eight different classification models [deep learning, Adaboosted decision trees, Bernoulli naïve Bayes, k-nearest neighbors (knn), random forest, support vector classifier (SVC), logistic regression (lreg), and XGBoost (xgb)] using ECFP6 fingerprints to perform 5-fold, nested cross-validation. In addition, we compared models using 3D pharmacophores, simple chemical descriptors alone or plus ECFP6, as well as ECFP4 and ECFP8 fingerprints. Several machine learning algorithms (SVC, lreg, xgb, and knn) had excellent nested cross-validation statistics, particularly for accuracy, AUC, and specificity. An external test set containing 207 unique compounds not in the training set demonstrated that at every threshold SVC outperformed the other algorithms based on a rank normalized score. A prospective validation test set was chosen using prediction scores from the SVC models with ECFP fingerprints and tested in vitro; 15 of 19 compounds (79% accuracy) predicted as active (≥20% inhibition) showed inhibition. Of these compounds, six (abamectin, asiaticoside, berbamine, doramectin, mobocertinib, and umbralisib) appear to be novel inhibitors of OATP1B1 not previously reported. These validated machine learning models can now be used to make predictions for drug–drug interactions for human OATP1B1 alongside other machine learning models for important drug transporters in our MegaTrans software.
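As a rough sketch of the kind of modeling pipeline described (not the authors' code; the SMILES strings and labels below are placeholders), ECFP6 corresponds to a Morgan fingerprint of radius 3, which can be fed to an SVC under nested cross-validation:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Placeholder molecules and inhibition labels (1 = active, i.e. >=20% inhibition).
smiles = ["CCO", "CCC", "CCN", "CCCl", "c1ccccc1", "c1ccncc1",
          "CC(=O)O", "CCOC", "CCS", "CC#N", "C1CCCCC1", "CC(C)O"]
labels = np.array([0, 1] * 6)

def ecfp6(smi, n_bits=2048):
    # ECFP6 corresponds to a Morgan fingerprint with radius 3.
    mol = Chem.MolFromSmiles(smi)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits))

X = np.array([ecfp6(s) for s in smiles])

# Inner loop tunes C; outer loop estimates generalization (the paper used 5 folds).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=2)
scores = cross_val_score(inner, X, labels, cv=StratifiedKFold(n_splits=3), scoring="roc_auc")
print("nested-CV AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```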
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC, and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

Similar to the previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Combinatorial pathway optimization is an important tool in metabolic flux optimization. Simultaneous optimization of a large number of pathway genes often leads to combinatorial explosions. Strain optimization is therefore often performed using iterative design–build–test–learn (DBTL) cycles. The aim of these cycles is to develop a product strain iteratively, every time incorporating learning from the previous cycle. Machine learning methods provide a potentially powerful tool to learn from data and propose new designs for the next DBTL cycle. However, due to the lack of a framework for consistently testing the performance of machine learning methods over multiple DBTL cycles, evaluating the effectiveness of these methods remains a challenge. In this work, we propose a mechanistic kinetic model-based framework to test and optimize machine learning for iterative combinatorial pathway optimization. Using this framework, we show that gradient boosting and random forest models outperform the other tested methods in the low-data regime. We demonstrate that these methods are robust to training set biases and experimental noise. Finally, we introduce an algorithm for recommending new designs using machine learning model predictions. We show that when the number of strains to be built is limited, starting with a large initial DBTL cycle is preferable to building the same number of strains in every cycle.
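A toy sketch of the recommendation step described above (not the paper's framework; the design encoding and measurements are invented for illustration): train a random forest on one cycle's strains and rank untested designs by predicted production.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical design space: expression level (0=low, 1=mid, 2=high) for 4 pathway genes.
candidates = np.array(list(itertools.product([0, 1, 2], repeat=4)))

# Pretend we built and measured 20 random strains in the first DBTL cycle.
built_idx = rng.choice(len(candidates), size=20, replace=False)
X_built = candidates[built_idx]
y_flux = rng.normal(loc=X_built.sum(axis=1), scale=0.5)  # fake production measurements

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_built, y_flux)

# Recommend the 10 untested designs with the highest predicted flux for the next cycle.
untested = np.setdiff1d(np.arange(len(candidates)), built_idx)
preds = model.predict(candidates[untested])
recommended = candidates[untested[np.argsort(preds)[::-1][:10]]]
print(recommended)
```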
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "additional training dataset" for the DCASE 2024 Challenge Task 2.
The data consists of the normal/anomalous operating sounds of nine types of real/toy machines. Each recording is a single-channel audio that includes both a machine's operating sound and environmental noise. The duration of recordings varies from 6 to 10 seconds. The following nine types of real/toy machines are used in this task:
3DPrinter
AirCompressor
BrushlessMotor
HairDryer
HoveringDrone
RoboticArm
Scanner
ToothBrush
ToyCircuit
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up from DCASE 2020 Task 2 to DCASE 2023 Task 2. The task this year is to develop an ASD system that meets the following five requirements.
Train a model using only normal sound (unsupervised learning scenario): Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.
Detect anomalies regardless of domain shifts (domain generalization task): In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2 and DCASE 2023 Task 2.
Train a model for a completely new machine type: For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement is the same as in DCASE 2023 Task 2.
Train a model using a limited number of machines from its machine type: While sounds from multiple machines of the same machine type can be used to enhance the detection performance, it is often the case that only a limited number of machines are available for a machine type. In such a case, the system should be able to train models using a few machines from a machine type. This requirement is the same as in DCASE 2023 Task 2.
Train a model both with or without attribute information: While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.
The last requirement is newly introduced in DCASE 2024 Task 2.
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."
"Machine type" indicates the type of machine, which in the additional training dataset is one of nine: 3D-printer, air compressor, brushless motor, hair dryer, hovering drone, robotic arm, document scanner (scanner), toothbrush, and Toy circuit.
A section is defined as a subset of the dataset for calculating performance metrics.
The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.
Dataset
This dataset consists of nine machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as an "evaluation dataset" for the DCASE 2024 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.
File names and attribute csv files
File names and attribute csv files provide reference labels for each clip. The given reference labels for each training clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by the respective file names. Note that for machine types whose attribute information is hidden, the attribute information in the file names is labeled only as "noAttributes". Attribute csv files are provided for easy access to the attributes that cause domain shifts. In these files, the file names, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
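For example, a minimal Python sketch (using the standard csv module; the file name is a placeholder) that parses rows of this format into a per-file attribute dictionary:

```python
import csv

def read_attribute_csv(path):
    attributes = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            filename, rest = row[0], row[1:]
            # Pair up (domain shift parameter, domain shift value) columns;
            # hidden-attribute machine types leave these columns blank.
            pairs = {rest[i]: rest[i + 1] for i in range(0, len(rest) - 1, 2) if rest[i]}
            attributes[filename] = pairs
    return attributes

attrs = read_attribute_csv("attributes_00.csv")  # placeholder path
```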
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Directory structure
/eval_data
Baseline system
The baseline system is available on the GitHub repository. The baseline systems provide a simple entry-level approach that gives a reasonable performance on the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
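As a rough illustration of the general approach such entry-level baselines take (this is not the official baseline code; the tiny network and the placeholder log-mel frames are invented), an autoencoder-style detector trained only on normal sounds can score a clip by its reconstruction error:

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Minimal autoencoder over per-frame spectral features."""
    def __init__(self, n_features=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.decoder = nn.Linear(32, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyAE()
frames = torch.randn(200, 128)   # placeholder log-mel frames of one clip
recon = model(frames)
# Mean reconstruction error as the anomaly score; a threshold decides normal/anomalous.
anomaly_score = torch.mean((frames - recon) ** 2).item()
print(anomaly_score)
```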
Condition of use
This dataset was created jointly by Hitachi, Ltd., NTT Corporation and STMicroelectronics and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
Contact
If there is any problem, please contact us:
Tomoya Nishida, tomoya.nishida.ax@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Noboru Harada, noboru@ieee.org
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
This is the dataset for the Style Change Detection task of PAN 2022.
Task
The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: If multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question is among the most difficult and most interesting challenges in author identification: Style change detection is the only means to detect plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support.
Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), determining the actual number of authors within a document (2019), detecting whether there was a style change between two consecutive paragraphs (2020, 2021), and locating where the actual style changes occurred (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to entice novices and experts alike:
Given a document, we ask participants to solve the following three tasks:
[Task1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors’ texts on the paragraph-level),
[Task2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document)
[Task3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.
All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.
Data
To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3).
Each dataset is split into three parts:
training set: Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
validation set: Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
test set: Contains 15% of the whole dataset, no ground truth data is given. This set is used for evaluation (see later).
You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.
Input Format
The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively.
For each problem instance X (i.e., each input document), two files are provided:
problem-X.txt contains the actual text, where paragraphs are denoted by \n for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by \n).
truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format. An example file is listed in the following (note that we list keys for the three tasks here):
{ "authors": NUMBER_OF_AUTHORS, "site": SOURCE_SITE, "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3, "paragraph-authors": RESULT_ARRAY_TASK2 }
The result for task 1 (key "changes") is represented as an array, holding a binary for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured similarly to the results array for task 1; however, for task 3, the changes array holds a binary for each pair of consecutive sentences, and there may be multiple style changes in the document.
An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):
{ "changes": [0,0,1,...], "paragraph-authors": [1,1,1,2,...] }
Output Format
To evaluate the solutions for the tasks, the results have to be stored in a single file for each of the input documents and each of the datasets. Please note that we require a solution file to be generated for each input problem for each dataset. The data structure during the evaluation phase will be similar to that in the training phase, with the exception that the ground truth files are missing.
For each given problem problem-X.txt, your software should output the missing solution file solution-problem-X.json, containing a JSON object holding the solution to the respective task. The solution for tasks 1 and 3 is an array containing a binary value for each pair of consecutive paragraphs (task 1) or sentences (task 3). For task 2, the solution is an array containing the order of authors contained in the document (as in the truth files).
An example solution file for tasks 1 and 3 is featured in the following (note again that for task 1, changes are captured on the paragraph level, whereas for task 3, changes are captured on the sentence level):
{ "changes": [0,0,1,0,0,...] }
For task 2, the solution file looks as follows:
{ "paragraph-authors": [1,1,2,2,3,2,...] }
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have just performed a train, test, and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
After fast mean shift (FMS) clustering, the whole research area was divided into 10 subareas, and new samples that characterize the geographical features of each subarea were collected through field investigations. Because of our limited human and material resources, it was difficult to conduct extensive sampling in each subarea. To make the most of our limited resources, we needed a reasonable field sampling strategy. For the first two large subareas, we collected 70 field samples each, labeled as the first and second sample sets, which are used to build their own GWR models for extended prediction of unobserved points in each area, i.e., local extension prediction. The remaining 8 small subareas received moderate numbers of samples according to their size: if a subarea contains more than 5,000 raster points, 16 samples were collected from it; otherwise, 12 samples were taken. In this way, a total of 112 samples were put together as the third sample set, and the third GWR model was constructed to achieve global extension prediction over the 8 subareas. In addition, the three sample sets were each divided into a training set and a test set. For the first two sample sets, the ratio of training to test samples is 5:2, i.e., each training set contains 50 samples and each test set has 20 samples. Because the third sample set is composed of samples from 8 subareas, we divided the samples of each subarea into training and test sets at a ratio of 3:1. In other words, the numbers of training samples from the third to the tenth subarea are 12, 9, 9, 12, 9, 12, 12, and 9, respectively (84 training samples in total), and the numbers of test samples from the eight subareas are 4, 3, 3, 4, 3, 4, 4, and 3, respectively (28 test samples in total).
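A small sketch reproducing the sampling arithmetic described above (the raster-point counts per subarea are invented, chosen only to match the stated allocation):

```python
# Hypothetical raster-point counts for subareas 3..10.
subarea_points = {3: 6200, 4: 3100, 5: 2800, 6: 5400, 7: 4100, 8: 7000, 9: 5600, 10: 3900}

# 16 samples if a subarea has more than 5,000 raster points, otherwise 12.
samples = {k: (16 if n > 5000 else 12) for k, n in subarea_points.items()}

# Per-subarea 3:1 train/test split.
train = {k: round(s * 3 / 4) for k, s in samples.items()}
test = {k: s - train[k] for k, s in samples.items()}

print(sum(samples.values()), sum(train.values()), sum(test.values()))  # 112 84 28
```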
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains the complete intersection Calabi-Yau four-folds (CICY4) configuration matrices and their four Hodge numbers, designed for the problem of machine learning the Hodge numbers using the configuration matrices as inputs to a neural network model.
The original data for CICY4 is from the paper "Topological Invariants and Fibration Structure of Complete Intersection Calabi-Yau Four-Folds", arXiv:1405.2073, and can be downloaded in either text or Mathematica format from: https://www-thphys.physics.ox.ac.uk/projects/CalabiYau/Cicy4folds/index.html
The full CICY4 data included with this dataset in npy format (conf.npy, hodge.npy, direct.npy) is created by running the script 'create_data.py' from https://github.com/robin-schneider/cicy-fourfolds. Given this full data, the following two additional datasets at 72% and 80% training ratios were created.
At the 72% data split:
- The train dataset consists of the files (conf_Xtrain.npy, hodge_ytrain.npy)
- The validation dataset consists of the files (conf_Xvalid.npy, hodge_yvalid.npy)
- The test dataset consists of the files (conf_Xtest.npy, hodge_ytest.npy)

At the 80% data split, the 3 datasets are:
- (conf_Xtrain_80.npy, hodge_ytrain_80.npy)
- (conf_Xvalid.npy, hodge_yvalid.npy)
- (conf_Xtest_80.npy, hodge_ytest_80.npy)

The new train and test sets were formed from the old ones: the old test set is divided into 2 parts with the ratio (0.6, 0.4). The 0.6-partition becomes the new test set, and the 0.4-partition is merged with the old train set to form the new train set.
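A minimal numpy sketch of this reconstruction, assuming a contiguous 60/40 partition of the old test set (the actual partition may have been randomized):

```python
import numpy as np

conf_test = np.load("conf_Xtest.npy")    # old test inputs
hodge_test = np.load("hodge_ytest.npy")  # old test targets
conf_train = np.load("conf_Xtrain.npy")
hodge_train = np.load("hodge_ytrain.npy")

# First 60% of the old test set becomes the new (80%-split) test set.
n_new_test = int(0.6 * len(conf_test))
conf_test_80, hodge_test_80 = conf_test[:n_new_test], hodge_test[:n_new_test]

# Remaining 40% is merged into the train set to form the new train set.
conf_train_80 = np.concatenate([conf_train, conf_test[n_new_test:]])
hodge_train_80 = np.concatenate([hodge_train, hodge_test[n_new_test:]])
```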
Trained neural network models and their training/validation losses:
- 12 models were trained on the 72% dataset and their checkpoints are stored in the folder 'trained_models'. The 12 csv files containing the train+validation losses of these models are stored in the folder 'train-validation-losses'.
- At the 80% data split, the top 3 performing models trained on the 72% dataset were retrained and their checkpoints are stored in 'trained_models_80pc_split', together with the 3 csv files containing the loss values during the training phase.
Inference notebook: The inference notebook using this dataset is https://www.kaggle.com/code/lorresprz/cicy4-training-results-inference-all-models
Publication: This dataset was created for the work: Deep Learning Calabi-Yau four folds with hybrid and recurrent neural network architectures, https://arxiv.org/abs/2405.17406
Description
This dataset is the "development dataset" for the DCASE 2023 Challenge Task 2 "First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring".
The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
ToyCar
ToyTrain
Fan
Gearbox
Bearing
Slide rail
Valve
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up from DCASE 2020 Task 2 to DCASE 2022 Task 2. The task this year is to develop an ASD system that meets the following four requirements.
Train a model using only normal sound (unsupervised learning scenario): Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.
Detect anomalies regardless of domain shifts (domain generalization task): In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2.
Train a model for a completely new machine type: For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning.
Train a model using only one machine from its machine type: While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that sound data from only one machine are available for a machine type. In such a case, the system should be able to train models using only one machine from a machine type.
The last two requirements are newly introduced in DCASE 2023 Task 2 as the "first-shot problem".
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."
"Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.
A section is defined as a subset of the dataset for calculating performance metrics.
The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
Attributes are parameters that define states of machines or types of noise.
Dataset
This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.
File names and attribute csv files
File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by the respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by the respective file names. Attribute csv files are provided for easy access to the attributes that cause domain shifts. In these files, the file names, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Directory structure
/dev_data
(Note: "slider" in the directory structure means "slide rail".)

Baseline system
The baseline system is available on the GitHub repository dcase2023_task2_baseline_ae. The baseline systems provide a simple entry-level approach that gives a reasonable performance on the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Condition of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
If you use this dataset, please cite all the following papers. We will publish a paper on the description of the DCASE 2023 Task 2, so please make sure to cite that paper, too.
Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda. First-shot anomaly detection for machine condition monitoring: A domain generalization baseline. In arXiv e-prints: 2303.00455, 2023. [URL]
Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 31-35. Nancy, France, November 2022. [URL]
Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5. Barcelona, Spain, November 2021. [URL]
Contact
If there is any problem, please contact us:
Kota Dohi, kota.dohi.gr@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Noboru Harada, noboru@ieee.org
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
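A minimal loading sketch, assuming the folder and file names above and a hypothetical label column named "label":

```python
import pandas as pd

train = pd.read_csv("Training Data/train_data.csv")
valid = pd.read_csv("Validation Data/validation_data.csv")
test = pd.read_csv("Test Data/test_data.csv")

# "label" is a hypothetical column name; the remaining columns are the features.
X_train, y_train = train.drop(columns=["label"]), train["label"]
```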
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, together with tools like: Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source Huggingface Hub: link
About this dataset
This is a large dataset for binary sentiment classification containing a substantial amount of data compared to previous benchmark datasets. Provided are 25,000 highly polar movie reviews for training and 25,000 for testing. There is also additional unlabeled data available for use. The data fields are consistent among all splits of the dataset.
How to use the dataset In order to use this dataset, you will need to first download the IMDB Large Movie Review Dataset. Once you have downloaded the dataset, you can either use it in its original form or split it into training and testing sets. To split the dataset, you will need to create a new file called unsupervised.csv and copy the text column from train.csv into it. You can then split unsupervised.csv into two files: train_unsupervised.csv and test_unsupervised.csv.
Once you have either the original dataset or the training and testing sets, you can begin using them for binary sentiment classification. In order to do this, you will need to use a machine learning algorithm that is capable of performing binary classification, such as logistic regression or support vector machines. Once you have trained your model on the training set, you can then evaluate its performance on the test set by predicting the labels of the reviews in test_unsupervised.csv
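A hedged sketch of this workflow using TF-IDF features and logistic regression (the column names "text" and "label" are assumptions about the CSV layout):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")  # assumed columns: "text", "label"
test = pd.read_csv("test.csv")

# Fit TF-IDF on the training reviews only, then reuse it for the test reviews.
vec = TfidfVectorizer(max_features=50_000)
X_train = vec.fit_transform(train["text"])
X_test = vec.transform(test["text"])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print((clf.predict(X_test) == test["label"]).mean())  # test-set accuracy
```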
Research Ideas
- This dataset can be used to train a binary sentiment classification model.
- This dataset can be used to train a model to classify movie reviews into positive and negative sentiment categories.
- This dataset can be used to build a large movie review database for research purposes.
CC0
Original Data Source: IMDB Movie Reviews (Binary Sentiment)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many comparisons of statistical regression and machine learning algorithms to build clinical predictive models use inadequate methods to build regression models and do not have proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh using data from the NIRUDAK study derivation and validation cohorts. Proportional Odds Logistic Regression (POLR), penalized ordinal regression (RIDGE), classification trees (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set, and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with a particularly large reduction for RF and RIDGE. POLR had the best performance in the test dataset and was also most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, need to be prospectively validated on external test sets if possible, because internal validation may give an overly optimistic picture of model performance. Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
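For illustration, a proportional odds logistic regression (POLR) of the kind compared here can be fit with statsmodels; the data below are synthetic and the predictor names hypothetical (this is not the NIRUDAK analysis code):

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = pd.DataFrame({"pulse": rng.normal(100, 15, 300),
                  "skin_turgor": rng.integers(0, 3, 300)})
# Ordinal outcome: 0 = no, 1 = some, 2 = severe dehydration (synthetic labels).
y = pd.Series(pd.Categorical(rng.integers(0, 3, 300), ordered=True))

polr = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(polr.predict(X.iloc[:5]))  # predicted probability of each outcome class
```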
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
SQuAD-it is the perfect resource for Italian language learners and Natural Language Processing (NLP) experts alike! This dataset includes a collection of semi-automatically translated question-answer pairs from the SQuAD dataset, giving you an expansive knowledge base in your chosen language. With this robust set of Italian text made available through the SQuAD-it dataset, you can access both a training set (in train.csv) and a testing set (in test.csv) to evaluate and power up your answers! Unlock the wealth of insight that lies within this insightful collection today and boost your NLP experience with SQuAD-it: Italian QA at Your Fingertips!
This guide will help you get the most out of the SQuAD-it dataset. The SQuAD-it dataset is derived from the popular English language SQuAD (Stanford Question Answering Dataset) and it consists of more than 160,000 open question-answer pairs in Italian. This dataset can be used to power up Natural Language Processing models with a focused Italian language knowledge base.
Get Familiar with the Data: First, get familiar with the data and its format by understanding what columns are included in the train/test CSV files and their purpose. Each row contains a context and an answer related to it, which gives you insight into how NLP models can be trained on this data set.
Consider Your Use Case: Think about how your project will benefit from using this dataset – is it to build an Italian language Q&A system, or just extract useful facts from them? Being clear on your goals will help guide which features you’ll need to focus on while exploiting this data set for maximum benefit.
Brainstorm Possible Models: There are many different NLP algorithms that could be used with this dataset, such as text summarization techniques or information extraction approaches based on supervised machine learning, rule-based parsing, or statistical models for text categorization/transformation tasks like syntactic parsing or utterance detection. Establishing what kind of model you want to create before beginning any experimentation may save time down the line when tuning parameters for performance optimization or making improvements based on prior performance issues caused by decisions made earlier in the development phase.
Create Meaningful Features: Carefully design features that capture the important meaning behind each question-answer pair in order to capture salient points of discussion between two speakers, which can then be used later as inputs to various machine learning models while training those systems properly without introducing any bias. Features should describe both sides' meanings without assigning any preconceived notions of correct answers. Ultimately, reasonable feature representations should provide meaningful metrics associated with words beyond just counts, like word embeddings, topic modeling, etc.
- Develop Your Model Pipelines & Systems: Utilize libraries like scikit-learn, TensorFlow, Keras, PyTorch, and other popular tools as needed depending upon the chosen use case, while not blindly picking one over another; select wisely and know why certain libraries were picked, including the potential consequences of being unprepared during development stages, e.g., forgetting about regularization hyperparameters.
- Developing conversational AI systems that are capable of understanding questions posed in Italian and providing relevant answers to those questions.
- Developing Machine Learning models that can identify complex topics discussed in Italian text corpora, so as to facilitate more efficient searching for content such as news articles or blog posts in Italian language.
- Creating automatic summarization algorithms using the question-answer pairs from the SQuAD-it dataset, which can then be used to generate overviews of lengthy texts written in Italian with minimal human assistance.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See O...
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is the development set of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALM directly utilizes neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minima. Experiment results demonstrate that co-training on the Codecfake dataset and vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
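For reference, a generic sketch of the usual EER computation for such detectors (not the paper's code; the scores and labels are placeholders): the EER is the operating point where the false-acceptance and false-rejection rates meet.

```python
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])                  # 1 = fake, 0 = real (placeholder)
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])  # placeholder detector scores

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))   # point where the two error rates cross
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer:.3%}")
```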
Due to platform restrictions on the size of zenodo repositories, we have divided the Codecfake dataset into various subsets as shown in the table below:
Codecfake dataset | description | link |
training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z06 | https://zenodo.org/records/11171708 |
training set (part 2 of 3) | train_split.z07 - train_split.z14 | https://zenodo.org/records/11171720 |
training set (part 3 of 3) | train_split.z15 - train_split.z19 | https://zenodo.org/records/11171724 |
development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/11169872 |
test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/11169781 |
test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029 |
The source code of the countermeasure and pre-trained model are available on GitHub https://github.com/xieyuankun/Codecfake.
The Codecfake dataset and pre-trained model are licensed under the CC BY-NC-ND 4.0 license.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software
This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.
GitHub: github.com/ChristianBirchler/sdc-scissor
This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.
Usage
Demo
Installation
The tool can either be run with Docker or locally using Poetry.
When running the simulations a working installation of BeamNG.research is required. Additionally, this simulation cannot be run in a Docker container but must run locally.
To install the application use one of the following approaches:
docker build --tag sdc-scissor .
poetry install
Using the Tool
The tool can be used with the following two commands:
docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
poetry run python sdc-scissor.py [COMMAND] [OPTIONS]
There are multiple commands to use. To simplify the documentation, only the commands and their options are described.
generate-tests --out-path /path/to/store/tests
label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
evaluate-models --dataset /path/to/train/set --save
split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib
The possible parameters are always documented with --help.
Linting
The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:
poetry run flake8 .
poetry run pylint **/*.py
License
The software we developed is distributed under the GNU GPL license. See the LICENSE.md file.
Contacts
Christian Birchler - Zurich University of Applied Sciences (ZHAW), Switzerland - birc@zhaw.ch
Nicolas Ganz - Zurich University of Applied Sciences (ZHAW), Switzerland - gann@zhaw.ch
Sajad Khatiri - Zurich University of Applied Sciences (ZHAW), Switzerland - mazr@zhaw.ch
Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de
Dr. Sebastiano Panichella - Zurich University of Applied Sciences (ZHAW), Switzerland - panc@zhaw.ch
References
If you use this tool in your research, please cite the following papers:
@INPROCEEDINGS{Birchler2022,
  author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
  booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
  year={2022},
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.

Licensing requirements
ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

Using the model
The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

Input
The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification. Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).

Output
The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):
0 Background
5 Trees / High-vegetation

Applicable geographies
The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

Training dataset - Wellington City
Testing dataset - Tawa City
Validation/Evaluation dataset - Christchurch City

Model architecture
This model uses the PointCNN model architecture implemented in ArcGIS API for Python.

Accuracy metrics
The table below summarizes the accuracy of the predictions on the validation dataset.

| Class | Precision | Recall | F1-score |
| Never Classified | 0.991200 | 0.975404 | 0.983239 |
| High Vegetation | 0.933569 | 0.975559 | 0.954102 |

Training data
This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:

| Characteristic | Value |
| X, Y, and Z linear unit | Meter |
| Z range | -121.69 m to 26.84 m |
| Number of Returns | 1 to 5 |
| Intensity | 16 to 65520 |
| Point spacing | 0.2 ± 0.1 |
| Scan angle | -15 to +15 |
| Maximum points per block | 8192 |
| Block size | 20 Meters |
| Class structure | [0, 5] |

Sample results
The model was used to classify the Christchurch city dataset with a density of 5 pts/m². The model's performance is directly proportional to the dataset's point density and to the exclusion of noise from the point clouds. To learn how to use this model, see this story.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2 s models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit loop to train and the automation of some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples with 16 sequenced images in 3-channel resized to 224 × 224 pixels and normalized from 0 to 1. Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of the model stability. However, we adopt a hybrid split taking the first 50,000 to apply the 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models in the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data). We develop three distinct models to evaluate the impact of oversampling magnetogram sequences through the dataset. The first model, Solar Flare MViT (SF MViT), has trained only with the original data from our base dataset without using oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both training and validation sets. We also trained a model oversampling the entire dataset. We called it the "SF_MViT_oTV Test" to verify how resampling or adopting a test set with unreal data may bias the results positively. GitHub version The .zip hosted here contains all files from the project, including the checkpoint and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly on https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and the output and checkpoint files. Most code files hosted here also contain comments on the Portuguese language, which are being updated to English in the GitHub version. Folders Structure In the Root directory of the project, we have two folders: magnetogram_jpg: holds the source images provided by Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in hmi.sharp_720s dataset), making new samples available every 12 minutes. However, the images from this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or most solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA). Seq_Magnetogram: contains the references for source images with the corresponding labels in the next 24 h. and 48 h. 
Most of the papers in our literature survey split the original dataset chronologically, and some authors also apply k-fold cross-validation to assess model stability. We adopt a hybrid split: the first 50,000 samples undergo 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation in each fold. We can then evaluate performance and stability through the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples in chronological order (simulating unknown data). We developed three distinct models to evaluate the impact of oversampling magnetogram sequences through the dataset (see the sketch after this paragraph). The first model, Solar Flare MViT (SF_MViT), was trained only on the original data from our base dataset, without oversampling. In the second model, SF MViT over Train (SF_MViT_oT), we apply oversampling only to the training data, keeping the original validation set. In the third model, SF MViT over Train and Validation (SF_MViT_oTV), we apply oversampling to both the training and validation sets. We also trained a model that oversamples the entire dataset, called "SF_MViT_oTV_Test", to verify how resampling the test set (i.e., adopting a test set with unreal data) may positively bias the results.
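A minimal sketch of this hybrid split, assuming contiguous scikit-learn KFold folds over the first 50,000 samples; the comments mark where the oT/oTV oversampling variants would differ (the exact resampling ratio is an assumption, not stated in the entry):

import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(59_834)                        # full dataset, chronological order
known, test = indices[:50_000], indices[50_000:]   # last 9,834 samples held out as test

# 5-fold CV over the known data: 40,000 train / 10,000 validation per fold.
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(known)):
    assert len(train_idx) == 40_000 and len(val_idx) == 10_000
    # SF_MViT_oT:  oversample flare-positive sequences in train_idx only.
    # SF_MViT_oTV: oversample val_idx as well; SF_MViT_oTV_Test also
    # resamples the test set, which is what the authors warn biases results.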
GitHub version: The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. A clean version is hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.
Folders structure: In the root directory of the project, there are two folders:
- magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; the images in this dataset, however, were collected every 96 minutes. Each image is a ready-made magnetogram snippet of one or more solar active regions (ARs). Note that the magnetograms cropped by SHARP can contain one or more solar ARs as classified by the National Oceanic and Atmospheric Administration (NOAA).
- Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.
- M24/M48: both have the following sub-folder structure: Seqs16, SF_MViT, SF_MViT_oT, SF_MViT_oTV, SF_MViT_oTV_Test.
There are also two files in the root:
- inst_packages.sh: installs the packages and dependencies needed to run the models.
- download_MViTS.py: downloads the pre-trained MViTv2-S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold cria_seqs.py, which creates the sequences, and test_pandas.py, which verifies the header info and checks the number of samples per label in the text files. All text files with the "Seq16" prefix inside the Seqs16 folder were created by the cria_seqs.py code from the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file lists a sequence of images pointing into the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MViT... and MViT_S...), error (err_MViT...), and checkpoint (sample-FLARE...ckpt) files. Running the training code generates the output, error, and checkpoint files. There is also a folder called lightning_logs that stores the logs of the trained models.
Naming pattern for the files:
- magnetogram_jpg follows the format "hmi.sharp_720s.<region>.<capture>.magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s.<region>.<start>.to.<end>", where:
  - hmi: the instrument that captured the image;
  - sharp_720s: the source data series of SDO/HMI;
  - <region>: the identification of the SHARP region, which can contain one or more solar ARs classified by NOAA;
  - <capture>: the date-time the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds);
  - <start>: the date-time when the sequence starts, in the same format as <capture>;
  - <end>: the date-time when the sequence ends, in the same format as <capture>.
- Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "<seq>flare_Mclass_<window>_<split>.txt<over>", where:
  - <seq>: "Seq16_" if the file refers to sequences, or empty if it refers directly to images;
  - <window>: "24h" or "48h";
  - <split>: "TrainVal" or "Test" ("TrainVal" refers to the Train/Val split);
  - <over>: empty, or "_over" appended after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.
- Model training codes follow the format "SF_MViT<over>_M+_<window>_<splits><gpus>.py", where:
  - <over>: empty, "_oT" (over Train), "_oTV" (over Train and Val), or "_oTV_Test" (over Train, Val, and Test);
  - <window>: "24h" or "48h";
  - <splits>: "oneSplit" to run a specific split or "allSplits" to run all splits;
  - <gpus>: empty (default, 1 GPU) or "2gpu" for 2-GPU systems.
- Job submission files follow the format "jobMViT_<queue>", where <queue> indicates the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
- Temporary inputs follow the format "Seq16_flare_Mclass_<set>.txt<over>", where:
  - <set>: "train" or "val";
  - <over>: empty, or "_over" appended after the extension (...txt_over), meaning the reference was over-sampled by a training model.
- Outputs follow the format "saida_MViT_Adam_10-7<fold>", where <fold> is "k0" to "k4", identifying the split of the output, or empty if the output covers all splits.
- Error files follow the format "err_MViT_Adam_10-7<fold>", where <fold> is "k0" to "k4", identifying the split of the error log, or empty if it covers all splits.
- Checkpoint files follow the format "sample-FLARE_MViT_S_10-7-epoch=<epoch>-valid_loss=<loss>-Wloss_k=<fold>.ckpt", where:
  - <epoch>: the epoch number of the checkpoint;
  - <loss>: the corresponding validation loss;
  - <fold>: 0 to 4.
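These checkpoint names map naturally onto PyTorch Lightning's ModelCheckpoint filename template, which expands "{epoch}" and "{valid_loss}" to "epoch=<n>" and "valid_loss=<v>". A sketch follows, assuming the LightningModule logs a metric named "valid_loss"; the exact callback configuration in the released code may differ:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

k = 0  # fold index, 0 to 4
# Lightning interpolates the braces into "epoch=..." and "valid_loss=...",
# reproducing the naming pattern described above.
ckpt_cb = ModelCheckpoint(
    monitor="valid_loss",
    filename="sample-FLARE_MViT_S_10-7-{epoch}-{valid_loss}-Wloss_k=" + str(k),
)
trainer = pl.Trainer(callbacks=[ckpt_cb])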
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was taken from the source link, preprocessed, and separated into competition format. The labels for the test data are provided in the form of a function.
Dataset Description:
This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.
The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.
The dataset is divided into three parts:
- Training Set
- Validation Set
- Test Set
Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.
This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include:
- Automated CBC Analysis: developing algorithms to automatically analyze CBC images and provide diagnostic insights.
- Blood Cell Classification: training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders.
- Educational Purposes: using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.
This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.