100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    text/x-python (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning does. In addition, the contribution to bias from data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
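
    For orientation, the contrast between a biased and an unbiased validation protocol can be sketched with scikit-learn on pure noise data, where the true accuracy is 50%. This is a minimal illustration in the spirit of the simulations described above, not the authors' code.

    # Minimal sketch (not the authors' pipeline): naive K-fold CV with feature
    # selection done on the full dataset vs. nested CV on random, noise-only data.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 1000))    # 40 samples, 1000 noise features
    y = rng.integers(0, 2, size=40)    # random labels -> true accuracy ~50%

    # Biased protocol: feature selection sees the test folds before CV is run.
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
    biased = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=5).mean()

    # Nested CV: selection and hyper-parameter tuning stay inside the training folds.
    pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                     ("clf", SVC(kernel="linear"))])
    inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3)
    nested = cross_val_score(inner, X, y, cv=5).mean()
    print(f"biased K-fold estimate: {biased:.2f}, nested-CV estimate: {nested:.2f}")
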

  2. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • gimi9.com
    • +1 more
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed to provide large-scale datasets to TREC and to create a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  3. File-Test Links in Regression Testing

    • kaggle.com
    zip
    Updated Feb 5, 2021
    Cite
    João Lousada (2021). File-Test Links in Regression Testing [Dataset]. https://www.kaggle.com/datasets/joolousada/filetest-links-in-regression-testing
    Explore at:
    zip (707,890 bytes), available download formats
    Dataset updated
    Feb 5, 2021
    Authors
    João Lousada
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    In modern software engineering, Continuous Integration (CI) has become an indispensable step towards systematically managing the life cycles of software development. Large companies struggle to keep the pipeline updated and operational in useful time, due to the large number of changes and feature additions that build on top of each other and involve several developers working on different platforms. Such software changes always carry a strong testing component.

    In software versioning systems, e.g. GitHub or SVN, changes to a repository are made by committing a new version to the system. To ensure the new version functions properly, tests need to be applied.

    As teams and projects grow, exhaustive testing quickly becomes prohibitive, as more and more tests are needed to cover every piece of code; it therefore becomes essential to select the most relevant tests early, without compromising software quality.

    We believe that this selection can be made by establishing a relationship between modified files and tests. Hence, when a new commit arrives with a certain number of modified files, we apply the relevant tests early on, maximising early detection of issues.

    Content

    The dataset is composed of 3 columns: the commit ID, the list of modified files and the list of tests that were affected in that commit. The data was collected over a period of 4 years from a company in the financial sector with ~90 developers.

    Inspiration

    Some interesting questions (tasks) can be explored with this dataset (a minimal loading sketch follows the list):

    • How can we establish meaningful file-test links?
    • What data cleaning steps lead to the best result?
    • How to deal with increasing data dimensionality?
    • How to deal with deprecated/unused files and tests?
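
    As a starting point, the questions above can be approached by mining co-occurrence counts between modified files and affected tests. The sketch below assumes a CSV export with columns commit_id, modified_files and affected_tests holding semicolon-separated lists; those names and separators are assumptions about the layout, not the documented schema.

    # Hypothetical sketch: count how often each (file, test) pair co-occurs in a commit.
    # Column names, the file name and the ';' separator are assumptions.
    from collections import Counter
    import csv

    links = Counter()
    with open("filetest_links.csv", newline="") as fh:   # hypothetical file name
        for row in csv.DictReader(fh):
            files = row["modified_files"].split(";")
            tests = row["affected_tests"].split(";")
            for f in files:
                for t in tests:
                    links[(f.strip(), t.strip())] += 1

    # Rank candidate tests for a newly modified file by historical co-occurrence.
    def suggest_tests(modified_file, top_n=5):
        scores = Counter({t: c for (f, t), c in links.items() if f == modified_file})
        return [t for t, _ in scores.most_common(top_n)]
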
  4. Learning from Total Failure: Why Do Impossible Tests Boost Learning? 2017-2021

    • datacatalogue.ukdataservice.ac.uk
    Updated Aug 12, 2021
    Cite
    Hollins, T, University of Plymouth; Mitchell, C, University of Plymouth; Wills, A, University of Plymouth; Seabrooke, T, University of Southampton (2021). Learning from Total Failure: Why Do Impossible Tests Boost Learning? 2017-2021 [Dataset]. http://doi.org/10.5255/UKDA-SN-855137
    Explore at:
    Dataset updated
    Aug 12, 2021
    Authors
    Hollins, T, University of Plymouth; Mitchell, C, University of Plymouth; Wills, A, University of Plymouth; Seabrooke, T, University of Southampton
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    The project concerns the effect of an unsuccessful pre-test (effectively a guess), on the subsequent learning of information, relative to studying that information with no-initial guess. The focus of the work has been the development of a theoretical understanding of when pre-testing is or is not beneficial to subsequent learning, with a view to developing applications of the technique to educational practice. Consequently, each experiment compared the effects of studying versus guessing (and receiving feedback to study) on subsequent memory for the material, with each experiment varying in other aspects (e.g. the nature of the material, or nature of the final test). A total of 26 experimental studies have been completed. Thirteen of these experiments have been published in four outputs and for each of these, the relevant data are published in Open Science Framework (OSF) repositories, as detailed below. A further 6 studies form parts of papers that are either under review, or have been reported at conferences (or both). The remaining studies have not yet been output, but are included in manuscripts in preparation. OSF repositories for the unpublished work will be made available upon acceptance for publication.

    All data were collected from volunteer participants who were either undergraduates participating for partial course credit, or members of the public who received a small financial payment. Prior to March 2019 all work was completed in person at the University of Plymouth, but thereafter we moved to online testing using Prolific due to the impact of the global pandemic.

    Output 1 examines the impact of pre-testing on different aspects of the event, tested through different criterion memory tests across 5 experiments. The main conclusion from this output is that pre-testing boosts availability of targets (measured through recognition), but not cue-target associations (measured through recall, or associative recognition). Output 2 tested two potential accounts of the pre-testing effect: that guessing increases a person’s motivation to know the answer (before it arrives), or that the discrepancy between the guess and the actual answer induces surprise which drives learning. We tested these ideas in two experiments, and found that pre-testing increases self-reported motivation to learn a fact before it is revealed, but not surprise in the answer after it is revealed. Output 3 demonstrated that the differential pattern for recall and recognition reported in Output 1 also applies to learning of related and unrelated word pairs, and so challenges previously accepted theories of the pre-testing effect. Output 4 explored whether learning from a pre-test was related to the magnitude of the error made. Across 3 experiments, participants guessed the meaning of foreign-language words that came from one of two semantic categories. Contrary to some popular learning theories, a greater pre-testing effect was found if the initial error was closer to the target answer, rather than further away. Output 5 examines the impact of pre-testing upon memory for incidental details of the presented answer. Across two experiments we showed that while pre-testing reliably boosted memory for what the answer was, it had no impact on memory for what colour the answer was presented in (Experiment 1), or when it was presented (Experiment 2). Output 6 was a conference presentation of a subset of two experiments from a larger set of 7 experiments that are in preparation for a paper submission. Collectively these studies sought to explain the discrepancy previously seen between the effects of pre-testing on recognition and recall (Outputs 1 and 3). These experiments demonstrate that the pre-testing effect is reliably observed when tested by recognition, but the pattern for recall depends upon the degree of similarity between different study items, a factor previously overlooked in the pre-testing literature. Output 7 followed up the findings reported in Output 2, and explored whether the curiosity elicited by pre-testing is specific to the answer being sought, or represents a generalised state such as increased attention or arousal that will boost memory for incidental material. Two experiments demonstrated that the pre-testing effect is highly specific.

    A list of outputs, and the associated OSF repositories where published, is available in the Read-me document.

    In education, a test is usually used to measure learning. However, the last decade has seen an explosion of research demonstrating that tests can also dramatically improve learning: the testing effect. Most recently, a surprising discovery has been made that a test can enhance learning even when it is given before the material has been taught. Hence, when students are tested on completely unfamiliar material (e.g., foreign language vocabulary), and will inevitably get all the test questions wrong, subsequent learning of that topic is enhanced. This effect has very significant implications for educational settings, and we seek to understand why the effect occurs.

    The first demonstrations of the testing effect involved 3 phases. Participants first studied the material (e.g. a text). Next, one group took a test on the material, while a second group simply studied the correct answers. A final test assessed how much learning had taken place. Taking the interim test led to better final performance than restudying the material, and later research showed the effect was further enhanced if initial answers were corrected with feedback. One possible explanation for the testing effect is that after thinking of a (wrong) answer, people are highly motivated to learn the correct answer. This particular explanation suggests that testing might be helpful even before the first encounter with the to-be-studied material, as has recently been observed. For example, if you were asked to guess the meaning of a rare English word such as "roke" before ever being told its true meaning (mist), then you would be especially good at remembering that meaning on a later test. It is this benefit of initial tests prior to learning (known as test-potentiated learning, TPL) that is the focus of the current proposal.

    The Current Project: We will test a number of potential explanations for the effects of initial tests (TPL) in three strands of research. Strands 1 and 2 will use unfamiliar word pairs and face-word pairs. The former are foreign language items (Finnish nouns and their English language meanings), and the latter are unfamiliar faces, and facts associated with those faces (e.g. name/occupation). The Finnish vocabulary is used because it has clear implications for foreign language learning. Also, Finnish words specifically are not similar to English words, which guarantees that the answers to the initial test will be incorrect. Face-name learning has implications for more social and work-place situations. In the final Strand 3, more complex word-based materials (texts and general knowledge) will be used to extend the findings from Strands 1 and 2 to a range of classroom situations. Participants will know nothing about these materials in advance.

    In a prototypical experiment using Finnish vocabulary, all trials will start with the presentation of a Finnish word. In the "test" condition, participants will be asked to guess the meaning of the word before being given the true meaning. A "study" condition, in which no guess is made, will serve as the control. It is expected, given previous research in our laboratory, that guessing will enhance memory for the true meaning. Strand 1 will explore the extent to which initial tests benefit learning precisely because participants make errors, and so they are surprised by the true answer. Strand 2 will look at the extent to which people are more motivated to study, or likely to change their study strategies following a guess.
That is, Strand 1 examines potential "low-level" mechanisms (e.g., error correction) of learning whereas Strand 2 looks at more "high-level" strategic processes that might result from being tested. The experiments in Strand 3 will test the generality of the findings from Strands 1 and 2 to more complex tasks such as general knowledge learning. This strand is designed to broaden the scope of more applied research that might be conducted in the future.

  5. MNIST 2 Digit Classification Dataset

    • kaggle.com
    zip
    Updated Sep 19, 2023
    Cite
    Aman Kumar (2023). MNIST 2 Digit Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amankumar234/mnist-2-digit-classification-dataset/discussion
    Explore at:
    zip (140,169 bytes), available download formats
    Dataset updated
    Sep 19, 2023
    Authors
    Aman Kumar
    Description

    Objective:

    The goal of this dataset is to create a custom dataset for multi-digit recognition tasks by concatenating pairs of digits from the MNIST dataset into single 128x128 pixel images and assigning labels that represent two-digit numbers from '00' to '99'.

    Dataset Features:

    Image (128 x 128 pixel NumPy array): The dataset contains images of size 128 x 128 pixels. Each image is a composition of a pair of MNIST digits. Each digit occupies a 28 x 28 pixel space within the larger 128 x 128 pixel canvas. The digits are randomly placed within the canvas to simulate real-world scenarios.

    Label (Int): The labels represent two-digit numbers ranging from '00' to '99'. These labels are assigned based on the digits present in the image and their order. For example, an image with 7 and 2 as the first and second digits would be labeled '72' (7 * 10 + 2). Leading zeros are added to ensure that all labels are two characters in length.

    Dataset Size:

    Training data: 60,000 data points
    Test data: 10,000 data points

    Data Generation: To create this dataset, you would start with the MNIST dataset, which contains single-digit images of handwritten digits from '0' to '9'. For each data point in the new dataset, you would randomly select two digits from MNIST and place them on a 128 x 128 canvas. The digits are placed at random positions, and their order can also be random. After creating the multi-digit image, you assign a label by concatenating the labels of the individual digits while ensuring they are two characters in length.
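
    A rough sketch of this generation step (not the author's script) is shown below; it fetches MNIST via torchvision and pastes two 28 x 28 digits at random positions on a 128 x 128 canvas, labelling the result as described above.

    # Illustrative sketch: compose two MNIST digits on a 128x128 canvas and
    # label the pair as a two-digit number (first digit * 10 + second digit).
    import numpy as np
    from torchvision import datasets

    mnist = datasets.MNIST(root="./data", train=True, download=True)
    rng = np.random.default_rng(0)

    def make_two_digit_sample():
        canvas = np.zeros((128, 128), dtype=np.uint8)
        idx = rng.integers(0, len(mnist), size=2)
        digits = [np.array(mnist[i][0]) for i in idx]       # 28x28 uint8 images
        labels = [int(mnist[i][1]) for i in idx]
        for img in digits:
            top = rng.integers(0, 128 - 28)
            left = rng.integers(0, 128 - 28)
            patch = canvas[top:top + 28, left:left + 28]
            canvas[top:top + 28, left:left + 28] = np.maximum(patch, img)  # allow overlap
        label = labels[0] * 10 + labels[1]                   # e.g. 7 and 2 -> 72
        return canvas, label
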

    Key Features of the 2-Digit Classification Dataset:

    Multi-Digit Images: This dataset consists of multi-digit images, each containing two handwritten digits. The inclusion of multiple digits in a single image presents a unique and challenging classification task.

    Labeling Complexity: Labels are represented as two-digit numbers, adding complexity to the classification problem. The labels range from '00' to '99,' encompassing a wide variety of possible combinations.

    Diverse Handwriting Styles: The dataset captures diverse handwriting styles, making it suitable for testing the robustness and generalization capabilities of machine learning models.

    128x128 Pixel Images: Images are provided in a high-resolution format of 128x128 pixels, allowing for fine-grained analysis and leveraging the increased image information.

    Large-Scale Training and Test Sets: With 60,000 training data points and 10,000 test data points, this dataset provides ample data for training and evaluating classification models.

    Potential Use Cases:

    Multi-Digit Recognition: The dataset is ideal for developing and evaluating machine learning models that can accurately classify multi-digit sequences, which find applications in reading house numbers, license plates, and more.

    OCR (Optical Character Recognition) Systems: Researchers and developers can use this dataset to train and benchmark OCR systems for recognizing handwritten multi-digit numbers.

    Real-World Document Processing: In scenarios where documents contain multiple handwritten numbers, such as invoices, receipts, and forms, this dataset can be valuable for automating data extraction.

    Address Parsing: It can be used to build systems capable of parsing handwritten addresses and extracting postal codes or other important information.

    Authentication and Security: Multi-digit classification models can contribute to security applications by recognizing handwritten PINs, passwords, or access codes.

    Education and Handwriting Analysis: Educational institutions can use this dataset to create handwriting analysis tools and assess the difficulty of recognizing different handwritten number combinations.

    Benchmarking Deep Learning Models: Data scientists and machine learning practitioners can use this dataset as a benchmark for testing and improving deep learning models' performance in multi-digit classification tasks.

    Data Augmentation: Researchers can employ data augmentation techniques to generate even more training data by introducing variations in digit placement and size.

    Model Explainability: Developing models for interpreting and explaining the reasoning behind classifying specific multi-digit combinations can have applications in AI ethics and accountability.

    Visualizations and Data Exploration: Researchers can use this dataset to explore visualizations and data analysis techniques to gain insights into the characteristics of handwritten multi-digit numbers.

    In summary, the 2-Digit Classification Dataset offers a unique opportunity to work on a challenging multi-digit recognition problem with real-world applications, making it a valuable resource for researchers, developers, and data scientists.

    Note: Creating this dataset would require a considerable amount of preprocessing and image manipulation. ...

  6. Data from: Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
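
    A hedged sketch of the intended workflow is given below: load one training set, hold out the provided validation IDs, and inspect the label balance. The file names, the pair_id and label fields, and the gzipped JSON-lines format are assumptions about the release layout, so adjust them to the files you actually download.

    # Hedged sketch: split a training set into train/validation using a provided
    # list of validation pair IDs. File names and field names are assumptions.
    import gzip
    import json

    import pandas as pd

    with gzip.open("computers_train_small.json.gz", "rt") as fh:     # assumed file name
        pairs = pd.DataFrame([json.loads(line) for line in fh])

    valid_ids = set(pd.read_csv("computers_valid_small.csv")["pair_id"])  # assumed file/column
    valid = pairs[pairs["pair_id"].isin(valid_ids)]
    train = pairs[~pairs["pair_id"].isin(valid_ids)]

    print(len(train), "training pairs,", len(valid), "validation pairs")
    print(train["label"].value_counts())   # 'match' / 'no match' distribution
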

  7. Data and Code for: Moving towards more holistic machine learning-based approaches for classification problems in animal studies

    • figshare.com
    zip
    Updated Sep 11, 2025
    Cite
    Charlotte Christensen; Andre C. Ferreira; Damien Farine (2025). Data and Code for: Moving towards more holistic machine learning-based approaches for classification problems in animal studies [Dataset]. http://doi.org/10.6084/m9.figshare.27221136.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 11, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Charlotte Christensen; Andre C. Ferreira; Damien Farine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine-learning (ML) is revolutionizing field and laboratory studies of animals. However, a challenge when deploying ML for classification tasks is ensuring the models are reliable. Currently, we evaluate models using performance metrics alone (e.g., precision, recall, F1), but these can overlook the ultimate aim, which is not the outputs themselves (e.g. detected species or individual identities, or behaviour) but their incorporation for hypothesis testing. As improving performance metrics has diminishing returns, particularly when data are inherently noisy (as human-labelled, animal-based data often are), researchers are faced with the conundrum of investing more time in maximising metrics versus answering biological questions. This raises the question: how much noise can we accept in ML models? Here, we start by describing an under-reported factor that can cause metrics to underestimate model performance. Specifically, ambiguity between categories or mistakes in labelling test data produces hard ceilings that limit performance metrics. This likely widespread issue means that many models could be performing better than their metrics suggest. Next, we argue and show that imperfect models (e.g. low F1 scores) can still be usable. Using a case study on ML-identified behaviour from vulturine guineafowl accelerometer data, we first propose a simulation framework to evaluate robustness of hypothesis testing using models that make classification errors. Second, we show how to determine the utility of a model by supplementing existing performance metrics with ‘biological validations’. This involves applying ML models to unlabelled data and using the models’ outputs to test hypotheses for which we can anticipate the outcome. Together, we show that effect sizes and expected biological patterns can be detected even when performance metrics are relatively low (e.g., F1: 60-70%). In doing so, we provide a roadmap for validation approaches of ML classification models tailored to research in animal behaviour, and other fields with noisy, biological data.
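
    The logic of such a simulation can be illustrated in a few lines: inject classification errors at a chosen rate into 'true' behavioural labels and check whether a known group difference still survives the hypothesis test. The snippet below is a toy illustration of that idea, not the authors' framework.

    # Toy sketch of the simulation logic: does a real group difference in the
    # proportion of time spent on a behaviour survive classifier noise?
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_obs, error_rate = 1000, 0.3            # 0.3 ~ a classifier well below perfect

    def simulate_individual(p_behaviour):
        true_labels = rng.random(n_obs) < p_behaviour       # True = behaviour present
        flip = rng.random(n_obs) < error_rate               # classifier mistakes
        observed = np.where(flip, ~true_labels, true_labels)
        return observed.mean()                              # observed proportion

    group_a = [simulate_individual(0.40) for _ in range(30)]   # 30 individuals per group
    group_b = [simulate_individual(0.30) for _ in range(30)]
    t, p = ttest_ind(group_a, group_b)
    print(f"difference detected despite noise: t={t:.2f}, p={p:.4f}")
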

  8. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
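
    A Python approximation of this construction might look like the sketch below, using scikit-image's bicubic resize with anti-aliasing, clipping to [0, 255], and centring the result on a 72x72 black canvas; since the released data were generated with Matlab's imresize, results may differ slightly.

    # Approximate sketch of the rescale-and-embed step (not the exact Matlab pipeline).
    import numpy as np
    from skimage.transform import resize

    def rescale_and_embed(img28, scale, out_size=72):
        new = max(1, int(round(28 * scale)))
        resized = resize(img28.astype(np.float64), (new, new),
                         order=3, anti_aliasing=scale < 1, preserve_range=True)  # bicubic
        resized = np.clip(resized, 0, 255)           # remove interpolation overshoot
        if new > out_size:                           # very large scales: centre-crop to fit
            off = (new - out_size) // 2
            resized = resized[off:off + out_size, off:off + out_size]
            new = out_size
        canvas = np.zeros((out_size, out_size), dtype=np.float32)
        top = (out_size - new) // 2                  # keep the object centred
        left = (out_size - new) // 2
        canvas[top:top + new, left:left + new] = resized
        return canvas
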

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being an integer in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
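
    As a quick check, the scaling factors 2^(k/4) for k in [-4, 4] reproduce the scte* suffixes in the file names above:

    # The nine test-set scale factors, matching the scte* suffixes in the file names.
    factors = [2 ** (k / 4) for k in range(-4, 5)]
    print([f"{f:.3f}" for f in factors])
    # ['0.500', '0.595', '0.707', '0.841', '1.000', '1.189', '1.414', '1.682', '2.000']
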

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    # Load the training file for scale 1 (it also contains the matching validation and test splits).
    with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Shown here for the scale-1 test file; substitute any of the scte* files listed above.
    with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5", "r") as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/x_test');
    y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  9. DCASE 2023 Challenge Task 2 Additional Training Dataset

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Apr 26, 2023
    Cite
    Yohei Kawaguchi (2023). DCASE 2023 Challenge Task 2 Additional Training Dataset [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7830344
    Explore at:
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Kota Dohi
    Tomoya Nishida
    Takashi Endo
    Keisuke Imoto
    Noboru Harada
    Daisuke Niizumi
    Yuma Koizumi
    Yohei Kawaguchi
    Harsh Purohit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the "additional training dataset" for the DCASE 2023 Challenge Task 2 "First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring".

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel audio that includes both a machine's operating sound and environmental noise. The duration of recordings varies from 6 to 18 sec, depending on the machine type. The following seven types of real/toy machines are used:

    • Vacuum
    • ToyTank
    • ToyNscale
    • ToyDrone
    • bandsaw
    • grinder
    • shaker

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up to DCASE 2020 Task 2 through DCASE 2022 Task 2. The task this year is to develop an ASD system that meets the following four requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)

    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

    2. Detect anomalies regardless of domain shifts (domain generalization task)

    In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard-to-notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2.

    3. Train a model for a completely new machine type

    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning.

    4. Train a model using only one machine from its machine type

    While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that sound data from only one machine are available for a machine type. In such a case, the system should be able to train models using only one machine from a machine type.

    The last two requirements are newly introduced in DCASE 2023 Task 2 as the "first-shot problem".

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    "Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.

    A section is defined as a subset of the dataset for calculating performance metrics.

    The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

    Attributes are parameters that define states of machines or types of noise.

    Dataset

    This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
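
    Given that row format, an attribute csv can be read into a simple mapping as in the sketch below (the path is taken from the directory structure further down, and no header row is assumed):

    # Sketch: parse rows of the form [filename, d1p, d1v, d2p, d2v, ...]
    # into {filename: {domain shift parameter: domain shift value}}.
    import csv

    attributes = {}
    with open("dev_data/raw/Vacuum/attributes_00.csv", newline="") as fh:  # illustrative path
        for row in csv.reader(fh):
            filename, rest = row[0], row[1:]
            attributes[filename] = dict(zip(rest[0::2], rest[1::2]))       # dp -> dv pairs
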
    

    Recording procedure

    Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers explaining the details of the recording procedure by the submission deadline.

    Directory structure

    • /dev_data

      • /raw
        • /Vacuum
          • /train (only normal clips)
            • /section_00_source_train_normal_0000_.wav
            • ...
            • /section_00_source_train_normal_0989_.wav
            • /section_00_target_train_normal_0000_.wav
            • ...
            • /section_00_target_train_normal_0009_.wav
          • /test
            • /section_00_source_test_normal_0000_.wav
            • ...
            • /section_00_source_test_normal_0049_.wav
            • /section_00_source_test_anomaly_0000_.wav
            • ...
            • /section_00_source_test_anomaly_0049_.wav
            • /section_00_target_test_normal_0000_.wav
            • ...
            • /section_00_target_test_normal_0049_.wav
            • /section_00_target_test_anomaly_0000_.wav
            • ...
            • /section_00_target_test_anomaly_0049_.wav
          • attributes_00.csv (attribute csv for section 00)
      • /ToyTank (The other machine types have the same directory structure as Vacuum.)
      • /ToyNscale
      • /ToyDrone
      • /bandsaw
      • /grinder
      • /shaker

    Baseline system

    The baseline system is available in the GitHub repository dcase2023_task2_baseline_ae. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
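
    For readers who want the gist of an autoencoder-style baseline without cloning the repository, the sketch below illustrates the general reconstruction-error idea: extract log-mel feature frames, train an autoencoder on normal clips only, and score a clip by its mean reconstruction error. The feature settings and network sizes here are assumptions for illustration, not the settings of dcase2023_task2_baseline_ae.

    # Generic sketch of autoencoder-based anomaly scoring; parameters are assumptions.
    import librosa
    import numpy as np
    import torch
    from torch import nn

    def log_mel_frames(wav_path, sr=16000, n_mels=128, frames=5):
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        logmel = librosa.power_to_db(mel).T                     # (time, n_mels)
        stacked = [logmel[i:i + frames].ravel()                 # stack context frames
                   for i in range(len(logmel) - frames + 1)]
        return torch.tensor(np.array(stacked), dtype=torch.float32)

    model = nn.Sequential(nn.Linear(128 * 5, 64), nn.ReLU(),
                          nn.Linear(64, 8), nn.ReLU(),
                          nn.Linear(8, 64), nn.ReLU(),
                          nn.Linear(64, 128 * 5))
    # (training loop on the normal training clips omitted for brevity)

    def anomaly_score(wav_path):
        x = log_mel_frames(wav_path)
        with torch.no_grad():
            return torch.mean((model(x) - x) ** 2).item()       # reconstruction error
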

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    If you use this dataset, please cite all the following papers. We will publish a paper describing DCASE 2023 Task 2, so please make sure to cite that paper, too.

    Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda. First-shot anomaly detection for machine condition monitoring: A domain generalization baseline. In arXiv e-prints: 2303.00455, 2023. [URL]

    Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 31-35. Nancy, France, November 2022. [URL]

    Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5. Barcelona, Spain, November 2021. [URL]

    Contact

    If there is any problem, please contact us:

    Kota Dohi, kota.dohi.gr@hitachi.com

    Keisuke Imoto, keisuke.imoto@ieee.org

    Noboru Harada, noboru@ieee.org

    Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp

    Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com

  10. Data for: ToadFishFinder classifier model v4: A catalog of oyster toadfish (Opsanus tau) calls for machine learning

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 8, 2023
    Cite
    DelWayne Bohnenstiehl (2023). Data for: ToadFishFinder classifier model v4: A catalog of oyster toadfish (Opsanus tau) calls for machine learning [Dataset]. http://doi.org/10.5061/dryad.gtht76hr9
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    North Carolina State University
    Authors
    DelWayne Bohnenstiehl
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    This data repository contains labeled passive underwater acoustic data used to train and test the machine-learning model of Bohnenstiehl (in prep – 2023), Automated cataloging of oyster toadfish (Opsanus tau) calls using template matching and machine learning. The software accompanying this paper is known as ToadFishFinder, and the classifier model presented in the paper is v4. It consists of more than 10,000 labeled toadfish calls and 10,000 labeled 'other' signals. Labeled spectrogram images are provided, along with pressure-corrected waveforms (micro-Pascals) sampled at 24 kHz. Each waveform sample is 1350 ms long. The center 850 ms of these waveform segments represents the portion of the signal used in training and testing the classifier model. Waveform data are provided in multiple formats: 1) MATLAB (.mat) files containing the 'boatwhistle' and 'other' waveforms stored in column format, and 2) individual .wav files, each containing a labeled waveform example. Codes are provided to demonstrate how these .wav files can be read into MATLAB and PYTHON. These labeled data can be used to re-train the ToadFishFinder model or develop alternative classifiers.

    Methods

    Labeled signals were extracted from passive acoustic data collected at eight sites within southwestern Pamlico Sound near its confluence with the Pamlico and Neuse River estuaries. As part of a larger effort to monitor the evolution of these reef habitats, each site was outfitted with a SoundTrap 300 hydrophone affixed ~0.5 m above the seabed at the top of a metal stake anchored with a concrete block. Monitoring extended from the fall of 2016 through the fall of 2017, and from the spring of 2018 through the fall of 2018. Over most of the monitoring period, the recorders were programmed to capture a 2-minute-duration recording every 20 minutes. Acoustic data were collected at a rate of 96,000 samples/second. They were subsequently resampled to a rate of 24 kHz with the application of an anti-aliasing filter.

    ToadFishFinder's spectrogram correlation detector was deployed on hundreds of randomly selected files over the 2+ year monitoring period and from all eight sites. This approach ensured that the training and test datasets captured calls from estuarine soundscapes across various seasons and with varying anthropogenic, geophysical (wind, waves, rain) and biological noise. A spectrogram, a filtered waveform and a spectrogram image were displayed for each detection; signals were labeled as 'bwhistle' or 'other', and the labeled spectrogram images were retained for training and testing purposes. The final labeled catalog consisted of more than 10,000 signals within the 'bwhistle' class and more than 10,000 signals within the 'other' class. For each detection, an 850-ms-duration sample, beginning 400 ms before the detection time, is extracted from the unfiltered waveform data. A frequency-reassigned spectrogram was formed over the frequency range between 0 and 1200 Hz. Each spectrogram is converted to an RGB image and resized to 224-by-224 pixels. These spectrogram images are stored in './bwhistle' and './other' folders in the compressed folder spectrograms_TFv4.zip. Waveform data snippets are also provided for these labeled signals. Each snippet is 1350 ms long, with the training and test data (used in generating the spectrograms) representing the center 850 ms of data (points 6,000–26,401). These waveforms are unfiltered, pressure corrected and sampled at 24 kHz. These waveform snippets are provided in two formats. For those working in MATLAB, two .mat files are included (bwhistle_wavefrom_database_TFv4.mat and other_wavefrom_database_TFv4.mat). Each contains a 32,401 x N matrix with the snippets stored in columns and variables indicating station names, UTC times, sample rate and a time vector for plotting. For those working in other software, these snippets are also saved as individual wavfiles stored in './bwhistle' and './other' folders within the wavclips_TFv4.zip file. Scripts for reading these wavfiles in MATLAB and PYTHON are provided. The MATLAB-based ToadFishFinder software is available here: https://github.com/drbohnen/ToadFishFinder
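
    Reading one of the labeled .wav snippets and recovering the central 850-ms training window could look like the sketch below (the repository ships its own MATLAB and Python readers; the file name here is a placeholder):

    # Sketch: load a 1350-ms labeled snippet and slice out the central 850-ms
    # window (samples 6,000-26,401 at 24 kHz) used for training and testing.
    import soundfile as sf

    x, fs = sf.read("bwhistle/example_clip.wav")    # placeholder file name
    assert fs == 24000
    center = x[5999:26401]                           # MATLAB-style indices 6,000-26,401
    print(len(x), "samples total,", len(center), "samples in the training window")
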

  11. Data from: Data for machine learning predictions of pH in the glacial aquifer system, northern USA

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data for machine learning predictions of pH in the glacial aquifer system, northern USA [Dataset]. https://catalog.data.gov/dataset/data-for-machine-learning-predictions-of-ph-in-the-glacial-aquifer-system-northern-usa
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    A boosted regression tree (BRT) model was developed to predict pH conditions in three dimensions throughout the glacial aquifer system (GLAC) of the contiguous United States, using pH measurements in samples from 18,258 wells and predictor variables that represent aspects of the hydrogeologic setting. Model results indicate that the carbonate content of soils and aquifer materials strongly controls pH and, when coupled with long flow paths, results in the most alkaline conditions. Conversely, in areas where glacial sediments are thin and carbonate-poor, pH conditions remain acidic. At depths typical of drinking-water supplies, predicted pH > 7.5 (which is associated with arsenic mobilization) occurs more frequently than predicted pH < 6 (which is associated with water corrosivity and the mobilization of other trace elements). A novel aspect of this model was the inclusion of numerically based estimates of groundwater flow characteristics (age and flow path length) as predictor variables. The sensitivity of pH predictions to these variables was consistent with hydrologic understanding of groundwater flow systems and the geochemical evolution of groundwater quality. The model was not developed to provide precise estimates of pH at any given location. Rather, it can be used to more generally identify areas where contaminants may be mobilized into groundwater and where corrosivity issues may be of concern, to set priorities among areas for future groundwater monitoring.

    Data are provided in 2 tables and 3 compressed files that contain various files associated with the BRT model. The 2 tables include:

    • pH_Predictions_GLAC_GeochMod_Dataset.csv (GM dataset): This table is generally a subset of the pH dataset (the measured pH data for well sites that were separated into the training and testing dataset files "trnData.txt" and "testData.txt" included in model_archive.7z) that was used to model pH conditions, but includes more complete geochemical data and also includes some additional wells from Wilson and others (2019). The table includes pH, general chemical characteristics, concentrations of major and trace elements, calculated parameters, and mineral saturation indices (SI) computed with PHREEQC (Parkhurst and Appelo, 2013) for 9,655 groundwater samples from wells in the GLAC.
    • pH_Predictions_GLAC_Variable_Descriptions.txt: A table listing all variables (short abbreviation and long description) used in the BRT model, including the importance rank of the variable, units, and reference.

    The 3 compressed files include:

    • model_archive.7z: contains 15 files associated with the BRT model
    • rstack_dom.7z: rstack_dom.txt
    • rstack_pub.7z: rstack_pub.txt

    Refer to the README.txt file in model_archive.7z for information about the files in the archive and how to use them to run the BRT model. The "rstack" files represent raster stacks, which are collections of raster layer objects with the same spatial extent and resolution that are vertically aligned. Rstack.dom consists of raster layer objects at the depth typically used for domestic supplies, and rstack.pub, those at the depth typically used for public supplies.
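
    For orientation, a boosted regression tree for pH could be fit along the lines of the sketch below. This is an illustrative scikit-learn stand-in, not the archived USGS model; it assumes trnData.txt and testData.txt are delimited tables with a pH column, so consult README.txt in model_archive.7z for the real formats and scripts.

    # Illustrative stand-in (not the archived USGS model): fit a boosted regression
    # tree to the training table and score it on the test table. The delimiter and
    # the 'pH' column name are assumptions.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    train = pd.read_csv("trnData.txt", sep=None, engine="python")
    test = pd.read_csv("testData.txt", sep=None, engine="python")

    X_train, y_train = train.drop(columns=["pH"]), train["pH"]
    X_test, y_test = test.drop(columns=["pH"]), test["pH"]

    brt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=5)
    brt.fit(X_train, y_train)
    print("R^2 on held-out wells:", brt.score(X_test, y_test))
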

  12. Data from: Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Oct 29, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters [Dataset]. https://catalog.data.gov/dataset/input-files-and-code-for-machine-learning-can-accurately-assign-geologic-basin-to-produced
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    As hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than the PWGD9 dataset, suggesting that a larger sample size and/or fewer attributes lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
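
    The workflow described above translates roughly into the sketch below, a scikit-learn stand-in rather than the original code; the input file name and column names are assumptions based on the attributes listed, and the published models were tuned differently (e.g. extratrees splitting with mtry = 5).

    # Rough stand-in for the PWGD9 workflow: drop per-province outliers at ~99%
    # confidence, make a 90/10 split, and fit a random forest classifier.
    # The file name and column names are assumptions.
    import pandas as pd
    from scipy import stats
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    cols = ["SPGRAV", "PH", "HCO3", "Na", "Mg", "Ca", "Cl", "SO4", "TDS"]
    df = pd.read_csv("pwgd9.csv").dropna(subset=cols + ["BASIN"])     # assumed layout

    # Keep rows with |z| < 2.576 (two-sided 99%) on every attribute, per province.
    def trim(group):
        z = group[cols].apply(stats.zscore)
        return group[(z.abs() < 2.576).all(axis=1)]

    df = df.groupby("BASIN", group_keys=False).apply(trim)

    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], df["BASIN"],
                                              test_size=0.10, stratify=df["BASIN"],
                                              random_state=0)
    rf = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
    rf.fit(X_tr, y_tr)
    print("overall accuracy:", rf.score(X_te, y_te))
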

  13. Synthetic Test Data Platform Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). Synthetic Test Data Platform Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-test-data-platform-market
    Explore at:
    pptx, csv, pdf (available download formats)
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Test Data Platform Market Outlook



    According to our latest research, the synthetic test data platform market size reached USD 1.25 billion in 2024, with a robust compound annual growth rate (CAGR) of 33.7% projected through the forecast period. By 2033, the market is anticipated to reach approximately USD 14.72 billion, reflecting the surging demand for data privacy, compliance, and advanced testing capabilities. The primary growth driver is the increasing emphasis on data security and privacy regulations, which is prompting organizations to adopt synthetic data solutions for software testing and machine learning applications.




    The synthetic test data platform market is experiencing remarkable growth due to the exponential increase in data-driven applications and the rising complexity of software systems. Organizations across industries are under immense pressure to accelerate their digital transformation initiatives while ensuring robust data privacy and regulatory compliance. Synthetic test data platforms enable the generation of realistic, privacy-compliant datasets, allowing enterprises to test software applications and train machine learning models without exposing sensitive information. This capability is particularly crucial in sectors such as banking, healthcare, and government, where regulatory scrutiny over data usage is intensifying. Furthermore, the adoption of agile and DevOps methodologies is fueling the demand for automated, scalable, and on-demand test data generation, positioning synthetic test data platforms as a strategic enabler for modern software development lifecycles.




    Another significant growth factor is the rapid advancement in artificial intelligence (AI) and machine learning (ML) technologies. As organizations increasingly leverage AI/ML models for predictive analytics, fraud detection, and customer personalization, the need for high-quality, diverse, and unbiased training data has become paramount. Synthetic test data platforms address this challenge by generating large volumes of data that accurately mimic real-world scenarios, thereby enhancing model performance while mitigating the risks associated with data privacy breaches. Additionally, these platforms facilitate continuous integration and continuous delivery (CI/CD) pipelines by providing reliable test data at scale, reducing development cycles, and improving time-to-market for new software releases. The ability to simulate edge cases and rare events further strengthens the appeal of synthetic data solutions for critical applications in finance, healthcare, and autonomous systems.




    The market is also benefiting from the growing awareness of the limitations associated with traditional data anonymization techniques. Conventional methods often fail to guarantee complete privacy, leading to potential re-identification risks and compliance gaps. Synthetic test data platforms, on the other hand, offer a more robust approach by generating entirely new data that preserves the statistical properties of original datasets without retaining any personally identifiable information (PII). This innovation is driving adoption among enterprises seeking to balance innovation with regulatory requirements such as GDPR, HIPAA, and CCPA. The integration of synthetic data generation capabilities with existing data management and analytics ecosystems is further expanding the addressable market, as organizations look for seamless, end-to-end solutions to support their data-driven initiatives.
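    The statistical-fidelity idea can be illustrated in a few lines of code. The sketch below is purely illustrative and not tied to any vendor's platform: it fits a simple multivariate-normal model to two numeric columns and samples brand-new rows from it, so the synthetic table approximately matches the original's means and covariances without reusing any original record. Real platforms rely on far richer generative models, but the principle is the same.

```python
# Illustrative only: generate new records from a distribution fitted to the
# real data so summary statistics are preserved while no original row is kept.
import numpy as np
import pandas as pd

real = pd.DataFrame({
    "age": np.random.default_rng(0).normal(40, 10, 1000).round(),
    "balance": np.random.default_rng(1).lognormal(8, 1, 1000),
})

# Fit a simple multivariate-normal model to the numeric columns...
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# ...and sample synthetic rows from it.
rng = np.random.default_rng(42)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1000),
                         columns=real.columns)

# Compare summary statistics of the real and synthetic tables.
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```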




    From a regional perspective, North America currently dominates the synthetic test data platform market, accounting for the largest share due to the presence of leading technology vendors, stringent data privacy regulations, and a mature digital infrastructure. Europe is also witnessing significant growth, driven by the enforcement of GDPR and increasing investments in AI research and development. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digitalization, expanding IT sectors, and rising awareness of data privacy issues. Latin America and the Middle East & Africa are gradually catching up, supported by government initiatives to modernize IT infrastructure and enhance cybersecurity capabilities. As organizations worldwide prioritize data privacy, regulatory compliance, and digital innovation, the demand for synthetic test data platforms is expected to surge across all major regions during the forecast period.




  14. Training and test datasets for the PredictONCO tool

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Dec 14, 2023
    Cite
    Stourac, Jan; Borko, Simeon; Khan, Rayyan; Pokorna, Petra; Dobias, Adam; Planas-Iglesias, Joan; Mazurenko, Stanislav; Pinto, Gaspar; Szotkowska, Veronika; Sterba, Jaroslav; Slaby, Ondrej; Damborsky, Jiri; Bednar, David (2023). Training and test datasets for the PredictONCO tool [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10013763
    Explore at:
    Dataset updated
    Dec 14, 2023
    Authors
    Stourac, Jan; Borko, Simeon; Khan, Rayyan; Pokorna, Petra; Dobias, Adam; Planas-Iglesias, Joan; Mazurenko, Stanislav; Pinto, Gaspar; Szotkowska, Veronika; Sterba, Jaroslav; Slaby, Ondrej; Damborsky, Jiri; Bednar, David
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used for training and validating the PredictONCO web tool, supporting decision-making in precision oncology by extending the bioinformatics predictions with advanced computing and machine learning. The dataset consists of 1073 single-point mutants of 42 proteins, whose effect was classified as Oncogenic (509 data points) and Benign (564 data points). All mutations were annotated with a clinically verified effect and were compiled from the ClinVar and OncoKB databases. The dataset was manually curated based on the available information in other precision oncology databases (The Clinical Knowledgebase by The Jackson Laboratory, Personalized Cancer Therapy Knowledge Base by MD Anderson Cancer Center, cBioPortal, DoCM database) or in the primary literature. To create the dataset, we also removed any possible overlaps with the data points used in the PredictSNP consensus predictor and its constituents. This was implemented to avoid any test set data leakage due to using the PredictSNP score as one of the features (see below).

    The entire dataset (SEQ) was further annotated by the pipeline of PredictONCO. Briefly, the following six features were calculated regardless of the structural information available: essentiality of the mutated residue (yes/no), the conservation of the position (the conservation grade and score), the domain where the mutation is located (cytoplasmic, extracellular, transmembrane, other), the PredictSNP score, and the number of essential residues in the protein. For approximately half of the data (STR: 377 and 76 oncogenic and benign data points, respectively), the structural information was available, and six more features were calculated: FoldX and Rosetta ddg_monomer scores, whether the residue is in the catalytic pocket (identification of residues forming the ligand-binding pocket was obtained from P2Rank), and the pKa changes (the minimum and maximum changes as well as the number of essential residues whose pKa was changed – all values obtained from PROPKA3). For both STR and SEQ datasets, 20% of the data was held out for testing. The data split was implemented at the position level to ensure that no position from the test data subset appears in the training data subset.
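    A position-level hold-out like the one described above can be sketched with scikit-learn's GroupShuffleSplit. The file and column names below are hypothetical placeholders; the grouping key simply ensures that all mutants at a given protein position fall on the same side of the 80/20 split.

```python
# Minimal sketch of a position-level 80/20 hold-out, assuming a table of
# mutants with hypothetical columns "protein", "position", and "label".
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("predictonco_features.csv")  # hypothetical file name
groups = df["protein"].astype(str) + ":" + df["position"].astype(str)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, df["label"], groups=groups))

train, test = df.iloc[train_idx], df.iloc[test_idx]

# No (protein, position) group appears in both subsets.
assert not set(groups.iloc[train_idx]) & set(groups.iloc[test_idx])
print(len(train), "training rows,", len(test), "test rows")
```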

    For more details about the tool, please visit the help page or get in touch with us.

    14-Dec-2023 update: the file with features PredictONCO-features.txt now includes UniProt IDs, transcripts, PDB codes, and mutations.

  15. DataSheet_1_Diagnostic Performance of 2D and 3D T2WI-Based Radiomics...

    • frontiersin.figshare.com
    txt
    Updated Jun 7, 2023
    + more versions
    Cite
    Qi Wan; Jiaxuan Zhou; Xiaoying Xia; Jianfeng Hu; Peng Wang; Yu Peng; Tianjing Zhang; Jianqing Sun; Yang Song; Guang Yang; Xinchun Li (2023). DataSheet_1_Diagnostic Performance of 2D and 3D T2WI-Based Radiomics Features With Machine Learning Algorithms to Distinguish Solid Solitary Pulmonary Lesion.csv [Dataset]. http://doi.org/10.3389/fonc.2021.683587.s001
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Qi Wan; Jiaxuan Zhou; Xiaoying Xia; Jianfeng Hu; Peng Wang; Yu Peng; Tianjing Zhang; Jianqing Sun; Yang Song; Guang Yang; Xinchun Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2-weighted imaging (T2WI). Materials and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension-reduction algorithms, 3 feature-selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), the precision-recall plot, and the Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches. Results: The 3D features were significantly superior to 2D features, yielding many more machine learning combinations with AUC greater than 0.7 in both the validation and test groups (129 vs. 11). The feature-selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results to 3D features alone. Incorporating clinical features with the 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant from benign SPLs, but 3D features are still preferred because more machine learning algorithmic combinations with better performance are available. The feature-selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
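    For illustration, one of the candidate combinations described above (z-score normalization, ANOVA feature selection, and a logistic-regression classifier, scored by ten-fold cross-validated AUC) might be assembled as in the sketch below; the arrays X and y are random placeholders standing in for the extracted radiomics features and labels.

```python
# Sketch of one normalization / feature-selection / classifier combination:
# z-score scaling, ANOVA (f_classif) selection of a small feature subset,
# and logistic regression, scored with 10-fold cross-validated AUC.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 1692))   # placeholder for 3D radiomics features
y = rng.integers(0, 2, size=92)   # placeholder benign/malignant labels

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=9)),  # feature number confined to 3-9
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("mean cross-validated AUC:", aucs.mean())
```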

  16. College Test Preparation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). College Test Preparation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/college-test-preparation-market
    Explore at:
    pptx, csv, pdfAvailable download formats
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    College Test Preparation Market Outlook




    According to our latest research, the global College Test Preparation market size reached USD 10.7 billion in 2024, showcasing robust momentum and a growing demand for academic advancement tools. The market is projected to expand at a CAGR of 7.1% during the forecast period, reaching a forecasted value of USD 19.9 billion by 2033. This growth trajectory is powered by increased competition for college admissions, the proliferation of digital learning platforms, and the rising awareness among students and parents about the importance of standardized test scores in shaping academic and career prospects.




    One of the principal growth drivers for the College Test Preparation market is the escalating competition for admissions into prestigious colleges and universities worldwide. As acceptance rates at top-tier institutions continue to decline, students are seeking every advantage to differentiate themselves. This has led to a surge in demand for comprehensive test preparation solutions, including online courses, practice tests, and personalized tutoring services. Parents and students are increasingly willing to invest significant resources in these services, viewing them as critical in securing high test scores and, consequently, better educational and career opportunities. The trend is particularly pronounced in regions with highly competitive academic environments, such as North America and parts of Asia, where standardized test performance can be a decisive factor in college admissions.




    Technological advancements and the digital transformation of education are also pivotal forces shaping the College Test Preparation market. The widespread adoption of online learning platforms has democratized access to high-quality test preparation resources, enabling students from diverse backgrounds to participate in rigorous preparatory programs. Interactive online courses, AI-driven practice tests, and adaptive learning modules have significantly enhanced the effectiveness and engagement of test preparation. These innovations not only make learning more flexible and personalized but also allow providers to scale their offerings globally. The integration of data analytics and real-time feedback mechanisms further empowers students to track their progress and tailor their study strategies, resulting in improved outcomes and higher satisfaction rates.




    Another significant growth factor is the increasing emphasis on lifelong learning and upskilling among working professionals and college graduates. As career trajectories become more dynamic and competitive, adults are returning to standardized tests such as the GRE, GMAT, and LSAT to pursue advanced degrees or pivot to new fields. This has broadened the target audience for test preparation providers, driving the development of specialized courses and flexible learning formats that cater to the unique needs of adult learners. Additionally, the globalization of higher education has led to a rise in international students seeking admission to institutions in the United States, Canada, Europe, and Asia, further fueling demand for test preparation products and services.



    As the demand for standardized test preparation continues to grow, TOEFL Preparation has emerged as a critical component for students aiming to study in English-speaking countries. The TOEFL exam assesses non-native English speakers' proficiency, making it a vital step for international students seeking admission to universities in the United States, Canada, and other English-speaking regions. With the globalization of education, many institutions now require TOEFL scores as part of their admissions process, further driving the need for specialized preparation resources. Providers are responding by offering tailored courses, practice tests, and interactive learning modules designed to enhance language skills and boost confidence. This focus on TOEFL Preparation not only supports students in achieving their academic goals but also contributes to the broader growth of the College Test Preparation market.




    From a regional perspective, North America remains the largest market for College Test Preparation, accounting for over 38% of global revenue in 2024. The region's leadership is underpinned by a well-established culture of standardized testing, a high concent

  17. Augmented_Balanced_Stanford_196Cars

    • kaggle.com
    zip
    Updated May 17, 2021
    + more versions
    Cite
    AmeyaPat (2021). Augmented_Balanced_Stanford_196Cars [Dataset]. https://www.kaggle.com/datasets/ameyapat/augmented-balanced-stanford-196cars
    Explore at:
    zip(3845187486 bytes)Available download formats
    Dataset updated
    May 17, 2021
    Authors
    AmeyaPat
    Description

    The Stanford Car Images dataset contains about 8000 images each in its train and test sets, spread over 196 classes. It is intended for research purposes only. The present dataset is derived from the Stanford Car Images dataset on the following principles. The train set is further split into training and validation sets in a 2:1 ratio. The training and validation images are cropped to the bounding boxes available with the original data; the crop keeps approximately a 5% margin around the car. The idea is to focus attention on the object we are trying to classify. There are two test folders, cropped as well as uncropped, so the final evaluation can be made on the untouched test data. The training and validation images are augmented using albumentations. The amount of augmentation is calibrated so that in the final output we have a roughly equal number of cars per class. The idea of having a fixed augmented set of images, rather than relying on Keras on-the-fly augmentation, is to feed the model better-balanced data. In addition, we are able to use Keras augmentation on top of the albumentations augmentations, if desired.
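    A minimal sketch of the preprocessing described above (cropping to the bounding box with a roughly 5% margin, then generating augmented copies with albumentations) is shown below. The file name, bounding-box coordinates, and the particular transforms are illustrative assumptions, not the exact recipe used to build this dataset.

```python
# Crop an image to its bounding box with a ~5% margin, then write a few
# augmented copies with albumentations. Paths and the bbox are hypothetical.
import numpy as np
from PIL import Image
import albumentations as A

def crop_with_margin(img: np.ndarray, box, margin=0.05) -> np.ndarray:
    """Crop to (x1, y1, x2, y2), padded by `margin` of the box size."""
    x1, y1, x2, y2 = box
    mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
    h, w = img.shape[:2]
    return img[max(0, y1 - my):min(h, y2 + my),
               max(0, x1 - mx):min(w, x2 + mx)]

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.Rotate(limit=15, p=0.5),
])

img = np.array(Image.open("car_00001.jpg"))          # hypothetical file
cropped = crop_with_margin(img, (50, 40, 450, 300))  # hypothetical bbox

# Write as many augmented copies as needed to balance the class.
for i in range(3):
    out = augment(image=cropped)["image"]
    Image.fromarray(out).save(f"car_00001_aug{i}.jpg")
```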

    Citation: 3D Object Representations for Fine-Grained Categorization. Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei. 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.

    https://ai.stanford.edu/~jkrause/cars/car_dataset.html

  18. Higher Education Testing And Assessment Market Analysis North America,...

    • technavio.com
    pdf
    Updated Jan 10, 2025
    Cite
    Technavio (2025). Higher Education Testing And Assessment Market Analysis North America, Europe, APAC, Middle East and Africa, South America - US, Germany, China, India, Canada, France, UK, Japan, Brazil, Italy - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/higher-education-testing-and-assessment-market-industry-analysis
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jan 10, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Higher Education Testing And Assessment Market Size 2025-2029

    The higher education testing and assessment market size is forecast to increase by USD 7.57 billion at a CAGR of 6.6% between 2024 and 2029.

    The market is witnessing significant shifts as educational institutions increasingly adopt formative assessment methods. This transition signifies a move away from traditional summative assessments towards ongoing evaluation of student progress, enabling educators to identify and address learning gaps in real-time. Moreover, the role of educational technologies in testing and assessment is evolving, with advancements in artificial intelligence, machine learning, and data analytics enabling more personalized and effective assessments. However, challenges persist in this market. One major obstacle is the weak assessment mechanism of online tests, which has become increasingly prevalent due to the shift towards remote learning. Ensuring the validity and reliability of online assessments is a significant challenge, as issues such as test security, proctoring, and test-taker authenticity must be addressed to maintain the integrity of the assessment process. Additionally, ensuring equal access to technology and internet connectivity for all students is crucial to prevent disparities in testing outcomes. Companies seeking to capitalize on market opportunities in this space must focus on developing robust and secure online assessment solutions while addressing these challenges effectively.

    What will be the Size of the Higher Education Testing And Assessment Market during the forecast period?

    The market continues to evolve, driven by advancements in technology and shifting educational priorities. Machine learning and artificial intelligence are increasingly utilized in test development and administration, enabling personalized learning and competency-based education. Colleges and universities employ these technologies to assess student performance, ensure test security, and improve test scoring. K-12 schools also adopt assessment solutions to enhance instructional design and facilitate career readiness. Test bias remains a critical concern, with natural language processing and data analysis used to mitigate its impact. Faculty development programs focus on integrating assessment for learning into the curriculum, while test administration becomes more flexible with online and computer-based options. Graduate students and international students benefit from summative and diagnostic assessments, while undergraduate students engage in formative assessments for proficiency and achievement. Continuous improvement in test development and assessment software ensures test validity and reliability, with adaptive learning and learning analytics providing valuable insights for curriculum alignment. Professional development opportunities for educators and administrators are essential to keep pace with the evolving market dynamics and effectively implement these innovative assessment strategies.

    How is this Higher Education Testing And Assessment Industry segmented?

    The higher education testing and assessment industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments. Product: Academic, Non-academic. End-user: Educational institutions, Universities, Training organizations, Others. Geography: North America (US, Canada), Europe (France, Germany, Italy, UK), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW).

    By Product Insights

    The academic segment is estimated to witness significant growth during the forecast period. In the realm of higher education, the academic segment encompasses assessments for STEM subjects, a crucial component of the curriculum. Historically, pen and paper tests dominated this area. However, the integration of Learning Management Systems (LMS) and Content Management Systems (CMS) and the escalating preference for personalized learning technologies, including adaptive learning, have significantly boosted the utilization of digital tools in higher education institutions for testing and assessment. The proliferation of technology in classrooms is further fueled by students' increasing use of smartphones, tablets, and e-libraries. Machine learning and artificial intelligence are also playing pivotal roles in creating immersive, harmonious learning experiences. Formative assessments, such as diagnostic tests and quizzes, are increasingly being used to gauge student progress and inform instructional design. Test security and reliability remain paramount, with test development, administration, and scoring being critical components. Competency-based education, career readiness, and achievement tests are also gaining traction. K-12 schools are also adopting similar testing and assessment methodologies. Natur

  19. Face Features Test Dataset

    • universe.roboflow.com
    zip
    Updated Dec 6, 2021
    Cite
    Peter Lin (2021). Face Features Test Dataset [Dataset]. https://universe.roboflow.com/peter-lin/face-features-test/dataset/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 6, 2021
    Dataset authored and provided by
    Peter Lin
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Face Features Bounding Boxes
    Description

    A simple dataset for benchmarking CreateML object detection models. The images are sampled from COCO dataset with eyes and nose bounding boxes added. It’s not meant to be serious or useful in a real application. The purpose is to look at how long it takes to train CreateML models with varying dataset and batch sizes.

    Training performance is affected by model configuration, dataset size and batch configuration. Larger models and batches require more memory. I used CreateML object detection project to compare the performance.

    Hardware

    M1 MacBook Air: 8 GPU, 4/4 CPU, 16G memory, 512G SSD

    M1 Max MacBook Pro: 24 GPU, 2/8 CPU, 32G memory, 2T SSD

    Small Dataset: Train 144, Valid 16, Test 8

    Results

    | batch | M1 ET | M1Max ET | peak mem G |
    |-------|:------|:---------|:-----------|
    | 16    | 16    | 11       | 1.5        |
    | 32    | 29    | 17       | 2.8        |
    | 64    | 56    | 30       | 5.4        |
    | 128   | 170   | 57       | 12         |

    Larger Dataset: Train 301, Valid 29, Test 18

    Results

    | batch | M1 ET | M1Max ET | peak mem G |
    |-------|:------|:---------|:-----------|
    | 16    | 21    | 10       | 1.5        |
    | 32    | 42    | 17       | 3.5        |
    | 64    | 85    | 30       | 8.4        |
    | 128   | 281   | 54       | 16.5       |

    CreateML Settings

    For all tests, training was set to Full Network. I closed CreateML between each run to make sure memory issues didn't cause a slowdown. There is a bug in Monterey as of 11/2021 that leads to a memory leak, so I kept an eye on the memory usage; if it looked like there was a memory leak, I restarted macOS.

    Observations

    In general, the additional GPU cores and memory of the MacBook Pro reduce the training time, and having more memory lets you train with larger datasets. On the M1 MacBook Air, the practical limit is 12G before memory pressure impacts performance; on the M1 Max MacBook Pro, the practical limit is 26G. To work around memory pressure, use smaller batch sizes.

    On the larger dataset with batch size 128, the M1 Max is roughly 5x faster than the MacBook Air. Keep in mind that a real dataset should have thousands of samples, like COCO or Pascal; ideally, you want a dataset with 100K images for experimentation and millions for the real training. The M1 Max MacBook Pro is a cost-effective alternative to building a Windows/Linux workstation with an RTX 3090 24G. For most of 2021, the price of an RTX 3090 with 24G was around $3,000, which means an equivalent Windows workstation would cost about the same as the M1 Max MacBook Pro I used to run the benchmarks.

    Full Network vs Transfer Learning

    As of CreateML 3, training with the full network doesn't fully utilize the GPU; I don't know why it works that way. You have to select transfer learning to fully use the GPU. The results of transfer learning with the larger dataset are below. In general, the training time is shorter and the loss is better.

    | batch | ET min | Train Acc | Val Acc | Test Acc | Top IU Train | Top IU Valid | Top IU Test | Peak mem G | loss  |
    |-------|--------|-----------|---------|----------|--------------|--------------|-------------|------------|-------|
    | 16    | 4      | 75        | 19      | 12       | 78           | 23           | 13          | 1.5        | 0.41  |
    | 32    | 8      | 75        | 21      | 10       | 78           | 26           | 11          | 2.76       | 0.02  |
    | 64    | 13     | 75        | 23      | 8        | 78           | 24           | 9           | 5.3        | 0.017 |
    | 128   | 25     | 75        | 22      | 13       | 78           | 25           | 14          | 8.4        | 0.012 |

    Github Project

    The source code and full results are up on Github https://github.com/woolfel/createmlbench

  20. AV Test Safety Driver Training Programs Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). AV Test Safety Driver Training Programs Market Research Report 2033 [Dataset]. https://dataintelo.com/report/av-test-safety-driver-training-programs-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AV Test Safety Driver Training Programs Market Outlook



    According to our latest research, the AV Test Safety Driver Training Programs market size reached USD 1.2 billion in 2024, with a robust CAGR of 15.7% projected through the forecast period. By 2033, the market is expected to reach USD 4.4 billion, reflecting the increasing adoption of autonomous vehicle technologies and the stringent safety regulations driving demand for comprehensive driver training programs. The surge in autonomous vehicle (AV) testing, coupled with regulatory mandates for safety driver certification, is a primary growth factor propelling the market forward.




    One of the most significant growth drivers for the AV Test Safety Driver Training Programs market is the rapid evolution and deployment of autonomous vehicle technologies across both passenger and commercial vehicle segments. As AVs progress from prototype to real-world deployment, the complexity of testing environments and the need for highly skilled safety drivers intensifies. Regulatory bodies in regions such as North America and Europe are enforcing stricter safety protocols, requiring that safety drivers undergo specialized training to ensure they can respond effectively in critical situations. This regulatory push is compelling automotive OEMs, AV developers, and testing agencies to invest heavily in structured training programs, boosting market demand. Furthermore, the rise in public road testing and the expansion of AV pilot programs globally necessitate a larger pool of certified safety drivers, further fueling market growth.




    Another key growth factor is the diversification of training modalities within the AV Test Safety Driver Training Programs market. The industry has witnessed a shift from traditional classroom-based training to more immersive and technologically advanced methods, such as simulator-based and online training. Simulator-based training, in particular, is gaining traction due to its ability to replicate real-world scenarios in a controlled environment, allowing safety drivers to hone their skills without the risk of real-world incidents. Online training platforms are also seeing increased adoption, providing flexibility and scalability for organizations with large numbers of drivers to certify. This evolution in training delivery methods not only enhances the quality and effectiveness of safety driver programs but also broadens the market’s reach to a global audience.




    Strategic collaborations between automotive OEMs, technology providers, and third-party training organizations are further accelerating market growth. These partnerships are enabling the development of standardized curricula, integration of advanced simulation technologies, and the sharing of best practices across the industry. As AV testing becomes more complex, the need for consistent and high-quality training becomes paramount. Organizations are increasingly seeking out specialized training providers with proven track records, driving demand for both in-house and third-party training solutions. The competitive landscape is also fostering innovation, with providers continuously enhancing their offerings to address emerging needs, such as cybersecurity awareness and human-machine interface training for safety drivers.




    From a regional perspective, North America currently dominates the AV Test Safety Driver Training Programs market, accounting for over 40% of global revenue in 2024. This leadership is driven by the presence of major automotive OEMs, pioneering AV technology companies, and a proactive regulatory environment. Europe follows closely, with significant investments in AV infrastructure and collaborative research initiatives. The Asia Pacific region is emerging as a high-growth market, fueled by rapid urbanization, government support for smart mobility solutions, and the expansion of AV pilot programs in countries such as China, Japan, and South Korea. Latin America and the Middle East & Africa are gradually entering the market, primarily through partnerships with global OEMs and technology providers, but their share remains comparatively modest.



    Program Type Analysis



    The Program Type segment in the AV Test Safety Driver Training Programs market encompasses a diverse range of training methodologies, each addressing specific learning objectives and regulatory requirements. Classroom training remains a foundational component, offering safet
