87 datasets found
  1. File Validation and Training Statistics

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Cite
    The Devastator (2023). File Validation and Training Statistics [Dataset]. https://www.kaggle.com/datasets/thedevastator/file-validation-and-training-statistics
    Explore at:
    zip (16413235 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    File Validation and Training Statistics

    Validation, Training, and Testing Statistics for tasksource/leandojo Files

    By tasksource (From Huggingface) [source]

    About this dataset

    The tasksource/leandojo: File Validation, Training, and Testing Statistics dataset is a comprehensive collection of information regarding the validation, training, and testing processes of files in the tasksource/leandojo repository. This dataset is essential for gaining insights into the file management practices within this specific repository.

    The dataset consists of three distinct files: validation.csv, train.csv, and test.csv. Each file serves a unique purpose in providing statistics and information about the different stages involved in managing files within the repository.

    In validation.csv, you will find detailed information about the validation process undergone by each file. This includes data such as file paths within the repository (file_path), full names of each file (full_name), associated commit IDs (commit), traced tactics implemented (traced_tactics), URLs pointing to each file (url), and respective start and end dates for validation.

    train.csv provides statistics for the training phase of files: file paths within the repository (file_path), full names of individual files (full_name), associated commit IDs (commit), traced tactics used during training (traced_tactics), and URLs linking to each file (url).

    Lastly, test.csv contains statistics on testing activities performed on files within the tasksource/leandojo repository: file paths within the repo structure (file_path), full names of each tested file (full_name), associated commit IDs for the tested file versions (commit), traced tactics used during testing (traced_tactics), and URLs pointing to the tested files (url).

    By exploring this dataset of three CSV files - validation.csv, train.csv, and test.csv - researchers can gain insights into how validation, training, and testing tasks have been carried out to maintain quality standards within the tasksource/leandojo repository.

    How to use the dataset

    • Familiarize Yourself with the Dataset Structure:

      • The dataset consists of three separate files: validation.csv, train.csv, and test.csv.
      • Each file contains multiple columns providing different information about file validation, training, and testing.
    • Explore the Columns:

      • 'file_path': This column represents the path of the file within the repository.
      • 'full_name': This column displays the full name of each file.
      • 'commit': The commit ID associated with each file is provided in this column.
      • 'traced_tactics': The tactics traced in each file are listed in this column.
      • 'url': This column provides the URL of each file.
    • Understand Each File's Purpose:

    validation.csv - Contains information related to the validation process of files in the tasksource/leandojo repository.

    train.csv - Use this file for statistics and information on the training phase of files in the tasksource/leandojo repository.

    test.csv - Refer to this file for statistics and information about testing individual files within the tasksource/leandojo repository.

    • Generate Insights & Analyze Data:
    • Once you have a clear understanding of each column's purpose, you can start generating insights from your analysis using various statistical techniques or machine learning algorithms.
    • Explore patterns or trends by examining specific columns such as 'traced_tactics' or analyzing multiple columns together.

    • Combine Multiple Files (if necessary):

    • If required, you can merge/correlate data across different csv files based on common fields such as 'file_path', 'full_name', or 'commit'.

    • Visualize the Data (Optional):

    • To enhance your analysis, consider creating visualizations such as plots, charts, or graphs. Visualization can offer a clear representation of patterns or relationships within the dataset.

    • Obtain Further Information:

    • If you need additional details about any specific file, make use of the provided 'url' column to access further information.

    Remember that this guide provides a general overview of how to utilize this dataset effectively. Feel ...
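The combine-and-merge steps above can be sketched with pandas. The file names and the shared join columns come from the dataset description; the helper functions themselves are illustrative, not part of the dataset:

```python
import pandas as pd

def combine_splits(paths):
    """Stack the validation/train/test CSVs into one frame,
    tagging each row with its split name."""
    frames = []
    for name, path in paths.items():
        df = pd.read_csv(path)
        df["split"] = name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

def files_in_multiple_splits(combined, keys=("file_path", "full_name", "commit")):
    """Files (identified by the documented shared columns) that appear
    in more than one split."""
    counts = combined.groupby(list(keys))["split"].nunique()
    return counts[counts > 1]
```

For example, `combine_splits({"validation": "validation.csv", "train": "train.csv", "test": "test.csv"})` produces a single frame suitable for the cross-split analysis suggested above.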

  2. Data from: Robust Validation: Confident Predictions Even When Distributions...

    • tandf.figshare.com
    bin
    Updated Dec 26, 2023
    Cite
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
    Explore at:
    bin
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
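As background, the standard split-conformal construction that this method builds on can be sketched in a few lines. This is the vanilla, non-robust version, shown only to illustrate how prediction sets arise from calibration scores; the paper's f-divergence-robust procedure is more involved:

```python
import math

def conformal_quantile(scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th order statistic
    of the calibration nonconformity scores (standard construction, not the
    paper's distribution-shift-robust variant)."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the quantile
    return sorted(scores)[min(k, n) - 1]

def prediction_set(q, candidate_scores):
    """All candidate labels whose nonconformity score is at most the threshold q."""
    return [y for y, s in candidate_scores.items() if s <= q]
```

Under exchangeability of the calibration data, sets built this way cover the true label with probability at least 1 - alpha; the paper's contribution is keeping that guarantee when the test distribution shifts.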

  3. OpenAI Summarization Corpus

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). OpenAI Summarization Corpus [Dataset]. https://www.kaggle.com/datasets/thedevastator/openai-summarization-corpus/code
    Explore at:
    zip (35399096 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI Summarization Corpus

    Training and Validation Data from TL;DR, CNN, and Daily Mail

    By Huggingface Hub [source]

    About this dataset

    This dataset provides a comprehensive corpus for natural language processing tasks, specifically text summarization and the validation of OpenAI's reward models. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices workers made when comparing summaries, batch information differentiating the summaries created by workers, and dataset split attributes. Together, this data lets users train state-of-the-art natural language processing systems on real-world data to produce reliable, concise summaries of long-form text, while benchmarking results directly against human-generated summaries.


    How to use the dataset

    This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.

    To use this dataset for summarization tasks:
    - Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
    - Choose which summary you want from the choice column of either .csv file, based on your preference for worker or batch type summarization.
    - Review entries in the selected summary's corresponding summaries column for alternatives with similar content but different word choices or styles that you may prefer over the original choice.
    - Look through the split, worker, and batch information for more context on each choice before selecting the summary that best matches your needs for accuracy and clarity.

    Research Ideas

    • Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
    • Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
    • Analyzing the worker and batch information, in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: comparisons_validation.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split | Split of the dataset between training and validation sets. (String) |
    | extra | Additional information about the given source material available. (String) |

    File: comparisons_train.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split ...
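A minimal loading sketch based on the column table above, using pandas. The partitioning by the 'split' column follows the documented schema; the helper name is ours:

```python
import pandas as pd

def load_comparisons(path):
    """Load one of the comparison files described above and partition its
    rows by the documented 'split' column (e.g. training vs validation)."""
    df = pd.read_csv(path)
    return {name: part for name, part in df.groupby("split")}
```

From each partition, the 'info', 'summaries', and 'choice' columns can then be paired up for reward-model training or evaluation.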

  4. NESDIS STAR Research Dataset: Model data and code associated with research...

    • fisheries.noaa.gov
    text (structured)
    Updated Jul 24, 2025
    Cite
    Yingxin Gu; Ivan Csiszar; Marina Tsidulko; Wei Guo (2025). NESDIS STAR Research Dataset: Model data and code associated with research to use multispectral information and machine learning to improve daytime fire radiative power estimation from METImage measurements [Dataset]. http://doi.org/10.5281/zenodo.16285915
    Explore at:
    text (structured)
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    National Environmental Satellite, Data, and Information Service
    Authors
    Yingxin Gu; Ivan Csiszar; Marina Tsidulko; Wei Guo
    Time period covered
    2018
    Area covered
    Description

    Model validation data files, machine learning (ML) model testing and training data files, and IDL code files associated with research to use multispectral information and machine learning to improve daytime fire radiative power estimation from METImage measurements.

  5. Training and validation dataset 2 of milling processes for time series...

    • service.tib.eu
    Updated Nov 28, 2024
    Cite
    (2024). Training and validation dataset 2 of milling processes for time series prediction - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1738
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H. The machine was upgraded in terms of control technology. Processes for model training and validation were recorded, suitable for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" (DOI 10.5445/IR/1000157789) and allows an investigation of the transferability of models between different machines.

    Technical remarks:

    Documents:
    - Design of Experiments: information on the paths as well as the technological values of the experiments
    - Recording information: information about the recordings, with comments
    - Data: all recorded datasets. The first level contains the folders for training and validation, both with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file, consisting of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
    - NC code: NC programs executed on the machine

    Experimental data:
    - Machine: retrofitted DMC 60H
    - Material: S235JR, 2007 T4
    - Tools:
      - VHM-Fräser HPC, TiSi, ⌀ f8 DC: 5 mm
      - VHM-Fräser HPC, TiSi, ⌀ f8 DC: 10 mm
      - VHM-Fräser HPC, TiSi, ⌀ f8 DC: 20 mm
      - Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5 mm
      - Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 10 mm
      - Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5 mm
    - Workpiece blank dimensions: 150 x 75 x 50 mm

    License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
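A hypothetical loader for the recordings, assuming only what the description states: each recording is a JSON file with a header (e.g. signal sources) followed by the time-series entries. The key names "header" and "entries" are assumptions, not documented above:

```python
import json

def load_recording(path):
    """Load one milling recording: a header (e.g. signal sources, assumed key
    'header') plus the recorded 500 Hz time-series samples (assumed key
    'entries')."""
    with open(path) as f:
        doc = json.load(f)
    header = doc.get("header", {})
    entries = doc.get("entries", [])
    return header, entries
```

Actual key names should be checked against the recording-information document shipped with the dataset.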

  6. Google-Fast or Slow?tile-xla valid data csv format

    • kaggle.com
    zip
    Updated Sep 2, 2023
    Cite
    Rishabh Jain (2023). Google-Fast or Slow?tile-xla valid data csv format [Dataset]. https://www.kaggle.com/datasets/rishabh15virgo/google-fast-or-slow-tile-xla-validation-dataset
    Explore at:
    zip (694187 bytes)
    Dataset updated
    Sep 2, 2023
    Authors
    Rishabh Jain
    Description

    Your goal

    Train a machine learning model based on the runtime data provided to you in the training dataset and further predict the runtime of graphs and configurations in the test dataset.

    For data understanding, EDA, and a baseline model, you can refer to my notebook:

    https://www.kaggle.com/code/rishabh15virgo/first-impression-understand-data-eda-baseline-15

    Training and Test dataset:

    Train Dataset :

    https://www.kaggle.com/datasets/rishabh15virgo/google-fast-or-slowtile-xla-train-data-csv-format

    Test Dataset :

    https://www.kaggle.com/datasets/rishabh15virgo/google-fast-or-slowtile-xla-test-data-csv-format

    Data Information

    Tile .npz files: Suppose a .npz file stores a graph (representing a kernel) with n nodes and m edges. In addition, suppose we compile the graph with c different configurations and run each on a TPU. Crucially, the configuration is at the graph level. The .npz file then stores the following dictionary:

    • Key "node_feat": float32 matrix of shape (n, 140). The u-th row contains the feature vector for node u < n. Nodes are ordered topologically.
    • Key "node_opcode": int32 vector of shape (n,). The u-th entry stores the opcode of node u.
    • Key "edge_index": int32 matrix of shape (m, 2). If entry i is (u, v), there is a directed edge from node u to node v, where u consumes the output of v.
    • Key "config_feat": float32 matrix of shape (c, 24), with row j containing the (graph-level) configuration feature vector.
    • Keys "config_runtime" and "config_runtime_normalizers": int64 vectors of length c. Entry j stores the runtime (in nanoseconds) of the graph compiled with configuration j and with a default configuration, respectively. Samples from the same graph may have slightly different "config_runtime_normalizers" because they are measured in different runs on multiple machines.

    Finally, for the tile collection, your job is to predict the indices of the best configurations (i.e., the ones leading to the smallest d["config_runtime"] / d["config_runtime_normalizers"]).
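Given that schema, the stated prediction target (the smallest normalized runtime) can be computed directly with NumPy; the function name and the top-k behavior here are illustrative:

```python
import numpy as np

def best_config_indices(npz_path, k=5):
    """Rank the graph-level configurations in one tile .npz file by
    normalized runtime (config_runtime / config_runtime_normalizers),
    smallest first, and return the indices of the k fastest."""
    d = np.load(npz_path)
    ratio = d["config_runtime"] / d["config_runtime_normalizers"].astype(np.float64)
    return np.argsort(ratio)[:k]
```

A model for this task would be trained to predict this ranking from "node_feat", "node_opcode", "edge_index", and "config_feat" without running the configurations.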

  7. Data from: Isometric Stratified Ensembles: A Partial and Incremental...

    • figshare.com
    xlsx
    Updated Jun 14, 2023
    Cite
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux (2023). Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation [Dataset]. http://doi.org/10.1021/acs.jcim.2c00293.s002
    Explore at:
    xlsx
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    ACS Publications
    Authors
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy for categorizing the classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set allows one to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, isometric stratified ensembles (ISE) apply a recursive decorrelated variable selection and consider the cardinal ratio of classes to balance training sets, thus avoiding bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance in classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when only its highest-confidence isometric stratum is retained.

  8. Student Performance Prediction Data set

    • dataverse.harvard.edu
    Updated May 31, 2020
    Cite
    Ephrem Admasu Yekun (2020). Student Performance Prediction Data set [Dataset]. http://doi.org/10.7910/DVN/WHBU4P
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 31, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Ephrem Admasu Yekun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We use the data set for training, validation, and testing of high school student performance prediction.

  9. Map georeferencing challenge training and validation data | gimi9.com

    • gimi9.com
    Updated Dec 23, 2023
    Cite
    (2023). Map georeferencing challenge training and validation data | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_map-georeferencing-challenge-training-and-validation-data/
    Explore at:
    Dataset updated
    Dec 23, 2023
    Description

    Extracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious process involving manual human effort. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training and validation data from the map georeferencing challenge are provided here, as well as competition details and a baseline solution. The data were derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.

  10. Training and development dataset for information extraction in plant...

    • entrepot.recherche.data.gouv.fr
    zip
    Updated Feb 20, 2025
    Cite
    MaIAGE; Plateforme ESV; MaIAGE; Plateforme ESV (2025). Training and development dataset for information extraction in plant epidemiomonitoring [Dataset]. http://doi.org/10.57745/ZDNOGF
    Explore at:
    zip (479001)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    MaIAGE; Plateforme ESV; MaIAGE; Plateforme ESV
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/ZDNOGF

    Dataset funded by
    INRAE
    Agence nationale de la recherche
    PIA DATAIA
    Description

    The “Training and development dataset for information extraction in plant epidemiomonitoring” is the annotation set of the “Corpus for the epidemiomonitoring of plant”. The annotations cover seven entity types (e.g. species, locations, diseases), their normalisation against the NCBI taxonomy and GeoNames, and binary (seven types) and ternary relationships. The annotations refer to character positions within the documents of the corpus. The annotation guidelines give their definitions and representative examples. Both datasets are intended for the training and validation of information extraction methods.

  11. Two residential districts datasets from Kielce, Poland for building semantic...

    • scidb.cn
    Updated Sep 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agnieszka Łysak (2022). Two residential districts datasets from Kielce, Poland for building semantic segmentation task [Dataset]. http://doi.org/10.57760/sciencedb.02955
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Agnieszka Łysak
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Kielce, Poland
    Description

    Today, deep neural networks are widely used in many computer vision problems, including for geographic information systems (GIS) data. This type of data is commonly used for urban analyses and spatial planning. We used orthophotographic images of two residential districts from Kielce, Poland for research including automatic urban sprawl analysis with a Transformer-based neural network. Orthophotomaps were obtained from the Kielce GIS portal. The map was then manually masked into building and building-surroundings classes. Finally, the orthophotomap and the corresponding classification mask were simultaneously divided into small tiles. This approach is common in image data preprocessing for the learning phase of machine learning algorithms. The data contain the two original orthophotomaps from the Wietrznia and Pod Telegrafem residential districts with corresponding masks, as well as their tiled versions, ready to serve as training data for machine learning models. The Transformer-based neural network underwent a training process on the Wietrznia dataset, targeting semantic segmentation of the tiles into building and surroundings classes. After that, inference was used to test the model's generalization ability on the Pod Telegrafem dataset. The efficiency of the model was satisfying, so it can be used for automatic semantic building segmentation. The process of dividing the images can then be reversed and the complete classification mask retrieved. This mask can be used for calculating building areas and for urban sprawl monitoring, if the research were repeated for GIS data over a wider time horizon. Since the dataset was collected from the Kielce GIS portal, as part of the Polish Main Office of Geodesy and Cartography data resource, it may be used only for non-profit and non-commercial purposes, in private or scientific applications, under the law "Ustawa z dnia 4 lutego 1994 r. o prawie autorskim i prawach pokrewnych (Dz.U. z 2006 r. nr 90 poz 631 z późn. zm.)". There are no other legal or ethical considerations regarding reuse potential.

    Data information:
    - wietrznia_2019.jpg - orthophotomap of Wietrznia district - used for model's training, as an explanatory image
    - wietrznia_2019.png - classification mask of Wietrznia district - used for model's training, as a target image
    - wietrznia_2019_validation.jpg - one image from Wietrznia district - used for model's validation during the training phase
    - pod_telegrafem_2019.jpg - orthophotomap of Pod Telegrafem district - used for model's evaluation after the training phase
    - wietrznia_2019 - folder with wietrznia_2019.jpg (image) and wietrznia_2019.png (annotation), divided into 810 tiles (512 x 512 pixels each); tiles with no information were manually removed, so the training data contain only informative tiles; these tiles were presented to the model during training (images and annotations for fitting the model to the data)
    - wietrznia_2019_validation - folder with wietrznia_2019_validation.jpg divided into 16 tiles (256 x 256 pixels each); these tiles were presented to the model during training (images for validating the model's efficiency) but were not part of the training data
    - pod_telegrafem_2019 - folder with pod_telegrafem.jpg divided into 196 tiles (256 x 256 pixels each); these tiles were presented to the model during inference (images for evaluating the model's robustness)

    The dataset was created as described below. First, the orthophotomaps were collected from the Kielce Geoportal (https://gis.kielce.eu). The Kielce Geoportal offers a .pst recent map from April 2019. It is an orthophotomap with a resolution of 5 x 5 pixels, constructed from a plane flight at 700 meters above ground, taken with a camera for vertical photos. Downloading was done via WMS in the open-source QGIS software (https://www.qgis.org), as a 1:500 scale map, which was then converted to a 1200 dpi PNG image. Second, the map of the Wietrznia residential district was manually labelled, also in QGIS, at the same scope as the orthophotomap. Annotation was based on land cover map information, also obtained from the Kielce Geoportal. There are two classes: residential building and surroundings. The second map, from the Pod Telegrafem district, was not annotated, since it was used in the testing phase and imitates the situation where no annotation exists for new data presented to the model. Next, the images were converted to RGB JPG images, and the annotation map was converted to an 8-bit grayscale PNG image. Finally, the Wietrznia data files were tiled into 512 x 512 pixel tiles using the Python PIL library. Tiles with no information or relatively little information (only white background or mostly white background) were manually removed, so from the 29113 x 15938 pixel orthophotomap, only 810 tiles with corresponding annotations were left, ready to train the machine learning model for the semantic segmentation task. The Pod Telegrafem orthophotomap was tiled without manual removal, so the 7168 x 7168 pixel orthophotomap yielded 197 tiles of 256 x 256 pixel resolution. There was also an image of one residential building, used for the model's validation during the training phase; it was not part of the training data but was part of the Wietrznia residential area. It was a 2048 x 2048 pixel orthophotomap, tiled into 16 tiles of 256 x 256 pixels each.
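The tiling step described above can be sketched with Pillow (PIL). The tile size matches the description; the file naming and the manual removal of uninformative tiles are not reproduced here:

```python
import os
from PIL import Image

def tile_image(src_path, out_dir, tile=512):
    """Cut an orthophotomap into non-overlapping tile x tile crops,
    as in the preprocessing described above."""
    Image.MAX_IMAGE_PIXELS = None  # orthophotomaps exceed PIL's default safety limit
    img = Image.open(src_path)
    os.makedirs(out_dir, exist_ok=True)
    w, h = img.size
    count = 0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            crop = img.crop((x, y, x + tile, y + tile))
            crop.save(os.path.join(out_dir, f"tile_{y}_{x}.png"))
            count += 1
    return count
```

In the dataset itself, an additional manual pass removed tiles that were only (or mostly) white background before training.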

  12. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
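The per-group sampling rule described above (select at random until a group is exhausted or the cap is reached) can be sketched as follows; the record structure and the "species"/"grid_cell" field names are assumptions for illustration, not the NABat database schema:

```python
import random
from collections import defaultdict

def sample_per_group(records, cap=1250, seed=0):
    """Randomly select up to `cap` recordings per (species, grid_cell) group.

    `records` is a list of dicts with hypothetical keys "species" and
    "grid_cell". Groups smaller than the cap are used in full.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["species"], rec["grid_cell"])].append(rec)
    rng = random.Random(seed)
    selected = []
    for group in groups.values():
        rng.shuffle(group)          # random order within the group
        selected.extend(group[:cap])  # take at most `cap` recordings
    return selected
```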

  13. MSL Curiosity Rover Images with Science and Engineering Classes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 17, 2020
    Cite
    Steven Lu; Steven Lu; Kiri L. Wagstaff; Kiri L. Wagstaff (2020). MSL Curiosity Rover Images with Science and Engineering Classes [Dataset]. http://doi.org/10.5281/zenodo.4033453
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven Lu; Steven Lu; Kiri L. Wagstaff; Kiri L. Wagstaff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.

    Data Set Description

    The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.

    Directory Contents

    • images - contains all 6,820 images
    • class_map.csv - string-integer class mappings
    • train-set-v2.1.txt - label file for the training set
    • val-set-v2.1.txt - label file for the validation set
    • test-set-v2.1.txt - label file for the test set

    The label files are formatted as below:

    "Image-file-name class_in_integer_representation"

    Labeling Process

    Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:

    • If all three labels agree with each other, then use the label as the final label.
    • If the three labels do not agree with each other, then we manually review the labels and decide the final label.
    • We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.
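The consensus rule described above can be expressed as a small helper; returning `None` to flag an image for manual review is an illustrative choice, not part of the released data:

```python
def resolve_label(votes):
    """Apply the consensus rule: if all annotator labels agree, that label
    is final; otherwise the image is flagged for manual review.

    Returns (label, needs_review); label is None when review is needed.
    """
    if len(set(votes)) == 1:
        return votes[0], False  # unanimous agreement
    return None, True           # disagreement: manual review required
```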

    Classes

    There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of the classes, string-integer mappings, and distributions are shown below:

    Class name, counts (training set), counts (validation set), counts (test set), integer representation

    Arm cover, 10, 1, 4, 0
    Other rover part, 190, 11, 10, 1
    Artifact, 680, 62, 132, 2
    Nearby surface, 1554, 74, 187, 3
    Close-up rock, 1422, 50, 84, 4
    DRT, 8, 4, 6, 5
    DRT spot, 214, 1, 7, 6
    Distant landscape, 342, 14, 34, 7
    Drill hole, 252, 5, 12, 8
    Night sky, 40, 3, 4, 9
    Float, 190, 5, 1, 10
    Layers, 182, 21, 17, 11
    Light-toned veins, 42, 4, 27, 12
    Mastcam cal target, 122, 12, 29, 13
    Sand, 228, 19, 16, 14
    Sun, 182, 5, 19, 15
    Wheel, 212, 5, 5, 16
    Wheel joint, 62, 1, 5, 17
    Wheel tracks, 26, 3, 1, 18

    Image Augmentation

    Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using a horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.

    • 90 degrees clockwise rotation (file name ends with -r90.jpg)
    • 180 degrees clockwise rotation (file name ends with -r180.jpg)
    • 270 degrees clockwise rotation (file name ends with -r270.jpg)
    • Horizontal flip (file name ends with -fh.jpg)
    • Vertical flip (file name ends with -fv.jpg)
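The five operations and their file-name suffixes can be sketched with Pillow; note that `rotate(-90, ...)` produces a clockwise rotation, since Pillow's rotation angle is counter-clockwise:

```python
from PIL import Image, ImageOps

# Suffix-to-operation map mirroring the file-name conventions listed above.
AUGMENTATIONS = {
    "-r90": lambda im: im.rotate(-90, expand=True),   # 90 degrees clockwise
    "-r180": lambda im: im.rotate(180, expand=True),  # 180 degrees
    "-r270": lambda im: im.rotate(90, expand=True),   # 270 degrees clockwise
    "-fh": ImageOps.mirror,                           # horizontal flip
    "-fv": ImageOps.flip,                             # vertical flip
}

def augment(image, stem):
    """Return (new_stem, augmented_image) pairs following the suffix scheme."""
    return [(stem + suffix, op(image)) for suffix, op in AUGMENTATIONS.items()]
```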

    Acknowledgment

    The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.

  14. Feature data of training set and verification set in utLIFE-PC article

    • scidb.cn
    Updated Oct 12, 2024
    Cite
    LOU; Xing Nianzeng (2024). Feature data of training set and verification set in utLIFE-PC article [Dataset]. http://doi.org/10.57760/sciencedb.14508
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    LOU; Xing Nianzeng
    Description

    File description: 1. train.Mutation_Meth_CNV_data.xls: the feature matrix file used in the training model, including sample name, point mutation data, methylation data and CNV data. The first column must be the sample name. 2. train.sample_label.xls: pathological information of the training set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer. 3. validation.Mutation_Meth_CNV_data.xls: the feature matrix file used in the validation set, including sample name, point mutation data, methylation data and CNV data. The first column must be the sample name. 4. validation.sample_label.xls: pathological information of the validation set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
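Once the feature matrix and label files are loaded (e.g. with pandas), aligning them by sample name might look like this; the column names "sample" and "label" are assumptions about the files' layout, not names taken from the release:

```python
import pandas as pd

def align_features_labels(features, labels):
    """Inner-join a feature matrix (first column = sample name) with its
    label table, then split into X (features) and y (0/1 labels).
    """
    merged = features.merge(labels, on="sample", how="inner")
    X = merged.drop(columns=["sample", "label"])
    y = merged["label"]
    return X, y
```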

  15. Bootstrap Cross-Validation Improves Model Selection in Pharmacometrics

    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Cite
    James Stephens Cavenaugh (2023). Bootstrap Cross-Validation Improves Model Selection in Pharmacometrics [Dataset]. http://doi.org/10.6084/m9.figshare.13194899.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    James Stephens Cavenaugh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cross-validation assesses the predictive ability of a model, allowing one to rank models accordingly. Although the nonparametric bootstrap is almost always used to assess the variability of a parameter, it can be used as the basis for cross-validation if one keeps track of which items were not selected in a given bootstrap iteration. The items which were selected constitute the training data and the omitted items constitute the testing data. This bootstrap cross-validation (BS-CV) allows model selection to be made on the basis of predictive ability by comparing the median values of ensembles of summary statistics of testing data. BS-CV is herein demonstrated using several summary statistics, including a new one termed the simple metric for prediction quality, and using the warfarin data included in the Monolix distribution with 13 pharmacokinetics (PK) models and 12 pharmacodynamics models. Of note the two best PK models by Akaike’s information criterion (AIC) had the worst predictive ability, underscoring the danger of using single realizations of a random variable (such as AIC) as the basis for model selection. Using these data BS-CV was able to discriminate between similar indirect response models (inhibition of input vs. stimulation of output). This could be useful in situations in which the mechanism of action is unknown (unlike warfarin).
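The core bookkeeping of BS-CV — fit on each bootstrap resample, score on the omitted (out-of-bag) items, and compare the medians of the resulting score ensembles across candidate models — can be sketched generically. Here `fit` and `score` are caller-supplied stand-ins for the pharmacometric model and the summary statistic:

```python
import random
import statistics

def bootstrap_cv(data, fit, score, n_boot=200, seed=0):
    """Bootstrap cross-validation sketch.

    For each bootstrap iteration, the sampled items form the training data
    and the out-of-bag items form the testing data. Returns the median
    out-of-bag score, the quantity BS-CV compares between models.
    """
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sampled = set(idx)
        oob = [i for i in range(n) if i not in sampled]  # omitted items
        if not oob:
            continue  # rare: every item was sampled
        model = fit([data[i] for i in idx])
        scores.append(score(model, [data[i] for i in oob]))
    return statistics.median(scores)
```

Ranking models by the median of an ensemble of scores, rather than a single realization such as AIC, is exactly the point the abstract makes about model selection.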

  16. Data from: DCASE 2021 Task 5: Few-shot Bioacoustic Event Detection...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Sep 4, 2021
    Cite
    Morfi, Veronica; Stowell, Dan; Lostanlen, Vincent; Strandburg-Peshkin, Ariana; Gill, Lisa; Pamula, Hanna; Benvent, David; Nolasco, Ines; Singh, Shubhr; Sridhar, Sripathi; Duteil, Mathieu; Farnsworth, Andrew (2021). DCASE 2021 Task 5: Few-shot Bioacoustic Event Detection Development Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4543503
    Explore at:
    Dataset updated
    Sep 4, 2021
    Dataset provided by
    AGH University of Science and Technology
    Cornell Lab of Ornithology
    University of Konstanz & Max Planck Institute of Animal Behavior
    Centre National de la Recherche Scientifique (CNRS)
    BIOTOPIA Naturkundemuseum Bayern
    Queen Mary University of London
    University of Konstanz
    Authors
    Morfi, Veronica; Stowell, Dan; Lostanlen, Vincent; Strandburg-Peshkin, Ariana; Gill, Lisa; Pamula, Hanna; Benvent, David; Nolasco, Ines; Singh, Shubhr; Sridhar, Sripathi; Duteil, Mathieu; Farnsworth, Andrew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    The development set for task 5 of DCASE 2021 "Few-shot Bioacoustic Event Detection" consists of 19 audio files acquired from different bioacoustic sources. The dataset is split into training and validation sets.

    Multi-class annotations are provided for the training set, with positive (POS), negative (NEG) and unknown (UNK) values for each class. UNK indicates uncertainty about a class.

    Single-class (class of interest) annotations are provided for the validation set, with events marked as positive (POS) or unknown (UNK) for the class of interest.

    Folder Structure

    Development_Set.zip

    |_Development_Set/
        |_Training_Set/
            |_BV/
                |_*.wav
                |_*.csv
            |_HT/
                |_*.wav
                |_*.csv
            |_JD/
                |_*.wav
                |_*.csv
            |_MT/
                |_*.wav
                |_*.csv
        |_Validation_Set/
            |_HV/
                |_*.wav
                |_*.csv
            |_PB/
                |_*.wav
                |_*.csv
    

    Development_Set_Audio.zip has the same structure but contains only the *.wav files.

    Development_Set_Annotations.zip has the same structure but contains only the *.csv files

    Dataset statistics

    Some statistics on this dataset are as follows, split between training and validation set and their sub-folders:

    TRAINING SET

    Number of audio recordings | 11
    Total duration | 14 hours and 20 mins
    Total classes (excl. UNK) | 19
    Total events (excl. UNK) | 4,686

    TRAINING SET/BV

    Number of audio recordings | 5
    Total duration | 10 hours
    Total classes (excl. UNK) | 11
    Total events (excl. UNK) | 2,662
    Sampling rate | 24,000 Hz

    TRAINING SET/HT

    Number of audio recordings | 3
    Total duration | 3 hours
    Total classes (excl. UNK) | 3
    Total events (excl. UNK) | 435
    Sampling rate | 6,000 Hz

    TRAINING SET/JD

    Number of audio recordings | 1
    Total duration | 10 mins
    Total classes (excl. UNK) | 1
    Total events (excl. UNK) | 355
    Sampling rate | 22,050 Hz

    TRAINING SET/MT

    Number of audio recordings | 2
    Total duration | 1 hour and 10 mins
    Total classes (excl. UNK) | 4
    Total events (excl. UNK) | 1,234
    Sampling rate | 8,000 Hz

    VALIDATION SET

    Number of audio recordings | 8
    Total duration | 5 hours
    Total classes (excl. UNK) | 4
    Total events (excl. UNK) | 310

    VALIDATION SET/HV

    Number of audio recordings | 2
    Total duration | 2 hours
    Total classes (excl. UNK) | 2
    Total events (excl. UNK) | 50
    Sampling rate | 6,000 Hz

    VALIDATION SET/PB

    Number of audio recordings | 6
    Total duration | 3 hours
    Total classes (excl. UNK) | 2
    Total events (excl. UNK) | 260
    Sampling rate | 44,100 Hz

    Annotation structure

    Each line of the annotation csv represents an event in the audio file. The column descriptions are as follows:

    TRAINING SET

    Audiofilename, Starttime, Endtime, CLASS_1, CLASS_2, ...CLASS_N

    VALIDATION SET

    Audiofilename, Starttime, Endtime, Q
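These annotation files can be read with Python's csv module. A sketch for counting POS events — for training-set files, pass one of the per-class column names (e.g. "CLASS_1") instead of the validation set's "Q":

```python
import csv

def count_pos_events(csv_path, column="Q"):
    """Count events marked POS in one annotation csv.

    Each row is one event; `column` names the class column whose value is
    POS, NEG or UNK.
    """
    with open(csv_path, newline="") as f:
        return sum(1 for row in csv.DictReader(f) if row[column] == "POS")
```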

    Classes

    DCASE2021_task5_training_set_classes.csv and DCASE2021_task5_validation_set_classes.csv provide a table with class code correspondence to class name for all classes in the Development set.

    DCASE2021_task5_training_set_classes.csv

    dataset, class_code, class_name

    DCASE2021_task5_validation_set_classes.csv

    dataset, recording, class_code, class_name

    Evaluation Set

    The Evaluation set for the same task can be found at: https://doi.org/10.5281/zenodo.5413149

    Open Access

    This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Contact info

    Please send any feedback or questions to: Veronica Morfi: g.v.morfi@qmul.ac.uk

  17. Data from: Development and validation of HBV surveillance models using big...

    • tandf.figshare.com
    docx
    Updated Dec 3, 2024
    Cite
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong (2024). Development and validation of HBV surveillance models using big data and machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.25201473.v1
    Explore at:
    docxAvailable download formats
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Weinan Dong; Cecilia Clara Da Roza; Dandan Cheng; Dahao Zhang; Yuling Xiang; Wai Kay Seto; William C. W. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital’s Hospital Information System were structuralized using state-of-the-art Natural Language Processing techniques. Several ML models have been used to develop HBV risk assessment models. The performance of the ML model was then interpreted using the Shapley value (SHAP) and validated using cohort data randomly divided at a ratio of 2:1 using a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness of fit test p-value >0.05). Suspected case detection models of HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in primary care setting in China. 
This study has developed a suspected case detection model for HBV, which can facilitate early identification and treatment of HBV in the primary care setting in China, contributing towards the achievement of the WHO's HBV elimination goals. We utilized state-of-the-art natural language processing techniques to structure the data records, leading to a robust healthcare information system that enhances the surveillance and control of HBV in China.

  18. Data from: DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Mar 31, 2024
    Cite
    Ghani, Burooj; Nolasco, Inês; Jensen, Frants; Pamula, Hanna; Whitehead, Helen; Liang, Jinhua; Singh, Shubhr; Strandburg-Peshkin, Ariana; Gill, Lisa; Morford, Joe; Emmerson, Michael; Grout, Emily; Kiskin, Ivan; Vidaña-Vila, Ester; Lostanlen, Vincent; Stowell, Dan (2024). DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection Development Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10829603
    Explore at:
    Dataset updated
    Mar 31, 2024
    Dataset provided by
    La Salle, Universitat Ramon Llull
    AGH University of Krakow
    Tilburg University
    University of Oxford
    Queen Mary University of London
    Biotopia
    Centre National de la Recherche Scientifique
    University of Konstanz
    University of Salford
    Naturalis Biodiversity Center
    Syracuse University
    University of Surrey
    Authors
    Ghani, Burooj; Nolasco, Inês; Jensen, Frants; Pamula, Hanna; Whitehead, Helen; Liang, Jinhua; Singh, Shubhr; Strandburg-Peshkin, Ariana; Gill, Lisa; Morford, Joe; Emmerson, Michael; Grout, Emily; Kiskin, Ivan; Vidaña-Vila, Ester; Lostanlen, Vincent; Stowell, Dan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description:

    The development set for task 5 of DCASE 2024 "Few-shot Bioacoustic Event Detection" consists of 217 audio files acquired from different bioacoustic sources. The dataset is split into training and validation sets.

    Multi-class annotations are provided for the training set, with positive (POS), negative (NEG) and unknown (UNK) values for each class. UNK indicates uncertainty about a class.

    Single-class (class of interest) annotations are provided for the validation set, with events marked as positive (POS) or unknown (UNK) for the class of interest.

    Folder Structure:

    Development_set.zip

    |_Development_Set/
        |_Training_Set/
            |_JD/
                |_*.wav
                |_*.csv
            |_HT/
                |_*.wav
                |_*.csv
            |_BV/
                |_*.wav
                |_*.csv
            |_MT/
                |_*.wav
                |_*.csv
            |_WMW/
                |_*.wav
                |_*.csv
        |_Validation_Set/
            |_HB/
                |_*.wav
                |_*.csv
            |_PB/
                |_*.wav
                |_*.csv
            |_ME/
                |_*.wav
                |_*.csv
            |_PB24/
                |_*.wav
                |_*.csv
            |_RD/
                |_*.wav
                |_*.csv
            |_PW/
                |_*.wav
                |_*.csv
    

    Development_set_annotations.zip has the same structure but contains only the *.csv files

    Dataset statistics

    Some statistics on this dataset are as follows, split between training and validation set and their sub-folders:

    TRAINING SET

    Number of audio recordings | 174
    Total duration | 21 hours
    Total classes | 47
    Total events | 14,229

    TRAINING SET/BV

    Number of audio recordings | 5
    Total duration | 10 hours
    Total classes | 11
    Total events | 9,026
    Sampling rate | 24,000 Hz

    TRAINING SET/HT

    Number of audio recordings | 5
    Total duration | 5 hours
    Total classes | 5
    Total events | 611
    Sampling rate | 6,000 Hz

    TRAINING SET/JD

    Number of audio recordings | 1
    Total duration | 10 mins
    Total classes | 1
    Total events | 357
    Sampling rate | 22,050 Hz

    TRAINING SET/MT

    Number of audio recordings | 2
    Total duration | 1 hour and 10 mins
    Total classes | 4
    Total events | 1,294
    Sampling rate | 8,000 Hz

    TRAINING SET/WMW

    Number of audio recordings | 161
    Total duration | 4 hours and 40 mins
    Total classes | 26
    Total events | 2,941
    Sampling rate | various sampling rates

    VALIDATION SET

    Number of audio recordings | 43
    Total duration | 49 hours and 57 minutes
    Total classes | 7
    Total events | 3,504

    VALIDATION SET/HB

    Number of audio recordings | 10
    Total duration | 2 hours and 38 minutes
    Total classes | 1
    Total events | 712
    Sampling rate | 44,100 Hz

    VALIDATION SET/PB

    Number of audio recordings | 6
    Total duration | 3 hours
    Total classes | 2
    Total events | 292
    Sampling rate | 44,100 Hz

    VALIDATION SET/ME

    Number of audio recordings | 2
    Total duration | 20 minutes
    Total classes | 2
    Total events | 73
    Sampling rate | 44,100 Hz

    VALIDATION SET/PB24

    Number of audio recordings | 4
    Total duration | 2 hours
    Total classes | 2
    Total events | 350
    Sampling rate | 44,100 Hz

    VALIDATION SET/RD

    Number of audio recordings | 6
    Total duration | 18 hours
    Total classes | 1
    Total events | 1,372
    Sampling rate | 48,000 Hz

    VALIDATION SET/PW

    Number of audio recordings | 15
    Total duration | 24 hours
    Total classes | 1
    Total events | 705
    Sampling rate | 96,000 Hz

    Annotation structure

    Each line of the annotation csv represents an event in the audio file. The column descriptions are as follows:

    TRAINING SET

    Audiofilename, Starttime, Endtime, CLASS_1, CLASS_2, ...CLASS_N

    VALIDATION SET

    Audiofilename, Starttime, Endtime, Q

    Classes

    DCASE2024_task5_training_set_classes.csv and DCASE2024_task5_validation_set_classes.csv provide a table with class code correspondence to class name for all classes in the Development set. Additionally, DCASE2024_task5_validation_set_classes.csv also provides a recording names column.

    DCASE2024_task5_training_set_classes.csv

    dataset, class_code, class_name

    DCASE2024_task5_validation_set_classes.csv

    dataset, recording, class_code, class_name

    Evaluation Set

    The Evaluation set for this task will be released on 1 June 2024.

    Open Access:

    This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Contact info:

    Please send any feedback or questions to:

    Burooj Ghani - burooj.ghani@naturalis.nl | Ines Nolasco - i.dealmeidanolasco@qmul.ac.uk

    Alternatively, join us on Slack: task-fewshot-bio-sed

  19. FATURA Dataset

    • zenodo.org
    zip
    Updated Dec 13, 2023
    Cite
    Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini; Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini (2023). FATURA Dataset [Dataset]. http://doi.org/10.5281/zenodo.10371464
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini; Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of 10000 jpg images with white backgrounds, 10000 jpg images with colored backgrounds (the same colors used in the paper) as well as 3x10000 json annotation files. The images are generated from 50 different templates. For each template, 200 images were generated. We provide annotations in three formats: our own original format, the COCO format and a format compatible with HuggingFace Transformers. Background color varies across templates but not across instances from the same template.

    In terms of objects, the dataset contains 24 different classes. The classes vary considerably in their numbers of occurrences and thus, the dataset is somewhat imbalanced.

    The annotations contain bounding box coordinates, bounding box text and object classes.

    We propose two methods for training and evaluating models. The models were trained until convergence, i.e. until the model reached optimal performance on the validation split and started overfitting. The model version used for evaluation is the one with the best validation performance.

    First Evaluation strategy:
    For each template, the generated images are randomly split into 3 subsets: training, validation and testing.
    In this scenario, the model trains on all templates and is thus tested on new images rather than new layouts.

    Second Evaluation strategy:
    The real templates are randomly split into a training set, and a common set of templates for validation and testing. All the variants created from the training templates are used as training dataset. The same is done to form the validation and testing datasets. The validation and testing sets are made up of the same templates but of different images.
    This approach tests the models' performance on different unseen templates/layouts, rather than the same templates with different content.

    We provide the data splits we used for every evaluation scenario. We also provide the background colors we used as augmentation for each template.
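The second evaluation strategy amounts to splitting at the template level before assigning images, so that validation/test layouts are never seen in training. A minimal sketch (the 60/40 fraction and the seed are illustrative, not the authors' actual split):

```python
import random

def split_by_template(templates, train_frac=0.6, seed=0):
    """Split template identifiers into a training set and a held-out set.

    All images generated from a training template go to training; images
    from held-out templates are divided between validation and testing, so
    the model is evaluated on unseen layouts.
    """
    rng = random.Random(seed)
    templates = list(templates)
    rng.shuffle(templates)
    n_train = int(len(templates) * train_frac)
    return templates[:n_train], templates[n_train:]
```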

  20. madelon

    • openml.org
    Updated May 22, 2015
    Cite
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Croissant
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon, Clopinet, 955 Creston Road, Berkeley, CA 94708, isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features called 'probes', which have no predictive power, were also added. The order of the features and patterns was randomized.
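    This construction can be sketched in a few lines of NumPy. The cluster spread, points per cluster, and probe count below are illustrative assumptions, and the real MADELON generator applied further distortions not reproduced here; the sketch only mirrors the hypercube-plus-redundant-plus-probes structure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_per_cluster = 20
n_informative = 5                 # the 5 hypercube dimensions
n_redundant = 15                  # linear combinations of the informative ones
n_probes = 480                    # distractors with no predictive power

# 32 cluster centers: the vertices of a five-dimensional hypercube,
# each vertex randomly labeled +1 or -1.
centers = np.array([[(v >> d) & 1 for d in range(n_informative)]
                    for v in range(2 ** n_informative)], dtype=float)
labels = rng.choice([-1, 1], size=len(centers))

# Sample points around each center.
X_inf = np.vstack([c + 0.1 * rng.standard_normal((n_per_cluster, n_informative))
                   for c in centers])
y = np.repeat(labels, n_per_cluster)

# Redundant features: random linear combinations of the informative ones.
X_red = X_inf @ rng.standard_normal((n_informative, n_redundant))

# Probes: pure noise; finally shuffle the feature order.
X = np.hstack([X_inf, X_red, rng.standard_normal((len(y), n_probes))])
X = X[:, rng.permutation(X.shape[1])]
```

    With these sizes the feature matrix ends up with 500 columns (5 informative + 15 redundant + 480 probes), matching MADELON's dimensionality, while only 20 columns carry any signal.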

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets, but target values are provided only for the first two (not for the test set). This dataset version therefore contains all the examples from the training and validation partitions.

    There is no attribute information provided to avoid biasing the feature selection process.

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al., 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

Lastly, test.csv contains statistics about the testing activities performed on files in the tasksource/leandojo repository. It includes the path of each tested file within the repository structure (file_path), the full name assigned to each tested file (full_name), the commit ID of the file version tested (commit), the tactics traced during testing (traced_tactics), and the URL pointing to each tested file (url).

By exploring this dataset's three CSV files - validation.csv, train.csv, and test.csv - researchers can gain crucial insights into how validation, training, and testing tasks have been organized to maintain high-quality standards within the tasksource/leandojo repository.

How to use the dataset

  • Familiarize Yourself with the Dataset Structure:

    • The dataset consists of three separate files: validation.csv, train.csv, and test.csv.
    • Each file contains multiple columns providing different information about file validation, training, and testing.
  • Explore the Columns:

    • 'file_path': This column represents the path of the file within the repository.
    • 'full_name': This column displays the full name of each file.
    • 'commit': The commit ID associated with each file is provided in this column.
    • 'traced_tactics': The tactics traced in each file are listed in this column.
    • 'url': This column provides the URL of each file.
  • Understand Each File's Purpose:

Validation.csv - This file contains information related to the validation process of files in the tasksource/leandojo repository.

Train.csv - Utilize this file if you need statistics and information regarding the training phase of files in tasksource/leandojo repository.

Test.csv - For insights into statistics and information about testing individual files within tasksource/leandojo repository, refer to this file.

  • Generate Insights & Analyze Data:
  • Once you have a clear understanding of each column's purpose, you can start analyzing the data and generating insights using statistical techniques or machine learning algorithms.
  • Explore patterns or trends by examining specific columns such as 'traced_tactics' or analyzing multiple columns together.

  • Combine Multiple Files (if necessary):

  • If required, you can merge/correlate data across different csv files based on common fields such as 'file_path', 'full_name', or 'commit'.

  • Visualize the Data (Optional):

  • To enhance your analysis, consider creating visualizations such as plots, charts, or graphs. Visualization can offer a clear representation of patterns or relationships within the dataset.

  • Obtain Further Information:

  • If you need additional details about any specific file, make use of the provided 'url' column to access further information.

Remember that this guide provides a general overview of how to utilize this dataset effectively. Feel ...
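The merge step suggested above can be sketched with pandas. The snippet uses tiny in-memory frames with the documented columns; the row values are invented placeholders, and in practice you would load the real files with pd.read_csv("train.csv") and so on.

```python
import pandas as pd

# Tiny stand-ins for train.csv and validation.csv with the documented
# columns; all values here are invented placeholders.
train = pd.DataFrame({
    "file_path": ["src/algebra/group.lean", "src/order/basic.lean"],
    "full_name": ["mul_comm", "le_refl"],
    "commit": ["abc123", "abc123"],
})
validation = pd.DataFrame({
    "file_path": ["src/algebra/group.lean"],
    "full_name": ["mul_comm"],
    "commit": ["abc123"],
})

# Correlate the splits on their common fields.
overlap = train.merge(validation, on=["file_path", "full_name", "commit"])
```

Merging on all three key columns keeps only rows whose file path, full name, and commit agree across splits, which is the kind of cross-file correlation step 5 describes.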
