73 datasets found
  1. The optimal feature subsets for testing genomes.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Dapeng Xiong; Fen Xiao; Li Liu; Kai Hu; Yanping Tan; Shunmin He; Xieping Gao (2023). The optimal feature subsets for testing genomes. [Dataset]. http://doi.org/10.1371/journal.pone.0043126.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dapeng Xiong; Fen Xiao; Li Liu; Kai Hu; Yanping Tan; Shunmin He; Xieping Gao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A-F are E. coli K12, E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029, and S. agalactiae 2603, respectively. “Yes” indicates that the corresponding feature is included in the optimal feature subset.

  2. Feature selection performance of different methods on the real datasets.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Rahi Jain; Wei Xu (2023). Feature selection performance of different methods on the real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0246159.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rahi Jain; Wei Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature selection performance of different methods on the real datasets.

  3. RMSE performance of different methods on the real datasets for test data.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Rahi Jain; Wei Xu (2023). RMSE performance of different methods on the real datasets for test data. [Dataset]. http://doi.org/10.1371/journal.pone.0246159.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rahi Jain; Wei Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RMSE performance of different methods on the real datasets for test data.

  4. Data from: Test case selection through novel methodologies for software...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Apr 1, 2025
    Cite
    Sekar Kidambi Raju; Sathiamoorthy Gopalan; Sanjay Kumar; Arunkumar Sukumar; Areej Alasiry; Raja Marappan; Mehrez Marzougui; Anteneh Wogasso Wodajo (2025). Test case selection through novel methodologies for software application developments [Dataset]. http://doi.org/10.5061/dryad.0gb5mkm6b
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Sekar Kidambi Raju; Sathiamoorthy Gopalan; Sanjay Kumar; Arunkumar Sukumar; Areej Alasiry; Raja Marappan; Mehrez Marzougui; Anteneh Wogasso Wodajo
    Time period covered
    Jan 1, 2023
    Description

    Test case selection aims to minimize the time and effort spent on software testing in real-world practice. Software firms need techniques that finish testing within a stipulated time without compromising quality. The goal is to select a subset of test cases, rather than running all available ones, that uncovers most of the bugs. Test cases are clustered using ranking and similarity coefficients, and the experimental results show that the proposed techniques catch errors in a comparatively shorter duration. In this research, eleven different features were considered to cluster the test cases, and two methodologies were implemented. In the first, each cluster covers a set of specific features to a certain percentage; depending on feature coverage, a cluster of test cases can be selected. These clusters were formed using a ranking methodology. In the sec...

  5. Bank Marketing Dataset (UCI) - Test Upload

    • invenio01-demo.tugraz.at
    zip
    Updated Apr 8, 2025
    Cite
    S. Moro; P. Rita; P. Cortez (2025). Bank Marketing Dataset (UCI) - Test Upload [Dataset]. http://doi.org/10.24432/c5k306
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    UCI Machine Learning Repository
    Authors
    S. Moro; P. Rita; P. Cortez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is related to direct marketing campaigns conducted by a Portuguese banking institution, with campaigns relying on phone calls. Often multiple contacts with the same client were necessary to determine whether they would subscribe ('yes') or not ('no') to a bank term deposit. The dataset includes four files:

    1. bank-additional-full.csv: Contains all 41,188 examples with 20 input features, organized chronologically from May 2008 to November 2010, closely aligned with the data analyzed in [Moro et al., 2014].
    2. bank-additional.csv: A subset of 4,119 examples (10% of the full data), randomly selected, with 20 input features.
    3. bank-full.csv: The older version of the dataset, comprising all examples (41,188) with 17 input features, also organized chronologically.
    4. bank.csv: A 10% random subset of the older version, containing 4,119 examples and 17 input features.

    The smaller subsets are designed for testing computationally intensive machine learning algorithms (e.g., SVM). The primary classification objective is to predict whether a client will subscribe to a term deposit ('yes' or 'no'), based on the target variable y.
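
    A minimal loading sketch with pandas, assuming the usual UCI formatting (semicolon-separated values, a string target column y); the inline two-row sample below is an invented stand-in for the real files so the snippet is self-contained:

```python
import io
import pandas as pd

# Tiny inline sample mimicking the bank.csv schema (columns illustrative).
sample = io.StringIO(
    'age;job;marital;y\n'
    '30;"services";"married";"no"\n'
    '42;"technician";"single";"yes"\n'
)
df = pd.read_csv(sample, sep=';')
# For the real data: df = pd.read_csv('bank-additional-full.csv', sep=';')

# Binarize the target for classification: 'yes' -> 1, 'no' -> 0.
df['y'] = (df['y'] == 'yes').astype(int)
X = df.drop(columns='y')
y = df['y']
print(y.tolist())  # [0, 1]
```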

  6. Data from: Dataset of the paper “A multistart tabu search–based method for...

    • investigacion.ubu.es
    Updated 2023
    Cite
    Pacheco Bonrostro, Joaquin; Saiz Vázquez, Olalla; Casado Yusta, Silvia; Ubillos Landa, Silvia (2023). Dataset of the paper “A multistart tabu search–based method for feature selection in medical applications". Scientific Reports, 13, 17140 [Dataset]. https://investigacion.ubu.es/documentos/682b043be0cd0116a732e8dc
    Explore at:
    Dataset updated
    2023
    Authors
    Pacheco Bonrostro, Joaquin; Saiz Vázquez, Olalla; Casado Yusta, Silvia; Ubillos Landa, Silvia
    Description

    In the design of classification models, irrelevant or noisy features are often generated. In some cases, there may even be negative interactions among features. These weaknesses can degrade the performance of the models. Feature selection is a task that searches for a small subset of relevant features from the original set that generates the most efficient models possible. In addition to improving the efficiency of the models, feature selection confers other advantages, such as greater ease in the generation of the necessary data as well as clearer and more interpretable models. In the case of medical applications, feature selection may help to distinguish which characteristics, habits, and factors have the greatest impact on the onset of diseases. However, feature selection is a complex task due to the large number of possible solutions. In the last few years, methods based on different metaheuristic strategies, mainly evolutionary algorithms, have been proposed. The motivation of this work is to develop a method that outperforms previous methods, with the benefits that this implies, especially in the medical field. More precisely, the present study proposes a simple method based on tabu search and multistart techniques. The proposed method was analyzed and compared to other methods by testing their performance on several medical databases. Specifically, eight databases from the well-known University of California, Irvine repository and one of our own design were used. In these computational tests, the proposed method outperformed other recent methods as gauged by various metrics and classifiers. The analyses were accompanied by statistical tests, which showed that the superiority of our method is statistically significant, strengthening these conclusions.
    In short, the contribution of this work is the development of a method that, on the one hand, is based on different strategies than those used in recent methods and, on the other hand, improves on their performance.
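
    As an illustration only, a toy multistart tabu search for feature selection might look like the sketch below; the single-feature-flip neighborhood, tabu tenure, and cross-validation scoring are assumptions, not the authors' implementation, and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           random_state=0)
n = X.shape[1]

def cv_score(mask):
    # Score a candidate feature subset by 3-fold CV accuracy; empty -> 0.
    if not mask.any():
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=500),
                           X[:, mask], y, cv=3).mean()

best_mask, best_score = None, -1.0
for _ in range(2):                          # multistart: random restarts
    mask = rng.random(n) < 0.5              # random initial subset
    tabu = []                               # recently flipped features
    for _ in range(8):
        # Neighborhood: flip one non-tabu feature; accept the best move.
        moves = [j for j in range(n) if j not in tabu]
        s, j = max((cv_score(mask ^ (np.arange(n) == j)), j) for j in moves)
        mask = mask ^ (np.arange(n) == j)
        tabu = (tabu + [j])[-3:]            # tabu tenure of 3
        if s > best_score:
            best_score, best_mask = s, mask.copy()
print(round(best_score, 3), int(best_mask.sum()))
```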

  7. Training and test datasets for the PredictONCO tool

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 14, 2023
    + more versions
    Cite
    Sterba, Jaroslav (2023). Training and test datasets for the PredictONCO tool [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10013763
    Explore at:
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Pinto, Gaspar
    Damborsky, Jiri
    Bednar, David
    Planas-Iglesias, Joan
    Stourac, Jan
    Dobias, Adam
    Szotkowska, Veronika
    Sterba, Jaroslav
    Mazurenko, Stanislav
    Borko, Simeon
    Slaby, Ondrej
    Pokorna, Petra
    Khan, Rayyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used for training and validating the PredictONCO web tool, supporting decision-making in precision oncology by extending the bioinformatics predictions with advanced computing and machine learning. The dataset consists of 1073 single-point mutants of 42 proteins, whose effect was classified as Oncogenic (509 data points) and Benign (564 data points). All mutations were annotated with a clinically verified effect and were compiled from the ClinVar and OncoKB databases. The dataset was manually curated based on the available information in other precision oncology databases (The Clinical Knowledgebase by The Jackson Laboratory, Personalized Cancer Therapy Knowledge Base by MD Anderson Cancer Center, cBioPortal, DoCM database) or in the primary literature. To create the dataset, we also removed any possible overlaps with the data points used in the PredictSNP consensus predictor and its constituents. This was implemented to avoid any test set data leakage due to using the PredictSNP score as one of the features (see below).

    The entire dataset (SEQ) was further annotated by the pipeline of PredictONCO. Briefly, the following six features were calculated regardless of the structural information available: essentiality of the mutated residue (yes/no), the conservation of the position (the conservation grade and score), the domain where the mutation is located (cytoplasmic, extracellular, transmembrane, other), the PredictSNP score, and the number of essential residues in the protein. For approximately half of the data (STR: 377 and 76 oncogenic and benign data points, respectively), the structural information was available, and six more features were calculated: FoldX and Rosetta ddg_monomer scores, whether the residue is in the catalytic pocket (identification of residues forming the ligand-binding pocket was obtained from P2Rank), and the pKa changes (the minimum and maximum changes as well as the number of essential residues whose pKa was changed – all values obtained from PROPKA3). For both STR and SEQ datasets, 20% of the data was held out for testing. The data split was implemented at the position level to ensure that no position from the test data subset appears in the training data subset.
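
    The position-level hold-out described above can be sketched with scikit-learn's GroupShuffleSplit; the group labels below are synthetic stand-ins for (protein, position) pairs, since the original tooling is not specified:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
positions = rng.integers(0, 50, size=300)   # fake (protein, position) ids
X = rng.normal(size=(300, 6))               # six sequence-level features

# 20% of the data held out, grouped so that mutants sharing a position
# never straddle the train/test boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=positions))

# No position appears in both subsets.
overlap = set(positions[train_idx]) & set(positions[test_idx])
print(len(overlap))  # 0
```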

    For more details about the tool, please visit the help page or get in touch with us.

    14-Dec-2023 update: the file with features PredictONCO-features.txt now includes UniProt IDs, transcripts, PDB codes, and mutations.

  8. Supplementary Materials for Optimizing Real-Time Phenotyping in Critical...

    • data.mendeley.com
    Updated Jul 29, 2025
    Cite
    Piotr Picheta (2025). Supplementary Materials for Optimizing Real-Time Phenotyping in Critical Care Using Machine Learning on Electronic Health Records [Dataset]. http://doi.org/10.17632/n4jn62rh2m.1
    Explore at:
    Dataset updated
    Jul 29, 2025
    Authors
    Piotr Picheta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the study "Optimizing Real-Time Phenotyping in Critical Care Using Machine Learning on Electronic Health Records," which hypothesizes that a patient's latent disease state can be continuously and accurately estimated from real-time biomedical signals without requiring full ICU trajectories. It supports replication and evaluation of our predictive framework, which dynamically models phenotype probabilities as data accumulates. All elements are reported in line with the TRIPOD statement to ensure transparency and reproducibility.

    The training and test data are derived from the MIMIC-IV database and consist of vectorized representations of multivariate, irregularly sampled biomedical time series and associated phenotype labels. These were generated through a structured pipeline that includes cohort selection, event aggregation using fixed-length time bins, and feature engineering to represent both value trends and missingness. Supplementary Tables S.1 to S.6 describe the variables used in this transformation, their sources within the EHR, aggregation methods, and descriptive statistics for both static (e.g., demographics, admission data) and dynamic (e.g., vital signs, lab results, ventilator settings) features across the train and test sets.

    Table S.7 summarizes the model’s real-time phenotyping performance using multiple evaluation perspectives. The results reveal strong generalization and early predictive value: in the (ls) setting, the model achieved good diagnostic performance (AUROC ≥ 0.8) for 69% of phenotypes and excellent performance (AUROC ≥ 0.9) for 30%. In the real-time (fs) setting—using only the earliest recorded physiological data—the model still achieved good performance for 40% of phenotypes and excellent performance for 5%, demonstrating the feasibility of early, actionable phenotyping. The intermediate (td) evaluation shows that predictive quality improves consistently as more data becomes available, supporting the framework’s ability to track dynamic disease progression in real time.

    To interpret and use the data:

    • Each patient stay is represented as a multivariate time series with associated phenotype labels.
    • Time series are aligned in fixed time intervals (e.g., 2 hours), where each variable is aggregated using statistical functions (e.g., mean, last, sum).
    • The phenotype labels correspond to ICD-9-CM diagnostic categories assigned at discharge but are used here as latent variables to be estimated continuously.
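
    The time-binning step can be sketched with pandas; the variable names and aggregation choices below are illustrative, with only the 2-hour bin width taken from the text:

```python
import numpy as np
import pandas as pd

# Synthetic irregular vitals standing in for EHR events.
rng = np.random.default_rng(0)
times = pd.date_range('2023-01-01', periods=12, freq='30min')
events = pd.DataFrame({'heart_rate': rng.normal(80, 5, 12),
                       'urine_output': rng.normal(50, 10, 12)}, index=times)

# Align events into fixed 2-hour bins; aggregate per variable.
binned = events.resample('2h').agg({'heart_rate': 'mean',
                                    'urine_output': 'sum'})
# Represent missingness explicitly as an indicator feature.
binned['heart_rate_missing'] = binned['heart_rate'].isna().astype(int)
print(binned.shape)  # (3, 3)
```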

    This dataset enables reproducibility of the results and further research in developing machine learning models for early, interpretable, and actionable phenotyping in critical care.

  9. Test Data Generation Tools Market Report | Global Forecast From 2025 To 2033...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Test Data Generation Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-test-data-generation-tools-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Test Data Generation Tools Market Outlook



    The global market size for Test Data Generation Tools was valued at USD 800 million in 2023 and is projected to reach USD 2.2 billion by 2032, growing at a CAGR of 12.1% during the forecast period. The surge in the adoption of agile and DevOps practices, along with the increasing complexity of software applications, is driving the growth of this market.



    One of the primary growth factors for the Test Data Generation Tools market is the increasing need for high-quality test data in software development. As businesses shift towards more agile and DevOps methodologies, the demand for automated and efficient test data generation solutions has surged. These tools help in reducing the time required for test data creation, thereby accelerating the overall software development lifecycle. Additionally, the rise in digital transformation across various industries has necessitated the need for robust testing frameworks, further propelling the market growth.



    The proliferation of big data and the growing emphasis on data privacy and security are also significant contributors to market expansion. With the introduction of stringent regulations like GDPR and CCPA, organizations are compelled to ensure that their test data is compliant with these laws. Test Data Generation Tools that offer features like data masking and data subsetting are increasingly being adopted to address these compliance requirements. Furthermore, the increasing instances of data breaches have underscored the importance of using synthetic data for testing purposes, thereby driving the demand for these tools.



    Another critical growth factor is the technological advancements in artificial intelligence and machine learning. These technologies have revolutionized the field of test data generation by enabling the creation of more realistic and comprehensive test data sets. Machine learning algorithms can analyze large datasets to generate synthetic data that closely mimics real-world data, thus enhancing the effectiveness of software testing. This aspect has made AI and ML-powered test data generation tools highly sought after in the market.



    Regional outlook for the Test Data Generation Tools market shows promising growth across various regions. North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major software companies. Europe is also anticipated to witness significant growth owing to strict regulatory requirements and increased focus on data security. The Asia Pacific region is projected to grow at the highest CAGR, driven by rapid industrialization and the growing IT sector in countries like India and China.



    Synthetic Data Generation has emerged as a pivotal component in the realm of test data generation tools. This process involves creating artificial data that closely resembles real-world data, without compromising on privacy or security. The ability to generate synthetic data is particularly beneficial in scenarios where access to real data is restricted due to privacy concerns or regulatory constraints. By leveraging synthetic data, organizations can perform comprehensive testing without the risk of exposing sensitive information. This not only ensures compliance with data protection regulations but also enhances the overall quality and reliability of software applications. As the demand for privacy-compliant testing solutions grows, synthetic data generation is becoming an indispensable tool in the software development lifecycle.



    Component Analysis



    The Test Data Generation Tools market is segmented into software and services. The software segment is expected to dominate the market throughout the forecast period. This dominance can be attributed to the increasing adoption of automated testing tools and the growing need for robust test data management solutions. Software tools offer a wide range of functionalities, including data profiling, data masking, and data subsetting, which are essential for effective software testing. The continuous advancements in software capabilities also contribute to the growth of this segment.



    In contrast, the services segment, although smaller in market share, is expected to grow at a substantial rate. Services include consulting, implementation, and support services, which are crucial for the successful deployment and management of test data generation tools. The increasing complexity of IT inf

  10. Dataset for "Machine learning predictions on an extensive geotechnical...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 5, 2024
    Cite
    Enrico Soranzo (2024). Dataset for "Machine learning predictions on an extensive geotechnical dataset of laboratory tests in Austria" [Dataset]. http://doi.org/10.5281/zenodo.14251191
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Enrico Soranzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 30, 2024
    Description

    This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.

    The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.

    Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.

    This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.

    Key Features:

    • Temporal Coverage: Over 20 years of data.
    • Geographical Coverage: Vienna, Lower Austria, and Burgenland.
    • Tests Included:
      • Particle Size Distribution
      • Atterberg Limits
      • Proctor Tests
      • Permeability Tests
      • Direct Shear Tests
    • Number of Variables: 24
    • Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.

    Technical Details:

    • Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies identified using Local Outlier Factor (LOF) methods in previous studies.
    • Data normalization and standardization steps are recommended for specific analyses.
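
    A minimal sketch of those two preprocessing steps with scikit-learn, on synthetic stand-ins for the soil variables (the parameter choices from the previous studies are not given, so the neighbor counts here are assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # stand-in soil variables
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing values

# KNN imputation of missing values, then LOF anomaly detection.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_imputed)  # -1 = outlier
print(np.isnan(X_imputed).sum(), int((labels == -1).sum()))
```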

    Acknowledgments:
    The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).

  11. Feature selection performance of different approaches in simulated...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Rahi Jain; Wei Xu (2023). Feature selection performance of different approaches in simulated scenarios. [Dataset]. http://doi.org/10.1371/journal.pone.0246159.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rahi Jain; Wei Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature selection performance of different approaches in simulated scenarios.

  12. Curated Breast Imaging Subset of Digital Database for Screening Mammography

    • cancerimagingarchive.net
    csv, dicom, n/a
    Updated Sep 14, 2017
    + more versions
    Cite
    The Cancer Imaging Archive (2017). Curated Breast Imaging Subset of Digital Database for Screening Mammography [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.7O02S9CY
    Explore at:
    Available download formats: csv, dicom, n/a
    Dataset updated
    Sep 14, 2017
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Sep 14, 2017
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    This CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data are also included. A manuscript describing how to use this dataset in detail is available at https://www.nature.com/articles/sdata2017177.

    Published research results from work in developing decision support systems in mammography are difficult to replicate due to the lack of a standard evaluation data set; most computer-aided diagnosis (CADx) and detection (CADe) algorithms for breast cancer in mammography are evaluated on private data sets or on unspecified subsets of public databases. Few well-curated public datasets have been provided for the mammography community. These include the DDSM, the Mammographic Imaging Analysis Society (MIAS) database, and the Image Retrieval in Medical Applications (IRMA) project. Although these public data sets are useful, they are limited in terms of data set size and accessibility.

    For example, most researchers using the DDSM do not leverage all its images for a variety of historical reasons. When the database was released in 1997, computational resources to process hundreds or thousands of images were not widely available. Additionally, the DDSM images are saved in non-standard compression files that require the use of decompression code that has not been updated or maintained for modern computers. Finally, the ROI annotations for the abnormalities in the DDSM were provided to indicate a general position of lesions, but not a precise segmentation for them. Therefore, many researchers must implement segmentation algorithms for accurate feature extraction. This makes it impossible to directly compare the performance of methods or to replicate prior results. The CBIS-DDSM collection addresses that challenge by publicly releasing a curated and standardized version of the DDSM for evaluation of future CADx and CADe systems (sometimes referred to generally as CAD) research in mammography.

    Please note that the image data for this collection is structured such that each participant has multiple patient IDs. For example, participant 00038 has 10 separate patient IDs which provide information about the scans within the IDs (e.g. Calc-Test_P_00038_LEFT_CC, Calc-Test_P_00038_RIGHT_CC_1). This makes it appear as though there are 6,671 patients according to the DICOM metadata, but there are only 1,566 actual participants in the cohort.
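
    A small sketch of recovering the true participant count from the compound patient IDs; the P_<digits> pattern is inferred from the examples above:

```python
import re

# Example compound patient IDs from the collection (the third is invented
# to show the pattern generalizes across scan types).
patient_ids = ['Calc-Test_P_00038_LEFT_CC',
               'Calc-Test_P_00038_RIGHT_CC_1',
               'Mass-Training_P_01234_LEFT_MLO']

# The participant number follows 'P_'; deduplicate to count real people.
participants = {re.search(r'P_(\d+)', pid).group(1) for pid in patient_ids}
print(sorted(participants))  # ['00038', '01234']
```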

    For scientific and other inquiries about this dataset, please contact TCIA's Helpdesk.

  13. InternVideo-VideoMAE-L verb features for Ego4D NLQ

    • zenodo.org
    zip
    Updated Dec 2, 2022
    Cite
    Guo Chen (2022). InternVideo-VideoMAE-L verb features for Ego4D NLQ [Dataset]. http://doi.org/10.5281/zenodo.7343075
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guo Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Video features for the Ego4D Natural Language Queries subset, covering the training, validation, and test sets.

    The features were extracted with VideoMAE-L pretrained on the Ego4D-Verb subset.

  14. Summary of the real datasets.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    + more versions
    Cite
    Rahi Jain; Wei Xu (2023). Summary of the real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0246159.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rahi Jain; Wei Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the real datasets.

  15. Data from: Written and spoken digits database for multimodal learning

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin
    Updated Jan 21, 2021
    + more versions
    Cite
    Lyes Khacef; Laurent Rodriguez; Benoit Miramond (2021). Written and spoken digits database for multimodal learning [Dataset]. http://doi.org/10.5281/zenodo.4452953
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lyes Khacef; Lyes Khacef; Laurent Rodriguez; Benoit Miramond; Laurent Rodriguez; Benoit Miramond
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database description:

    The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use dataset for multimodal fusion [1].

    The written digits database is the original MNIST handwritten digits database [2] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.

    The spoken digits database was extracted from Google Speech Commands [3], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
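    The 507-dimensional figure follows from straightforward framing arithmetic. A minimal sketch (the edge handling, requiring the last window to fit entirely inside the clip, is an assumption chosen to reproduce the stated 39 slots):

    ```python
    # Frame-count arithmetic behind the 507-dimensional spoken-digit vectors.
    clip_ms, win_ms, shift_ms = 1000, 50, 25        # ~1 s clips, 50 ms window, 25 ms shift
    n_frames = (clip_ms - win_ms) // shift_ms + 1   # 39 time slots
    n_coeffs = 12 + 1                               # 12 MFCC coefficients plus one energy coefficient
    dim = n_frames * n_coeffs                       # 39 x 13 = 507
    ```
    
    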

    To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [2] and [3] for the training and test subsets. Since we have fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits, giving a multimodal digits database of 70000 samples (60000 for training and 10000 for test).

    The dataset is provided in six files, as described below. If a shuffle is performed on the training or test subsets, it must be performed in unison, applying the same order to the written digits, spoken digits and labels.

    Files:

    • data_wr_train.npy: 60000 samples of 784-dimensional written digits for training;
    • data_sp_train.npy: 60000 samples of 507-dimensional spoken digits for training;
    • labels_train.npy: 60000 labels for the training subset;
    • data_wr_test.npy: 10000 samples of 784-dimensional written digits for test;
    • data_sp_test.npy: 10000 samples of 507-dimensional spoken digits for test;
    • labels_test.npy: 10000 labels for the test subset.
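    The in-unison shuffle required above can be sketched as follows, using small synthetic stand-ins for the .npy files (the index stored in column 0 and in the labels exists only to make the pairing checkable; real use would load the files named above with np.load):

    ```python
    import numpy as np

    # Synthetic stand-ins whose shapes mirror the dataset:
    # written digits are 784-d, spoken digits 507-d.
    rng = np.random.default_rng(0)
    n = 8
    data_wr = np.zeros((n, 784)); data_wr[:, 0] = np.arange(n)  # np.load("data_wr_train.npy")
    data_sp = np.zeros((n, 507)); data_sp[:, 0] = np.arange(n)  # np.load("data_sp_train.npy")
    labels = np.arange(n)                                       # np.load("labels_train.npy")

    # One permutation applied to all three arrays, so sample i keeps its
    # written digit, spoken digit and label paired after the shuffle.
    perm = rng.permutation(n)
    data_wr, data_sp, labels = data_wr[perm], data_sp[perm], labels[perm]

    # Pairing is preserved across all three arrays:
    assert np.array_equal(data_wr[:, 0], labels)
    assert np.array_equal(data_sp[:, 0], labels)
    ```
    
    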

    References:

    1. Khacef, L. et al. (2020), "Brain-Inspired Self-Organization with Cellular Neuromorphic Computing for Multimodal Unsupervised Learning".
    2. LeCun, Y. & Cortes, C. (1998), “MNIST handwritten digit database”.
    3. Warden, P. (2018), “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition”.
  16. Benchmark dataset for agricultural KGML model development with PyKGML

    • zenodo.org
    bin
    Updated Jul 15, 2025
    + more versions
    Cite
    Yufeng Yang; Yufeng Yang; LICHENG LIU; LICHENG LIU (2025). Benchmark dataset for agricultural KGML model development with PyKGML [Dataset]. http://doi.org/10.5281/zenodo.15580485
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yufeng Yang; Yufeng Yang; LICHENG LIU; LICHENG LIU
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 4, 2025
    Description

    This benchmark dataset serves as demonstration data for testing PyKGML, a Python library for the efficient development of knowledge-guided machine learning (KGML) models.

    The dataset was developed using agroecosystem data from two KGML studies:

    1. "KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N2O Emission using Data from Mesocosm Experiments".
    Licheng Liu, Shaoming Xu, Zhenong Jin*, Jinyun Tang, Kaiyu Guan, Timothy J. Griffis, Matt D. Erickson, Alexander L. Frie, Xiaowei Jia, Taegon Kim, Lee T. Miller, Bin Peng, Shaowei Wu, Yufeng Yang, Wang Zhou, Vipin Kumar.

    2. "Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems".

    Licheng Liu, Wang Zhou, Kaiyu Guan, Bin Peng, Shaoming Xu, Jinyun Tang, Qing Zhu, Jessica Till, Xiaowei Jia, Chongya Jiang, Sheng Wang, Ziqi Qin, Hui Kong, Robert Grant, Symon Mezbahuddin, Vipin Kumar, Zhenong Jin.

    All files belong to Dr. Licheng Liu, University of Minnesota (lichengl@umn.edu).
    This dataset has two parts: the CO2 data from study 1 and the N2O data from study 2. Both contain a pre-training subset and a fine-tuning subset. Data descriptions are as follows:
    1. CO2 dataset:
    • Synthetic data of ecosys:
      • 100 simulations at random corn fields in the Midwest.
      • Daily sequences over 18 years (2000-2018).
    • Field observations:
      • Eddy-covariance observations from 11 flux towers in the Midwest.
      • A total of 102 site-years of daily sequences.
    • Input variables (19):
      • Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), (max-min) air T (TDIF_AIR), max air humidity (HMAX_AIR), (max-min) air humidity (HDIF_AIR), wind speed (WIND), precipitation (PRECN).
      • Soil properties (9): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), field capacity (TFC), wilting point (TWP), saturated hydraulic conductivity (TKSat), soil organic carbon concentration (TSOC), pH (TPH), cation exchange capacity (TCEC).
      • Other (3): year (Year), crop type (Crop_Type), gross primary productivity (GPP).
    • Output variables (3):
      • Autotrophic respiration (Ra), heterotrophic respiration (Rh), net ecosystem exchange (NEE).
    2. N2O dataset:
    • Synthetic data of ecosys:
      • 1980 simulations at 99 counties x 20 N-fertilizer rates in the 3I states (Illinois, Iowa, Indiana).
      • Daily sequences over 18 years (2000-2018).
    • Field observations:
      • 6 chamber observations in a mesocosm environment facility at the University of Minnesota.
      • Daily sequences of 122 days x 3 years (2016-2018) x 1000 augmentations from hourly data at each chamber.
    • Input variables (16):
      • Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
      • Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC).
      • Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
    • Output variables (5):
      • Soil N2O fluxes (N2O_FLUX), soil CO2 fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
    Each file is a serialized Python dictionary containing the following keys and values:

    data = {
        'X_train': X_train,
        'X_test': X_test,
        'Y_train': Y_train,
        'Y_test': Y_test,
        'x_scaler': x_scaler,
        'y_scaler': y_scaler,
        'input_features': input_features,
        'output_features': output_features,
    }
    • X_train, X_test: Feature matrices for training and testing.

    • Y_train, Y_test: Target values for training and testing.

    • x_scaler: The scaler (mean, std) used for normalizing input features.
    • y_scaler: The scaler (mean, std) used for normalizing output features.

    • input_features: A list of input feature names.

    • output_features: A list of output feature names.
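    A sketch of loading one of these files, assuming standard pickle serialization (the listing does not state the exact format). A small stand-in dictionary with the documented keys is round-tripped in memory in place of a real benchmark file:

    ```python
    import pickle

    # Stand-in for one benchmark file, with the documented keys.
    # Feature names are examples taken from the variable lists above.
    data = {
        "X_train": [[0.1, 0.2]], "X_test": [[0.3, 0.4]],
        "Y_train": [1.0], "Y_test": [2.0],
        "x_scaler": (0.0, 1.0), "y_scaler": (0.0, 1.0),
        "input_features": ["RADN", "TMAX_AIR"],
        "output_features": ["CO2_FLUX"],
    }

    blob = pickle.dumps(data)    # on disk: pickle.dump(data, open(path, "wb"))
    loaded = pickle.loads(blob)  # from disk: pickle.load(open(path, "rb"))

    X_train, Y_train = loaded["X_train"], loaded["Y_train"]
    ```
    
    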

    Please download and use the latest version of this dataset, as it contains important updates.

    Contact: Dr. Licheng Liu (lichengl@umn.edu), Dr. Yufeng Yang (yang6956@umn.edu)

  17. Image Geo-localization dataset

    • kaggle.com
    Updated Feb 23, 2025
    Cite
    Hamidreza_Sj (2025). Image Geo-localization dataset [Dataset]. https://www.kaggle.com/datasets/hamidrezasj/image-geo-localization-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hamidreza_Sj
    Description

    Image Geo-Localization Dataset

    Overview

    This dataset is designed for Visual Geo-Localization (VG), also known as Visual Place Recognition (VPR). The task involves determining the geographic location of a given image by retrieving the most visually similar images from a database. This dataset provides a diverse collection of urban images, enabling researchers and practitioners to train and evaluate geo-localization models under challenging real-world conditions.

    Dataset Details

    This dataset consists of images curated for training and evaluation of visual geo-localization models. The data is drawn from multiple sources to ensure diversity in lighting conditions, perspectives, and geographical contexts.

    1️⃣ GSV-Cities (Subset)

    • Purpose: Used for training the model.
    • Description: A subset of Google Street View city images, covering various urban environments.
    • Key Features: Diverse cityscapes to facilitate robust feature learning. Includes different architectural styles, seasons, and lighting conditions.

    2️⃣ SF-XS (San Francisco Extra Small)

    • Purpose: Used for testing geo-localization models.
    • Description: A challenging dataset containing images from San Francisco, USA.
    • Key Challenges: Urban landscapes with similar-looking structures. Perspective changes due to camera viewpoints. Variations in weather and time of day.

    3️⃣ Tokyo-XS (Tokyo Extra Small)

    • Purpose: Used for testing geo-localization models.
    • Description: A dataset containing images from Tokyo, Japan, offering significant differences from Western cities.
    • Key Challenges: Cultural and architectural diversity. Extreme viewpoint and lighting variations. High-density urban scenery.

    Usage & Applications

    This dataset is ideal for:

    ✅ Training and testing deep learning models for visual geo-localization.
    ✅ Studying the impact of lighting, perspective, and cultural diversity on place recognition.
    ✅ Benchmarking retrieval-based localization methods.
    ✅ Exploring feature extraction techniques for geo-localization tasks.

    How to Use This Dataset

    1. Download the dataset.
    2. Use the GSV-Cities subset for model training.
    3. Evaluate performance on SF-XS and Tokyo-XS datasets.

    If you find this dataset useful, please consider citing it in your research or giving it an upvote on Kaggle! 🚀

  18. a

    Inria Aerial Image Labeling Dataset

    • academictorrents.com
    bittorrent
    Updated Apr 27, 2019
    Cite
    Emmanuel Maggiori and Yuliya Tarabalka and Guillaume Charpiat and Pierre Alliez (2019). Inria Aerial Image Labeling Dataset [Dataset]. https://academictorrents.com/details/cf445f6073540af0803ee345f46294f088e7bba5
    Explore at:
    Available download formats: bittorrent (20957265875 bytes)
    Dataset updated
    Apr 27, 2019
    Dataset authored and provided by
    Emmanuel Maggiori and Yuliya Tarabalka and Guillaume Charpiat and Pierre Alliez
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    The Inria Aerial Image Labeling dataset addresses a core topic in remote sensing: the automatic pixelwise labeling of aerial imagery. Dataset features:

    • Coverage of 810 km² (405 km² for training and 405 km² for testing).
    • Aerial orthorectified color imagery with a spatial resolution of 0.3 m.
    • Ground truth data for two semantic classes: building and not building (publicly disclosed only for the training subset).

    The images cover dissimilar urban settlements, ranging from densely populated areas (e.g., San Francisco’s financial district) to alpine towns (e.g., Lienz in Austrian Tyrol). Instead of splitting adjacent portions of the same images into the training and test subsets, different cities are included in each subset. For example, images over Chicago are included in the training set (and not in the test set), while images over San Francisco are included in the test set (and not in the training set). The ultimate goal of this dataset is to assess the generalization power of the techniques.

  19. f

    Description of the simulation data.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    + more versions
    Cite
    Rahi Jain; Wei Xu (2023). Description of the simulation data. [Dataset]. http://doi.org/10.1371/journal.pone.0246159.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rahi Jain; Wei Xu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the simulation data.

  20. o

    Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Jan 26, 2022
    Cite
    Hossein Keshavarz; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907002
    Explore at:
    Dataset updated
    Jan 26, 2022
    Authors
    Hossein Keshavarz; Meiyappan Nagappan
    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction", as well as the replication package. The paper is submitted to the MSR 2022 Data Showcase Track.

    The datasets are available under the directory dataset. There are 4 datasets in this directory:

    1. apachejit_total.csv: The entire dataset. Commits are specified by their identifier, and a set of commit metrics explained in the paper are provided as features. The column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: A subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: A subset of the entire dataset containing the commits from the last 3 years of data. This set is not balanced, in order to represent a real-life scenario of JIT model evaluation where the model is trained on historical data and applied to future data without modification.
    4. apachejit_test_small.csv: A subset of the test file above. Since the large test file has more than 30,000 commits, we also provide a smaller test set, still unbalanced and drawn from the last 3 years of data.

    In addition to the dataset, we provide the scripts with which we built it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, a list of required packages is provided in requirements.txt. Additionally, one filtering step requires GumTree [1]; for Java, GumTree requires Java 11, while other languages need external tools (see the GumTree documentation for an installation guide and more details).

    The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via the GitHub search API and for collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again by the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset. More specifically: git_token.py handles the GitHub API token necessary for requests to the GitHub API; collector.py performs the GitHub search; tracing changed lines and git annotate are done in gitminer.py using PyDriller; and gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree: https://github.com/GumTreeDiff/gumtree. Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014, 313-324.
    2. PyDriller: https://pydriller.readthedocs.io/en/latest/. Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.
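    A sketch of reading one of the CSV splits. Only the buggy column name is documented above, so commit_id and lines_added here are illustrative placeholders for the real schema, and a small in-memory sample stands in for the actual file:

    ```python
    import csv
    import io

    # In-memory stand-in; in practice: open("dataset/apachejit_train.csv")
    # Column names other than "buggy" are hypothetical.
    sample = io.StringIO(
        "commit_id,lines_added,buggy\n"
        "a1b2c3,10,True\n"
        "d4e5f6,250,False\n"
    )

    rows = list(csv.DictReader(sample))
    # Select the bug-inducing commits (values are strings when read via csv).
    buggy = [r for r in rows if r["buggy"] == "True"]
    ```
    
    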
