Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A-F are E. coli K12, E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029, and S. agalactiae 2603, respectively. “Yes” indicates that the corresponding feature is included in the optimal feature subset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature selection performance of different methods on the real datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RMSE performance of different methods on the real datasets for test data.
Test case selection aims to minimize the time and effort spent on software testing in real-world practice. During testing, software firms need techniques that complete testing within a stipulated time without compromising quality. The goal is to select a subset of test cases, rather than executing all available test cases, while still uncovering most of the bugs. Test cases are clustered both by ranking and by similarity coefficients. The experimental results show that the proposed techniques detect errors in a comparatively shorter duration. In this research, eleven different features were considered in order to cluster the test cases. Two methodologies were implemented. In the first methodology, each cluster covers a set of specific features to a certain percentage, and clusters of test cases are selected depending on the feature coverage. These clusters were formed using the ranking methodology. In the sec...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to direct marketing campaigns conducted by a Portuguese banking institution, with campaigns relying on phone calls. Often multiple contacts with the same client were necessary to determine whether they would subscribe ('yes') or not ('no') to a bank term deposit. The dataset includes four files:
The smaller subsets are designed for testing computationally intensive machine learning algorithms (e.g., SVM). The primary classification objective is to predict whether a client will subscribe to a term deposit ('yes' or 'no'), based on the target variable y.
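A minimal classification sketch under stated assumptions: the data are read from a semicolon-separated file (a common layout for this dataset), the file name bank-full.csv is a placeholder for one of the four files listed above, and the model choice is illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Assumed file name and separator; adjust to the actual files shipped with the dataset.
df = pd.read_csv("bank-full.csv", sep=";")

# Target variable y: 'yes'/'no' subscription to a term deposit.
X = pd.get_dummies(df.drop(columns=["y"]), drop_first=True)
y = (df["y"] == "yes").astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```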
In the design of classification models, irrelevant or noisy features are often generated. In some cases, there may even be negative interactions among features. These weaknesses can degrade the performance of the models. Feature selection is a task that searches for a small subset of relevant features from the original set that generates the most efficient models possible. In addition to improving the efficiency of the models, feature selection confers other advantages, such as greater ease in the generation of the necessary data as well as clearer and more interpretable models. In the case of medical applications, feature selection may help to distinguish which characteristics, habits, and factors have the greatest impact on the onset of diseases. However, feature selection is a complex task due to the large number of possible solutions. In the last few years, methods based on different metaheuristic strategies, mainly evolutionary algorithms, have been proposed. The motivation of this work is to develop a method that outperforms previous methods, with the benefits that this implies, especially in the medical field. More precisely, the present study proposes a simple method based on tabu search and multistart techniques. The proposed method was analyzed and compared to other methods by testing their performance on several medical databases. Specifically, eight databases from the well-known repository of the University of California, Irvine, and one database of our own design were used. In these computational tests, the proposed method outperformed other recent methods as gauged by various metrics and classifiers. The analyses were accompanied by statistical tests, the results of which showed that the superiority of our method is significant, thereby strengthening these conclusions. In short, the contribution of this work is the development of a method that, on the one hand, is based on different strategies than those used in recent methods and, on the other hand, improves on the performance of those methods.
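The paper's own implementation is not reproduced here; the following is a minimal, hedged sketch of the general idea (multistart tabu search over binary feature masks with a k-NN wrapper objective), in which the dataset, classifier, and all parameter values are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)   # stand-in for one of the medical databases
n_features = X.shape[1]

def evaluate(mask):
    """Wrapper objective: cross-validated accuracy of k-NN on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=5).mean()

def tabu_search(start, iterations=10, tenure=5):
    best = current = start.copy()
    best_score = evaluate(current)
    tabu = []                                # indices of recently flipped features
    for _ in range(iterations):
        candidates = []
        for j in range(n_features):          # explore the 1-flip neighborhood
            neighbor = current.copy()
            neighbor[j] = not neighbor[j]
            s = evaluate(neighbor)
            if j not in tabu or s > best_score:   # aspiration criterion
                candidates.append((s, j, neighbor))
        if not candidates:
            break
        s, j, current = max(candidates, key=lambda c: c[0])
        tabu.append(j)
        if len(tabu) > tenure:
            tabu.pop(0)
        if s > best_score:
            best, best_score = current.copy(), s
    return best, best_score

# Multistart: run tabu search from several random subsets and keep the best result.
best_mask, best_acc = max(
    (tabu_search(rng.random(n_features) < 0.5) for _ in range(3)),
    key=lambda r: r[1],
)
print(f"{int(best_mask.sum())} features selected, CV accuracy {best_acc:.3f}")
```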
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used for training and validating the PredictONCO web tool, supporting decision-making in precision oncology by extending the bioinformatics predictions with advanced computing and machine learning. The dataset consists of 1073 single-point mutants of 42 proteins, whose effect was classified as Oncogenic (509 data points) and Benign (564 data points). All mutations were annotated with a clinically verified effect and were compiled from the ClinVar and OncoKB databases. The dataset was manually curated based on the available information in other precision oncology databases (The Clinical Knowledgebase by The Jackson Laboratory, Personalized Cancer Therapy Knowledge Base by MD Anderson Cancer Center, cBioPortal, DoCM database) or in the primary literature. To create the dataset, we also removed any possible overlaps with the data points used in the PredictSNP consensus predictor and its constituents. This was implemented to avoid any test set data leakage due to using the PredictSNP score as one of the features (see below).
The entire dataset (SEQ) was further annotated by the pipeline of PredictONCO. Briefly, the following six features were calculated regardless of the structural information available: essentiality of the mutated residue (yes/no), the conservation of the position (the conservation grade and score), the domain where the mutation is located (cytoplasmic, extracellular, transmembrane, other), the PredictSNP score, and the number of essential residues in the protein. For approximately half of the data (STR: 377 and 76 oncogenic and benign data points, respectively), the structural information was available, and six more features were calculated: FoldX and Rosetta ddg_monomer scores, whether the residue is in the catalytic pocket (identification of residues forming the ligand-binding pocket was obtained from P2Rank), and the pKa changes (the minimum and maximum changes as well as the number of essential residues whose pKa was changed – all values obtained from PROPKA3). For both STR and SEQ datasets, 20% of the data was held out for testing. The data split was implemented at the position level to ensure that no position from the test data subset appears in the training data subset.
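The position-level split can be mimicked with a grouped hold-out, as in the hedged sketch below; the column names (uniprot_id, mutation) are assumptions about PredictONCO-features.txt rather than its documented schema.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Assumed column names; inspect PredictONCO-features.txt for the actual schema.
df = pd.read_csv("PredictONCO-features.txt", sep="\t")
df["position"] = df["uniprot_id"].astype(str) + "_" + df["mutation"].str[1:-1]  # e.g. V600 from V600E

# Hold out ~20% of the data so that no position appears in both subsets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["position"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
assert set(train["position"]).isdisjoint(test["position"])
```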
For more details about the tool, please visit the help page or get in touch with us.
14-Dec-2023 update: the file with features PredictONCO-features.txt now includes UniProt IDs, transcripts, PDB codes, and mutations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the study "Optimizing Real-Time Phenotyping in Critical Care Using Machine Learning on Electronic Health Records," which hypothesizes that a patient's latent disease state can be continuously and accurately estimated from real-time biomedical signals without requiring full ICU trajectories. It supports replication and evaluation of our predictive framework, which dynamically models phenotype probabilities as data accumulates. All elements are reported in line with the TRIPOD statement to ensure transparency and reproducibility.
The training and test data are derived from the MIMIC-IV database and consist of vectorized representations of multivariate, irregularly sampled biomedical time series and associated phenotype labels. These were generated through a structured pipeline that includes cohort selection, event aggregation using fixed-length time bins, and feature engineering to represent both value trends and missingness. Supplementary Tables S.1 to S.6 describe the variables used in this transformation, their sources within the EHR, aggregation methods, and descriptive statistics for both static (e.g., demographics, admission data) and dynamic (e.g., vital signs, lab results, ventilator settings) features across the train and test sets.
Table S.7 summarizes the model’s real-time phenotyping performance using multiple evaluation perspectives. The results reveal strong generalization and early predictive value: in the (ls) setting, the model achieved good diagnostic performance (AUROC ≥ 0.8) for 69% of phenotypes and excellent performance (AUROC ≥ 0.9) for 30%. In the real-time (fs) setting—using only the earliest recorded physiological data—the model still achieved good performance for 40% of phenotypes and excellent performance for 5%, demonstrating the feasibility of early, actionable phenotyping. The intermediate (td) evaluation shows that predictive quality improves consistently as more data becomes available, supporting the framework’s ability to track dynamic disease progression in real time.
To interpret and use the data:
- Each patient stay is represented as a multivariate time series with associated phenotype labels.
- Time series are aligned in fixed time intervals (e.g., 2 hours), where each variable is aggregated using statistical functions (e.g., mean, last, sum).
- The phenotype labels correspond to ICD-9-CM diagnostic categories assigned at discharge but are used here as latent variables to be estimated continuously.
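A hedged pandas sketch of the fixed-interval aggregation described above; the toy column names, the 2-hour bin width, and the aggregation functions are illustrative assumptions, not the exact pipeline used in the study.

```python
import pandas as pd

# Toy chart events: one row per (stay, timestamp, variable, value), as extracted from the EHR.
events = pd.DataFrame({
    "stay_id":   [1, 1, 1, 1],
    "charttime": pd.to_datetime(["2130-01-01 00:15", "2130-01-01 01:50",
                                 "2130-01-01 02:10", "2130-01-01 03:40"]),
    "variable":  ["heart_rate", "heart_rate", "heart_rate", "heart_rate"],
    "value":     [88.0, 92.0, 95.0, 90.0],
})

# Align each stay on fixed 2-hour bins and aggregate each variable (mean and last here);
# bins with no observations stay NaN, which the feature engineering can encode as missingness.
binned = (events
          .set_index("charttime")
          .groupby(["stay_id", "variable"])
          .resample("2h")["value"]
          .agg(["mean", "last"])
          .reset_index())
print(binned)
```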
This dataset enables reproducibility of the results and further research in developing machine learning models for early, interpretable, and actionable phenotyping in critical care.
https://dataintelo.com/privacy-and-policy
The global market size for Test Data Generation Tools was valued at USD 800 million in 2023 and is projected to reach USD 2.2 billion by 2032, growing at a CAGR of 12.1% during the forecast period. The surge in the adoption of agile and DevOps practices, along with the increasing complexity of software applications, is driving the growth of this market.
One of the primary growth factors for the Test Data Generation Tools market is the increasing need for high-quality test data in software development. As businesses shift towards more agile and DevOps methodologies, the demand for automated and efficient test data generation solutions has surged. These tools help reduce the time required for test data creation, thereby accelerating the overall software development lifecycle. Additionally, the rise in digital transformation across various industries has created the need for robust testing frameworks, further propelling market growth.
The proliferation of big data and the growing emphasis on data privacy and security are also significant contributors to market expansion. With the introduction of stringent regulations like GDPR and CCPA, organizations are compelled to ensure that their test data is compliant with these laws. Test Data Generation Tools that offer features like data masking and data subsetting are increasingly being adopted to address these compliance requirements. Furthermore, the increasing instances of data breaches have underscored the importance of using synthetic data for testing purposes, thereby driving the demand for these tools.
Another critical growth factor is the technological advancements in artificial intelligence and machine learning. These technologies have revolutionized the field of test data generation by enabling the creation of more realistic and comprehensive test data sets. Machine learning algorithms can analyze large datasets to generate synthetic data that closely mimics real-world data, thus enhancing the effectiveness of software testing. This aspect has made AI and ML-powered test data generation tools highly sought after in the market.
Regional outlook for the Test Data Generation Tools market shows promising growth across various regions. North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major software companies. Europe is also anticipated to witness significant growth owing to strict regulatory requirements and increased focus on data security. The Asia Pacific region is projected to grow at the highest CAGR, driven by rapid industrialization and the growing IT sector in countries like India and China.
Synthetic Data Generation has emerged as a pivotal component in the realm of test data generation tools. This process involves creating artificial data that closely resembles real-world data, without compromising on privacy or security. The ability to generate synthetic data is particularly beneficial in scenarios where access to real data is restricted due to privacy concerns or regulatory constraints. By leveraging synthetic data, organizations can perform comprehensive testing without the risk of exposing sensitive information. This not only ensures compliance with data protection regulations but also enhances the overall quality and reliability of software applications. As the demand for privacy-compliant testing solutions grows, synthetic data generation is becoming an indispensable tool in the software development lifecycle.
The Test Data Generation Tools market is segmented into software and services. The software segment is expected to dominate the market throughout the forecast period. This dominance can be attributed to the increasing adoption of automated testing tools and the growing need for robust test data management solutions. Software tools offer a wide range of functionalities, including data profiling, data masking, and data subsetting, which are essential for effective software testing. The continuous advancements in software capabilities also contribute to the growth of this segment.
In contrast, the services segment, although smaller in market share, is expected to grow at a substantial rate. Services include consulting, implementation, and support services, which are crucial for the successful deployment and management of test data generation tools. The increasing complexity of IT inf
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.
The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.
Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.
This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.
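As a hedged illustration of such preprocessing, the sketch below imputes missing laboratory values and flags simple outliers; the file name and the use of median/IQR rules are assumptions, not prescriptions from the dataset authors.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Placeholder file name; replace with the actual dataset file.
df = pd.read_csv("geotechnical_lab_tests.csv")
numeric = df.select_dtypes("number")

# Median imputation for missing laboratory values.
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(numeric),
                       columns=numeric.columns, index=numeric.index)

# Simple IQR-based outlier flag per feature.
q1, q3 = imputed.quantile(0.25), imputed.quantile(0.75)
iqr = q3 - q1
outliers = (imputed < q1 - 1.5 * iqr) | (imputed > q3 + 1.5 * iqr)
print(outliers.sum())  # number of flagged values per feature
```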
Key Features:
Technical Details:
Acknowledgments:
The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature selection performance of different approaches in simulated scenarios.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data are also included. A manuscript describing how to use this dataset in detail is available at https://www.nature.com/articles/sdata2017177.
Published research results from work in developing decision support systems in mammography are difficult to replicate due to the lack of a standard evaluation data set; most computer-aided diagnosis (CADx) and detection (CADe) algorithms for breast cancer in mammography are evaluated on private data sets or on unspecified subsets of public databases. Few well-curated public datasets have been provided for the mammography community. These include the DDSM, the Mammographic Imaging Analysis Society (MIAS) database, and the Image Retrieval in Medical Applications (IRMA) project. Although these public data sets are useful, they are limited in terms of data set size and accessibility.
For example, most researchers using the DDSM do not leverage all of its images for a variety of historical reasons. When the database was released in 1997, computational resources to process hundreds or thousands of images were not widely available. Additionally, the DDSM images are saved in non-standard compression files that require the use of decompression code that has not been updated or maintained for modern computers. Finally, the ROI annotations for the abnormalities in the DDSM were provided to indicate a general position of lesions, but not a precise segmentation for them. Therefore, many researchers must implement segmentation algorithms for accurate feature extraction, which makes it impossible to directly compare the performance of methods or to replicate prior results. The CBIS-DDSM collection addresses that challenge by publicly releasing a curated and standardized version of the DDSM for the evaluation of future CADx and CADe (sometimes referred to generally as CAD) research in mammography.
Please note that the image data for this collection is structured such that each participant has multiple patient IDs. For example, participant 00038 has 10 separate patient IDs which provide information about the scans within the IDs (e.g. Calc-Test_P_00038_LEFT_CC, Calc-Test_P_00038_RIGHT_CC_1). This makes it appear as though there are 6,671 patients according to the DICOM metadata, but there are only 1,566 actual participants in the cohort.
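A short hedged sketch of recovering the participant count from the DICOM patient IDs; the ID pattern follows the examples above, and the third ID is a hypothetical illustration.

```python
import re

# Patient IDs embed the participant number after "P_", as in the examples above.
patient_ids = ["Calc-Test_P_00038_LEFT_CC",
               "Calc-Test_P_00038_RIGHT_CC_1",
               "Mass-Training_P_01234_LEFT_MLO"]  # hypothetical third example

participants = {re.search(r"P_(\d+)", pid).group(1) for pid in patient_ids}
print(len(participants), "actual participants from", len(patient_ids), "patient IDs")
```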
For scientific and other inquiries about this dataset, please contact TCIA's Helpdesk.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The video features for the Ego4D Natural Language Queries subset, containing the training, validation, and testing sets.
The features were extracted with VideoMAE-L pretrained on the Ego4D-Verb subset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the real datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use database for multimodal fusion [1].
The written digits database is the original MNIST handwritten digits database [2] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [3], an audio dataset of spoken words proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances are of the ten digits (34801 for training and 4107 for test). Pre-processing was done by extracting the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and a frame shift of 25 ms. Since the speech samples are approximately 1 s long, this yields 39 time slots. For each slot, 12 MFCC coefficients are extracted together with an additional energy coefficient, giving a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied to the MFCC features.
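A hedged sketch of an equivalent MFCC extraction using python_speech_features; the file path is a placeholder and the padding step is an assumption about how exactly 39 frames are obtained.

```python
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

# Load one ~1 s Speech Commands utterance (path is a placeholder).
rate, signal = wavfile.read("speech_commands/three/0a7c2a8d_nohash_0.wav")

# 50 ms windows, 25 ms shift, 12 cepstral coefficients + log energy = 13 per frame.
feats = mfcc(signal, samplerate=rate, winlen=0.05, winstep=0.025,
             numcep=13, nfft=1024, appendEnergy=True)   # shape approx. (39, 13)

# Pad/trim to exactly 39 frames and flatten to the 39 x 13 = 507-dimensional vector.
feats = np.pad(feats, ((0, max(0, 39 - len(feats))), (0, 0)))[:39]
vector = feats.flatten()
print(vector.shape)  # (507,)
```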
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [2] and [3] for the training and test subsets. Since there are fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits, resulting in a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
The dataset is provided in six files, as described below. If a shuffle is performed on the training or test subsets, it must therefore be performed in unison, with the same order, for the written digits, spoken digits, and labels.
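For example, a unison shuffle can be performed with one shared permutation; the file names below are placeholders for the actual six files listed next.

```python
import numpy as np

# Placeholder file names for the three training arrays.
written = np.load("written_train.npy")   # (60000, 784)
spoken  = np.load("spoken_train.npy")    # (60000, 507)
labels  = np.load("labels_train.npy")    # (60000,)

# One permutation applied to all three arrays keeps the modalities and labels aligned.
perm = np.random.default_rng(0).permutation(len(labels))
written, spoken, labels = written[perm], spoken[perm], labels[perm]
```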
Files:
References:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This benchmark dataset serves as the demonstration data for testing PyKGML, the Python library for the efficient development of knowledge-guided machine learning (KGML) models.
The dataset was developed using agroecosystem data from the following two KGML studies:
1. "KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N2O Emission using Data from Mesocosm Experiments".
Licheng Liu, Shaoming Xu, Zhenong Jin*, Jinyun Tang, Kaiyu Guan, Timothy J. Griffis, Matt D. Erickson, Alexander L. Frie, Xiaowei Jia, Taegon Kim, Lee T. Miller, Bin Peng, Shaowei Wu, Yufeng Yang, Wang Zhou, Vipin Kumar.
2. "Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems".
Licheng Liu, Wang Zhou, Kaiyu Guan, Bin Peng, Shaoming Xu, Jinyun Tang, Qing Zhu, Jessica Till, Xiaowei Jia, Chongya Jiang, Sheng Wang, Ziqi Qin, Hui Kong, Robert Grant, Symon Mezbahuddin, Vipin Kumar, Zhenong Jin.
- X_train, X_test: Feature matrices for training and testing.
- Y_train, Y_test: Target values for training and testing.
- x_scaler: The scaler (mean, std) used for normalizing input features.
- y_scaler: The scaler (mean, std) used for normalizing output features.
- input_features: A list of input feature names.
- output_features: A list of output feature names.

Please download and use the latest version of this dataset, as it contains important updates.
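A minimal usage sketch, assuming the fields above are stored in a single pickled dictionary; the file name and container format are assumptions, so refer to the PyKGML documentation for the actual loading code.

```python
import pickle

# Assumed container: a pickled dict holding the fields listed above.
with open("kgml_benchmark.pkl", "rb") as f:
    data = pickle.load(f)

X_train, X_test = data["X_train"], data["X_test"]
Y_train, Y_test = data["Y_train"], data["Y_test"]
x_mean, x_std = data["x_scaler"]   # scaler stored as (mean, std)
y_mean, y_std = data["y_scaler"]

# Normalize inputs and targets with the provided scalers.
X_train_norm = (X_train - x_mean) / x_std
Y_train_norm = (Y_train - y_mean) / y_std
print(data["input_features"], data["output_features"])
```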
Contact: Dr. Licheng Liu (lichengl@umn.edu), Dr. Yufeng Yang (yang6956@umn.edu)
This dataset is designed for Visual Geo-Localization (VG), also known as Visual Place Recognition (VPR). The task involves determining the geographic location of a given image by retrieving the most visually similar images from a database. This dataset provides a diverse collection of urban images, enabling researchers and practitioners to train and evaluate geo-localization models under challenging real-world conditions.
This dataset consists of images curated for training and evaluation of visual geo-localization models. The data is drawn from multiple sources to ensure diversity in lighting conditions, perspectives, and geographical contexts.
This dataset is ideal for:
✅ Training and testing deep learning models for visual geo-localization.
✅ Studying the impact of lighting, perspective, and cultural diversity on place recognition.
✅ Benchmarking retrieval-based localization methods.
✅ Exploring feature extraction techniques for geo-localization tasks.
If you find this dataset useful, please consider citing it in your research or giving it an upvote on Kaggle! 🚀
https://academictorrents.com/nolicensespecified
The Inria Aerial Image Labeling dataset addresses a core topic in remote sensing: the automatic pixelwise labeling of aerial imagery. Dataset features:
- Coverage of 810 km² (405 km² for training and 405 km² for testing)
- Aerial orthorectified color imagery with a spatial resolution of 0.3 m
- Ground truth data for two semantic classes: building and not building (publicly disclosed only for the training subset)
The images cover dissimilar urban settlements, ranging from densely populated areas (e.g., San Francisco’s financial district) to alpine towns (e.g., Lienz in Austrian Tyrol). Instead of splitting adjacent portions of the same images into the training and test subsets, different cities are included in each of the subsets. For example, images over Chicago are included in the training set (and not in the test set) and images over San Francisco are included in the test set (and not in the training set). The ultimate goal of this dataset is to assess the generalization power of the techniqu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the simulation data.
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to the MSR 2022 Data Showcase Track.

The datasets are available under the directory dataset. There are 4 datasets in this directory:
1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier, and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced, to represent a real-life scenario in JIT model evaluation where the model is trained on historical data and applied to future data without any modification.
4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.

In addition to the dataset, we also provide the scripts with which we built the dataset. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. An installation guide and more details can be found here.

The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset. More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:
1. GumTree: https://github.com/GumTreeDiff/gumtree. Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014, 313-324.
2. PyDriller: https://pydriller.readthedocs.io/en/latest/. Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.
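A hedged usage sketch for the CSV files above; only the column buggy is documented in this description, so the remaining numeric columns are simply treated as the commit metrics, and the model choice is illustrative.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

train = pd.read_csv("dataset/apachejit_train.csv")
test = pd.read_csv("dataset/apachejit_test_large.csv")

# 'buggy' is the documented label; use the remaining numeric columns as commit metrics.
features = [c for c in train.columns
            if c != "buggy" and pd.api.types.is_numeric_dtype(train[c])]

clf = GradientBoostingClassifier(random_state=0).fit(train[features], train["buggy"])
print("AUROC:", roc_auc_score(test["buggy"], clf.predict_proba(test[features])[:, 1]))
```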
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A-F are E. coli K12, E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029, and S. agalactiae 2603, respectively. “Yes” indicates that the corresponding feature is included in the optimal feature subset.