Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries".
The appendix contains:
The queries from the three open source systems we used in the evaluation of our tool (the industrial software system is not part of this appendix for privacy reasons).
The results of our evaluation.
The source code of the tool. The most recent version can be found at https://github.com/SERG-Delft/evosql.
The results of the tuning procedure we conducted before running the final evaluation.
https://www.archivemarketresearch.com/privacy-policy
The global synthetic data tool market is projected to reach USD 10,394.0 million by 2033, exhibiting a CAGR of 34.8% during the forecast period. The growing adoption of AI and ML technologies, increasing demand for data privacy and security, and the rising need for data for training and testing machine learning models are the key factors driving market growth. Additionally, the availability of open-source synthetic data generation tools and the increasing adoption of cloud-based synthetic data platforms are further contributing to market growth. North America is expected to hold the largest market share during the forecast period due to the early adoption of AI and ML technologies and the presence of key vendors in the region. Europe is anticipated to witness significant growth due to increasing government initiatives to promote AI adoption and the growing data privacy concerns. The Asia Pacific region is projected to experience rapid growth due to government initiatives to develop AI capabilities and the increasing adoption of AI and ML technologies in various industries, namely healthcare, retail, and manufacturing.
https://dataintelo.com/privacy-and-policy
The open source performance testing market has seen a considerable rise in its market size, with global figures indicating an impressive growth trajectory. In 2023, the market was valued at approximately USD 1.2 billion and is projected to reach around USD 3.5 billion by 2032, growing at a robust compound annual growth rate (CAGR) of 12.5%. This growth is primarily driven by the increasing need for businesses to ensure seamless operational capabilities and the rising complexity of software systems that necessitate efficient performance evaluation tools. As organizations across various sectors are increasingly relying on digital platforms, the demand for performance testing tools is expected to witness a substantial surge.
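Reported figures like these imply a simple compound-growth relationship between the start value, the end value, and the CAGR, which can be checked directly. The short sketch below applies the standard CAGR formula to the values quoted above; it is a generic sanity check, not part of the cited report.

def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    # Compound annual growth rate implied by growing start_value to end_value over `years` years.
    return (end_value / start_value) ** (1 / years) - 1

# Values quoted above: roughly USD 1.2 billion in 2023 to USD 3.5 billion by 2032 (9 years).
print(f"{implied_cagr(1.2, 3.5, 2032 - 2023):.1%}")  # ~12.6%, consistent with the stated 12.5% CAGR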
The growth of the open source performance testing market can be attributed to several key factors. Firstly, the growing adoption of digital transformation initiatives across industries has led to a surge in the deployment of software solutions that require rigorous performance testing. Organizations are increasingly realizing the importance of maintaining optimal performance for their applications to enhance user experience and ensure customer satisfaction. Additionally, the cost-effectiveness of open source tools compared to proprietary software solutions has made them an attractive choice for businesses, especially small and medium enterprises (SMEs) that may have budget constraints. Moreover, the collaborative nature of open source development communities has fostered innovation and rapid advancements in performance testing tools, further propelling market growth.
Another significant factor contributing to the expansion of this market is the increasing complexity of IT infrastructure, which includes cloud computing, IoT devices, and microservices architectures. These advancements in technology have made performance testing more challenging and essential than ever before. As organizations adopt more complex and distributed systems, the need for robust performance testing solutions becomes critical to identify bottlenecks and ensure system reliability. Open source performance testing tools offer the flexibility and scalability required to cater to these evolving demands, making them a preferred choice for organizations worldwide. Furthermore, the growing emphasis on agile and DevOps methodologies has accelerated the adoption of continuous testing practices, where open source tools play a vital role in enabling seamless integration into the software development lifecycle.
From a regional perspective, the growth dynamics of the open source performance testing market show variation across different parts of the world. North America has emerged as a dominant player in this market, driven by the early adoption of advanced technologies and the presence of major technology companies. The region's robust IT infrastructure and a strong focus on technological innovation have contributed significantly to market growth. Meanwhile, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This can be attributed to the rapid digitization across various industries, increasing investments in IT infrastructure, and the rising trend of cloud adoption. Additionally, Europe also presents significant growth opportunities due to the increasing emphasis on digitalization and stringent regulatory requirements concerning software performance and security.
In the realm of software development, Test Data Generation Tools have become indispensable for ensuring the accuracy and reliability of performance testing. These tools enable developers to create realistic test data that mimics real-world scenarios, allowing for comprehensive evaluation of software applications. By simulating various user interactions and data inputs, Test Data Generation Tools help identify potential issues and optimize application performance. As organizations increasingly adopt agile and DevOps methodologies, the integration of these tools into the testing process has become crucial for maintaining high-quality software. The ability to generate diverse and complex data sets not only enhances the effectiveness of performance testing but also aids in uncovering hidden defects that could impact user experience.
In the open source performance testing market, the tool type segment is a critical component that defines the specific applications and functionalities of these testing platforms. Load testing t
https://dataintelo.com/privacy-and-policy
The global market size of open-source big data tools was valued at approximately USD 17.5 billion in 2023 and is projected to reach an estimated USD 85.6 billion by 2032, growing at a compound annual growth rate (CAGR) of 19.4% during the forecast period. This remarkable growth can be attributed to factors such as the increasing proliferation of data, the rising adoption of big data analytics across various industries, and the cost-effectiveness and flexibility offered by open-source solutions.
One of the primary growth factors driving the open-source big data tools market is the exponential increase in data generation from various sources, including social media, IoT devices, and enterprise databases. Organizations are increasingly recognizing the value of data-driven decision-making, which necessitates robust and scalable data management and analytics tools. Open-source big data tools provide the necessary capabilities to manage, process, and analyze vast volumes of data, thereby enabling organizations to gain actionable insights and make informed decisions.
Another significant factor contributing to the market growth is the cost-effectiveness and flexibility of open-source solutions. Unlike proprietary software, open-source big data tools are generally available for free and offer the flexibility to customize and scale according to specific organizational needs. This makes them particularly attractive to small and medium enterprises (SMEs) that may have limited budgets but still require powerful data analytics capabilities. Additionally, the collaborative nature of open-source communities ensures continuous innovation and improvement of these tools, further enhancing their value proposition.
The increasing adoption of cloud-based solutions is also playing a pivotal role in the growth of the open-source big data tools market. Cloud platforms provide the necessary infrastructure to deploy big data tools efficiently while offering scalability, cost savings, and ease of access. Organizations are increasingly opting for cloud-based deployments to leverage these benefits, which in turn drives the demand for open-source big data tools that are compatible with cloud environments. The ongoing digital transformation initiatives across various industries are further propelling this trend.
Hadoop Related Software plays a crucial role in the open-source big data tools ecosystem. As a foundational technology, Hadoop provides the framework for storing and processing large datasets across distributed computing environments. Its ability to handle vast amounts of data efficiently makes it an integral part of many big data strategies. Organizations leverage Hadoop's capabilities to build scalable data architectures that support complex analytics tasks. The ecosystem around Hadoop has expanded significantly, with numerous related software solutions enhancing its functionality. These include tools for data ingestion, processing, and visualization, which together create a comprehensive platform for big data analytics. The continuous evolution and support from the open-source community ensure that Hadoop and its related software remain at the forefront of big data innovations.
Regionally, North America dominates the open-source big data tools market, driven by the presence of major technology companies, early adoption of advanced technologies, and significant investments in big data analytics. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, supported by the rapid digitalization, increasing internet penetration, and growing awareness about the benefits of data analytics in countries like China and India. Europe also holds a substantial market share due to stringent data protection regulations and the increasing focus on data-driven decision-making in various industries.
The open-source big data tools market by component is segmented into software and services. The software segment encompasses a wide array of tools designed for data integration, data storage, data processing, and data analytics. These tools include popular open-source platforms such as Apache Hadoop, Apache Spark, and MongoDB, which have gained widespread adoption due to their robustness, scalability, and community support. The software segment is expected to maintain a dominant position in the market, driven by continuous innovation and the increasing complexity of data man
https://www.marketresearchforecast.com/privacy-policy
The open-source big data tools market is experiencing robust growth, driven by the increasing need for scalable, cost-effective data management and analysis solutions across diverse industries. The market's expansion is fueled by several key factors. Firstly, the rising volume and velocity of data generated by businesses necessitate powerful tools capable of handling massive datasets efficiently. Open-source options provide a compelling alternative to proprietary solutions, offering flexibility, customization, and community support without the high licensing costs associated with commercial software. This is particularly attractive to smaller companies and startups with limited budgets. Secondly, advancements in cloud computing have made it easier to deploy and manage open-source big data tools, further lowering the barrier to entry and expanding the market's reach. Finally, a growing pool of skilled developers and a vibrant community contribute to the continuous improvement and innovation of these tools, ensuring they remain competitive with their commercial counterparts. We estimate the 2025 market size to be approximately $15 billion, based on observable market trends in related technologies and considering a reasonable CAGR. The market segmentation reveals significant opportunities across various application sectors. The banking, manufacturing, and consultancy sectors are leading adopters, leveraging open-source tools for advanced analytics, fraud detection, risk management, and supply chain optimization. Government agencies are increasingly adopting these tools for data-driven policymaking and citizen services. Furthermore, the diverse range of tools – encompassing data collection, storage, analysis, and language processing capabilities – caters to a broad spectrum of user needs. While the market faces challenges such as integration complexities and the need for skilled professionals to manage and maintain these systems, the overall trend points toward sustained, rapid growth over the next decade. Geographic growth is expected to be strongest in regions with burgeoning digital economies and increasing data generation, particularly in Asia-Pacific and North America. This consistent demand, coupled with ongoing technological improvements, is poised to propel the market to even greater heights in the coming years.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as “model hubs” support engineers in distributing and reusing deep learning models. PTM packages include pre-trained weights, documentation, model architectures, datasets, and metadata. Mining the information in PTM packages will enable the discovery of engineering phenomena and tools to support software engineers. However, accessing this information is difficult — there are many PTM registries, and both the registries and the individual packages may have rate limiting for accessing the data.
We present an open-source dataset, PTMTorrent, to facilitate the evaluation and understanding of PTM packages. This paper describes the creation, structure, usage, and limitations of the dataset. The dataset includes a snapshot of 5 model hubs and a total of 15,913 PTM packages. These packages are represented in a uniform data schema for cross-hub mining. We describe prior uses of this data and suggest research opportunities for mining using our dataset.
We provide links to the PTM Dataset and PTM Torrent Source Code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribute Information
test case generation; unit testing; search-based software engineering; benchmark
Paper Abstract
Several promising techniques have been proposed to automate different tasks in software testing, such as test data generation for object-oriented software. However, reported studies in the literature only show the feasibility of the proposed techniques, because the choice of the employed artifacts in the case studies (e.g., software applications) is usually done in a non-systematic way. The chosen case study might be biased, and so it might not be a valid representative of the addressed type of software (e.g., internet applications and embedded systems). The common trend seems to be to accept this fact and get over it by simply discussing it in a threats to validity section. In this paper, we evaluate search-based software testing (in particular the EvoSuite tool) when applied to test data generation for open source projects. To achieve sound empirical results, we randomly selected 100 Java projects from SourceForge, which is the most popular open source repository (more than 300,000 projects with more than two million registered users). The resulting case study not only is very large (8,784 public classes for a total of 291,639 bytecode level branches), but more importantly it is statistically sound and representative for open source projects. Results show that while high coverage on commonly used types of classes is achievable, in practice environmental dependencies prohibit such high coverage, which clearly points out essential future research directions. To support this future research, our SF100 case study can serve as a much needed corpus of classes for test generation.
https://www.datainsightsmarket.com/privacy-policy
The AI code generation market is experiencing explosive growth, driven by the increasing demand for efficient and cost-effective software development. The market, estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 35% from 2025 to 2033, reaching an estimated $15 billion by 2033. This rapid expansion is fueled by several key factors. Firstly, the rising complexity of software projects necessitates tools that enhance developer productivity and reduce development time. Secondly, the growing adoption of AI/ML across industries creates a surge in demand for developers skilled in these technologies; AI code generators can help bridge the skills gap. Thirdly, advancements in deep learning and natural language processing are continuously improving the accuracy and capabilities of these tools, making them increasingly valuable to developers of all skill levels. Finally, the proliferation of cloud-based platforms facilitates easy access and integration of these tools into existing development workflows. Several market trends are shaping the landscape. The increasing focus on developer experience (DX) is leading to the development of more user-friendly and intuitive AI code generators. Integration with popular Integrated Development Environments (IDEs) like VS Code and PyCharm is becoming standard, streamlining the development process. The rise of open-source AI models is fostering innovation and expanding accessibility, further fueling the market's expansion. However, challenges remain, including concerns about code security and intellectual property, as well as the need for robust data privacy and security measures within AI code generation tools. The market is segmented by type (cloud-based, on-premise), application (web development, mobile development, data science), and organization size (SMEs, enterprises). Key players, including GitHub Copilot, Tabnine, and Replit GhostWriter, are aggressively competing to dominate market share through continuous innovation and strategic partnerships. Future growth will be influenced by the continuous improvement of AI algorithms, wider industry adoption, and the addressing of security and ethical concerns.
https://dataintelo.com/privacy-and-policy
The global big data market size was valued at approximately USD 162 billion in 2023 and is expected to reach an impressive USD 450 billion by 2032, with a compound annual growth rate (CAGR) of 12.5% from 2024 to 2032. This robust growth is driven by the increasing volume of data generated across various sectors and the growing need for data analytics to drive business decisions. The proliferation of Internet of Things (IoT) devices, advancements in artificial intelligence (AI), and the rising adoption of data-driven decision-making processes are major factors contributing to this expansion.
One of the primary growth factors in the big data market is the exponential increase in data generation from various sources, including social media, sensors, digital platforms, and enterprise applications. The data explosion necessitates advanced analytics solutions to extract actionable insights, driving the demand for big data technologies. Additionally, the advent of 5G technology is expected to further amplify data generation, thereby fueling the need for efficient data management and analytics solutions. Organizations are increasingly recognizing the value of big data in enhancing customer experience, optimizing operations, and driving innovation.
Another significant driver is the growing adoption of cloud-based big data solutions. Cloud computing offers scalable, cost-effective, and flexible data storage and processing capabilities, making it an attractive option for organizations of all sizes. The shift towards cloud infrastructure has enabled businesses to manage and analyze vast amounts of data more efficiently, leading to increased demand for cloud-based big data analytics solutions. Moreover, the integration of big data with emerging technologies such as AI, machine learning, and blockchain is creating new opportunities for market growth.
The increasing focus on regulatory compliance and data security is also propelling the big data market. Organizations are required to comply with stringent data protection regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). These regulations necessitate robust data management and governance frameworks, driving the adoption of big data solutions. Furthermore, the rising incidents of cyber threats and data breaches are compelling businesses to invest in advanced data security solutions, contributing to market growth.
Regionally, North America is expected to dominate the big data market due to the presence of major technology companies, high adoption of advanced technologies, and significant investments in data analytics solutions. The Asia Pacific region is anticipated to witness the highest growth rate, driven by the rapid digital transformation, increasing internet penetration, and growing adoption of big data analytics across various industries. Europe is also expected to contribute significantly to market growth, supported by the strong emphasis on data privacy and security regulations.
The big data market is segmented by components into software, hardware, and services. The software segment holds the largest share, driven by the increasing demand for data management and analytics solutions. Big data software solutions, including data integration, data visualization, and business intelligence, are essential for extracting valuable insights from vast amounts of data. The rising adoption of AI and machine learning algorithms in big data analytics is further boosting the demand for advanced software solutions. Additionally, the emergence of open-source big data platforms is providing cost-effective options for organizations, contributing to market growth.
The hardware segment is also witnessing significant growth, primarily due to the increasing need for high-performance computing infrastructure to handle large datasets. As data volumes continue to surge, organizations are investing in advanced servers, storage systems, and networking equipment to support their big data initiatives. The proliferation of IoT devices and the consequent rise in data generation are further driving the demand for robust hardware solutions. Furthermore, the development of edge computing technologies is enabling real-time data processing closer to the source, enhancing the efficiency of big data analytics.
The services segment, encompassing consulting, implementation, and maintenance services, is experiencing substantial growth as well. Organizations often require expert guidance and support to navigate the comp
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Digital data from the political sphere is abundant, omnipresent, and more and more directly accessible through the Internet. Project Vote Smart (PVS) is a prominent example of this big public data and covers various aspects of U.S. politics in astonishing detail. Despite the vast potential of PVS’ data for political science, economics, and sociology, it is hardly used in empirical research. The systematic compilation of semi-structured data can be complicated and time consuming as the data format is not designed for conventional scientific research. This paper presents a new tool that makes the data easily accessible to a broad scientific community. We provide the software called pvsR as an add-on to the R programming environment for statistical computing. This open source interface (OSI) serves as a direct link between a statistical analysis and the large PVS database. The free and open code is expected to substantially reduce the cost of research with PVS’ new big public data in a vast variety of possible applications. We discuss its advantages vis-à-vis traditional methods of data generation as well as already existing interfaces. The validity of the library is documented based on an illustration involving female representation in local politics. In addition, pvsR facilitates the replication of research with PVS data at low costs, including the pre-processing of data. Similar OSIs are recommended for other big public databases.
The NIST BGP RPKI IO framework (BRIO) is a test-tool-only subset of the BGP-SRx Framework. It is an open source implementation and test platform that allows the synthetic generation of test data for emerging BGP security extensions such as RPKI Origin Validation, BGPSec Path Validation, and ASPA validation. BRIO is designed so that it allows the creation of stand-alone testbeds, loaded with freely configurable scenarios, for studying secure BGP implementations, and it provides a broad range of functionality as a result.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and the representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.
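The bounding-box convention described above (a constant-size box centred on the thorax, adjusted at the image borders) is straightforward to reproduce; the following is a minimal Python sketch of that convention. The box size and image dimensions are placeholders, not values taken from the dataset.

def centered_bbox(cx, cy, box_size, img_w, img_h):
    # Constant-size bounding box around a centre point, clipped at the image borders.
    # Returns (x_min, y_min, x_max, y_max) in pixel coordinates.
    half = box_size / 2.0
    return (max(0.0, cx - half), max(0.0, cy - half),
            min(float(img_w), cx + half), min(float(img_h), cy + half))

# Example with placeholder values: a 64 px box in a 1024 x 1024 px frame.
print(centered_bbox(cx=20, cy=500, box_size=64, img_w=1024, img_h=1024))
# (0.0, 468.0, 52.0, 532.0) -- the box is truncated at the left image border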
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets that contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio of real to synthetic images across the five datasets varied from 10/1 to 1/100.
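A “mixed” dataset of this kind can be assembled by sampling from the real and synthetic image pools in the desired ratio. The sketch below is a generic illustration assuming the images are available as two lists of file paths; it is not the script used to build the published datasets.

import random

def mix_datasets(real_paths, synthetic_paths, ratio_real, ratio_synth, seed=0):
    # Draw images so that real and synthetic samples appear in roughly ratio_real : ratio_synth,
    # limited by whichever pool is exhausted first, then shuffle the combined list.
    rng = random.Random(seed)
    units = min(len(real_paths) // ratio_real, len(synthetic_paths) // ratio_synth)
    mixed = rng.sample(real_paths, units * ratio_real) + rng.sample(synthetic_paths, units * ratio_synth)
    rng.shuffle(mixed)
    return mixed

# Example with placeholder file names and a 1:100 real-to-synthetic ratio (one of the ratios above).
real = [f"real_{i:04d}.png" for i in range(5000)]
synth = [f"synth_{i:05d}.png" for i in range(10000)]
print(len(mix_datasets(real, synth, ratio_real=1, ratio_synth=100)))  # 100 real + 10000 synthetic = 10100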
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
SEPAL (https://sepal.io/) is a free and open source cloud computing platform for geo-spatial data access and processing. It empowers users to quickly process large amounts of data on their computer or mobile device. Users can create custom analysis ready data using freely available satellite imagery, generate and improve land use maps, analyze time series, run change detection and perform accuracy assessment and area estimation, among many other functionalities in the platform. Data can be created and analyzed for any place on Earth using SEPAL.
Figure 1: Best pixel mosaic of Landsat 8 data for 2020 over Cambodia (https://data.apps.fao.org/catalog/dataset/9c4d7c45-7620-44c4-b653-fbe13eb34b65/resource/63a3efa0-08ab-4ad6-9d4a-96af7b6a99ec/download/cambodia_mosaic_2020.png)
SEPAL reaches over 5000 users in 180 countries for the creation of custom data products from freely available satellite data. SEPAL was developed as a part of the Open Foris suite, a set of free and open source software platforms and tools that facilitate flexible and efficient data collection, analysis and reporting. SEPAL combines and integrates modern geospatial data infrastructures and supercomputing power available through Google Earth Engine and Amazon Web Services with powerful open-source data processing software, such as R, ORFEO, GDAL, Python and Jupyter Notebooks. Users can easily access the archive of satellite imagery from NASA and the European Space Agency (ESA), as well as high spatial and temporal resolution data from Planet Labs, and turn such images into data that can be used for reporting and better decision making.
National Forest Monitoring Systems in many countries have been strengthened by SEPAL, which provides technical government staff with computing resources and cutting edge technology to accurately map and monitor their forests. The platform was originally developed for monitoring forest carbon stock and stock changes for reducing emissions from deforestation and forest degradation (REDD+). The application of the tools on the platform now reach far beyond forest monitoring by providing different stakeholders access to cloud based image processing tools, remote sensing and machine learning for any application. Presently, users work on SEPAL for various applications related to land monitoring, land cover/use, land productivity, ecological zoning, ecosystem restoration monitoring, forest monitoring, near real time alerts for forest disturbances and fire, flood mapping, mapping impact of disasters, peatland rewetting status, and many others.
The Hand-in-Hand initiative enables countries that generate data through SEPAL to disseminate their data widely through the platform and to combine their data with the numerous other datasets available through Hand-in-Hand.
Figure 2: Image classification module for land monitoring and mapping. Probability classification over Zambia (https://data.apps.fao.org/catalog/dataset/9c4d7c45-7620-44c4-b653-fbe13eb34b65/resource/868e59da-47b9-4736-93a9-f8d83f5731aa/download/probability_classification_over_zambia.png)
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.
Data Generation
The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on github. ProPublica’s gathering the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.
Data Analysis
The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors.
Several more specific types of discursive strategies were of interest in attracting further critical examination:
Testing claims and rationalizations that appear to serve the speaker’s self-interest
Examining conclusions and determining whether sufficient evidence supported them
Revealing contradictions and/or inconsistencies within the same text and intertextually
Assessing strategies underlying justifications and rationalizations used to promote a party’s assertions and arguments
Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
Judging sincerity of voice and the objective consideration of alternative perspectives
Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, to uncover facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet the significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. Then, the availability of the same dataset used by the parties in conflict made this opportunity more appealing. Calculating additional algorithmic equity equations would not thereby be troubled by irregularities because of diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.
Logic of Annotation
Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
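The z-test comparisons of proportions mentioned above were computed with online calculators; the same comparison can be reproduced with a few lines of Python. The sketch below implements a standard two-sided, two-proportion z-test; the counts are invented placeholders, not figures from the study.

from math import sqrt, erfc

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    # Two-sided z-test for the difference between two independent proportions (pooled standard error).
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value under the standard normal
    return z, p_value

# Placeholder counts, not values from the COMPAS dataset.
z, p = two_proportion_z_test(805, 1750, 640, 1620)
print(f"z = {z:.2f}, p = {p:.4f}")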
https://www.datainsightsmarket.com/privacy-policy
The data annotation and labeling tools market is experiencing robust growth, driven by the escalating demand for high-quality training data in the burgeoning fields of artificial intelligence (AI) and machine learning (ML). The market's expansion is fueled by the increasing adoption of AI across diverse sectors, including autonomous vehicles, healthcare, and finance. These industries require vast amounts of accurately labeled data to train their AI models, leading to a significant surge in the demand for efficient and scalable annotation tools. While precise market sizing for 2025 is unavailable, considering a conservative estimate and assuming a CAGR of 25% (a reasonable figure given industry growth), we can project a market value exceeding $2 billion in 2025, rising significantly over the forecast period (2025-2033). Key trends include the growing adoption of cloud-based solutions, increased automation in the annotation process through AI-assisted tools, and a heightened focus on data privacy and security. The rise of synthetic data generation is also beginning to impact the market, offering potential cost savings and improved data diversity. However, challenges remain. The high cost of skilled annotators, the need for continuous quality control, and the inherent complexities of labeling diverse data types (images, text, audio, video) pose significant restraints on market growth. While leading players like Labelbox, Scale AI, and SuperAnnotate dominate the market with advanced features and robust scalability, smaller companies and open-source tools continue to compete, often focusing on niche applications or offering cost-effective alternatives. The competitive landscape is dynamic, with continuous innovation and mergers and acquisitions shaping the future of this rapidly evolving market. Regional variations in adoption are also expected, with North America and Europe likely leading the market, followed by Asia-Pacific and other regions. This continuous evolution necessitates careful strategic planning and adaptation for businesses operating in or considering entry into this space.
Australia Software Testing Services Market Size 2025-2029
The Australia software testing services market size is forecast to increase by USD 1.7 billion, at a CAGR of 12.3% between 2024 and 2029.
The Software Testing Services Market in Australia is driven by the increasing need for cost reduction and faster time-to-market in the software development industry. This demand is fueled by the competitive business landscape, where companies strive to release high-quality software quickly to gain a competitive edge. Another significant trend in the market is the evolution of software testing labs, which offer specialized testing services and advanced testing tools to ensure software functionality, reliability, and security. However, the market also faces challenges, such as the availability of open-source and free testing tools that can potentially reduce the demand for paid testing services. Predictive analytics and test results analysis are driving test strategy development, enabling proactive identification and resolution of issues.
Additionally, the increasing complexity of software applications and the need for continuous testing pose significant challenges for testing service providers. Companies must adapt to these trends and challenges by offering value-added services, leveraging advanced testing tools, and focusing on providing expert testing capabilities to differentiate themselves in the market. By doing so, they can capitalize on the growing demand for software testing services and effectively navigate the competitive landscape.
What will be the size of the Australia Software Testing Services Market during the forecast period?
Request Free Sample
The software testing services market in Australia encompasses various offerings, including software testing consulting, test automation expertise, and quality assurance audits. Certifications in software testing methodologies and test automation frameworks are increasingly valued, as businesses prioritize quality assurance metrics and adherence to software quality standards. Mobile app testing and security vulnerability scanning are crucial components of modern testing practices, with test execution management and test reporting tools streamlining processes. Quality assurance professionals employ test planning, test design techniques, and test case management to ensure comprehensive coverage.
Performance monitoring, cloud-based testing, and test data generation are essential for maintaining optimal software functionality. AI-powered testing and test automation platforms are transforming the industry, offering advanced capabilities in test automation frameworks and test environment provisioning. Defect tracking systems facilitate efficient issue resolution, while test results analysis and quality assurance audits ensure continuous improvement.
How is this market segmented?
The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Product
Application testing
Product testing
End-user
BFSI
Telecom and media
Manufacturing
Retail
Others
Deployment
Cloud-based
On-premises
Service Type
Manual testing
Automated testing
Performance testing
Security testing
Geography
APAC
Australia
By Product Insights
The application testing segment is estimated to witness significant growth during the forecast period. Application testing is an essential process in ensuring the functionality, consistency, and usability of software applications. Three primary types of applications – desktop, mobile, and web – require testing for various reasons. Web applications undergo testing for business logic, application integrity, functionality, data flow, and hardware and software compatibility. Performance, security, and load testing are crucial for web applications, along with cross-browser testing, beta testing, compatibility testing, exploratory testing, regression testing, multilanguage support testing, and stress testing. Mobile application software testing includes UI testing, security testing, functionality and compatibility testing, and regression testing. Three testing methodologies – black box, white box, and grey box – are used to test applications.
Black box testing focuses on the application's external behavior, while white box testing examines the internal structure and workings. Grey box testing combines elements of both, providing a more comprehensive testing approach. Moreover, the software development lifecycle integrates various testing types, including load testing, integration testing, test analysis, test management tools, bug tracking, test data management, test reporting, usability testing, automation testing, test automation frameworks, u
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents), together with data from their execution in a state-of-the-art, physically accurate driving simulator called BeamNG.tech. Virtual roads consist of sequences of road points interpolated using cubic splines.
Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze them in the context of test regression. We focus on test selection and test prioritization, given their importance for developing high-quality software following the DevOps paradigms.
This dataset builds on top of our previous work in this area, including work on
test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021),
test selection: SDC-Scissor and related tool
test prioritization: automated test cases prioritization work for SDCs.
Dataset Overview
The TRAVEL dataset is available under the data folder and is organized as a set of experiments folders. Each of these folders is generated by running the test-generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on generated tests (generation_stats.csv) and found faults (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json).
The following sections describe what each of those files contains.
Experiment Description
The experiment_description.csv contains the settings used to generate the data, including:
Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.
The size of the map. The size of the squared map, in meters, defines the boundaries inside which the virtual roads develop.
The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated testing the BeamNG.AI and the end-to-end Dave2 systems.
The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms, for generating tests.
The speed limit. The maximum speed at which the driving agent under test can travel.
Out of Bound (OOB) tolerance. The test cases' oracle that defines the tolerable amount of the ego-car that can lie outside the lane boundaries. This parameter ranges between 0.0 and 1.0. In the former case, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; in the latter case, a test failure triggers only if the entire body of the ego-car falls outside the lane.
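In other words, the oracle compares the observed out-of-bound fraction of the ego-car against the configured tolerance. The following minimal Python sketch illustrates that semantics as described here; it is not the exact implementation used in the competition pipeline, and the handling of the boundary value is a simplification.

def oob_failure(oob_percentage: float, oob_tolerance: float) -> bool:
    # A test fails once the fraction of the ego-car outside the lane exceeds the tolerance:
    # tolerance 0.0 -> any part of the car leaving the lane fails the test;
    # tolerance 1.0 -> only the entire car leaving the lane fails the test.
    return oob_percentage > oob_tolerance

print(oob_failure(oob_percentage=0.30, oob_tolerance=0.95))  # False
print(oob_failure(oob_percentage=0.60, oob_tolerance=0.50))  # True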
Experiment Statistics
The generation_stats.csv contains statistics about the test generation, including:
Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.
Test outcome. The test outcome contains the number of passed tests, failed tests, and tests in error. Passed and failed tests are defined by the OOB Tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separate category.
The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.
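Because every experiment folder ships its own generation_stats.csv, these statistics can be aggregated across experiments with a short pandas script. The sketch below assumes the dataset has been extracted under data/; the CSV column names are not specified here, so the commented-out selection uses placeholder names that should be adapted to the actual header.

from pathlib import Path
import pandas as pd

# Collect generation_stats.csv from every experiment folder under data/.
frames = []
for stats_file in Path("data").glob("*/generation_stats.csv"):
    df = pd.read_csv(stats_file)
    df["experiment"] = stats_file.parent.name  # keep track of the originating experiment folder
    frames.append(df)

stats = pd.concat(frames, ignore_index=True)
print(stats.head())
# Placeholder column names for the counters described above (total/valid/invalid, passed/failed/error);
# adapt them to the actual header before aggregating:
# print(stats.groupby("experiment")[["total_generated", "valid_tests", "failed_tests"]].sum())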
Test Cases and Executions
Each test..json file contains information about a test case and, if the test case is valid, the data observed during its execution as a driving simulation.
The data about the test case definition include:
The road points. The list of points in a 2D space that identifies the center of the virtual road, and their interpolation using cubic splines (interpolated_points)
The test ID. The unique identifier of the test in the experiment.
Validity flag and explanation. A flag that indicates whether the test is valid or not, and a brief message describing why the test is not considered valid (e.g., the road contains sharp turns or the road self-intersects).
The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.
{ "type": "object", "properties": { "id": { "type": "integer" }, "is_valid": { "type": "boolean" }, "validation_message": { "type": "string" }, "road_points": { §\label{line:road-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "interpolated_points": { §\label{line:interpolated-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "test_outcome": { "type": "string" }, §\label{line:test-outcome}§ "description": { "type": "string" }, "execution_data": { "type": "array", "items": { "$ref" : "schemas/simulationdata" } } }, "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ] }
Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at constant frequency and includes absolute position, rotation, and velocity of the ego-car, its speed in Km/h, and control inputs from the driving agent (steering, throttle, and braking). Additionally, execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much the car is outside the lane).
The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.
{ "$id": "schemas/simulationdata", "type": "object", "properties": { "timer" : { "type": "number" }, "pos" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel_kmh" : { "type": "number" }, "steering" : { "type": "number" }, "brake" : { "type": "number" }, "throttle" : { "type": "number" }, "is_oob" : { "type": "number" }, "oob_percentage" : { "type": "number" } §\label{line:oob-percentage}§ }, "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ] }
Dataset Content
The TRAVEL dataset is a lively initiative so the content of the dataset is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition, and data collected in the context of our recent work on test selection (SDC-Scissor work and tool) and test prioritization (automated test cases prioritization work for SDCs).
SBST CPS Tool Competition Data
The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).
Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject
DEFAULT | 200 × 200 | 120 | 5 (real time) | 0.95 | BeamNG.AI - 0.7
SBST | 200 × 200 | 70 | 2 (real time) | 0.5 | BeamNG.AI - 0.7
Specifically, the TRAVEL dataset contains 8 repetitions for each of the above configurations for each test generator totaling 64 experiments.
SDC Scissor
With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.
Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject
SDC-SCISSOR | 200 × 200 | 120 | 16 (real time) | 0.5 | BeamNG.AI - 1.5
The dataset contains 9 experiments with the above configuration. For generating your own data with SDC-Scissor follow the instructions in its repository.
Dataset Statistics
Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators, grouped by experiment configuration. Some 25,845 test cases were generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 Km/h) and defining a high OOB tolerance (i.e., 0.95), and we also ran the test generators using a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 Km/h) while setting the OOB tolerance to a lower value (i.e., 0.85). We also collected some 5,971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours using Frenetic as a test generator and defining a more realistic OOB tolerance (i.e., 0.50).
Generating new Data
New data, i.e., test cases, can be generated using the SBST CPS Tool Competition pipeline and the BeamNG.tech driving simulator.
Extensive instructions on how to install both are provided in the SBST CPS Tool Competition pipeline documentation.
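For illustration only, such a run could also be scripted from Python as sketched below; every flag name and value is an assumption, and the pipeline documentation remains the authoritative reference for the actual command-line interface.

# Illustration only: launch the SBST CPS Tool Competition pipeline from Python.
# All flag names and values below are assumptions; consult the pipeline documentation
# for the actual command-line interface and for how to point the executor at a local
# BeamNG.tech installation.
import subprocess

subprocess.run(
    [
        "python", "competition.py",
        "--time-budget", "7200",            # assumed: generation budget in seconds (2 h)
        "--executor", "beamng",             # assumed: execute tests in BeamNG.tech
        "--map-size", "200",                # assumed: 200 x 200 m map
        "--module-name", "my_generator",    # hypothetical module with your test generator
        "--class-name", "MyTestGenerator",  # hypothetical generator class
    ],
    check=True,
)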
https://www.marketreportanalytics.com/privacy-policy
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033.

This growth is segmented across various applications, with large enterprises leading the adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations facilitating better data understanding and communication of findings. Geographic regions, such as North America and Europe, currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to effectively utilize these tools and the potential for data bias if not handled appropriately.

The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. Open-source options like KNIME and R packages (Rattle, Pandas Profiling) offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses.

The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput technologies generate a considerable amount of data, which often requires bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform-independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering, and converting data produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expressions, or command-line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease-predisposing variants in next-generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction, and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Flexibility in terms of input file handling provides long-term potential functionality in high-throughput analysis pipelines, as the program is not limited by currently existing applications and data formats. HTDP is available as Open Source software (https://github.com/pmadanecki/htdp).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included. The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings. The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.
About the dataset:
XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes;
CSV file with prompts and responses from prompt engineering;
CSV files with prompts and responses from synthetic dataset generation;
CSV file with results from human evaluation;
TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation.