Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A spatial data set of 21,434 random samples was generated, spread evenly across 625 locations, u_i=(u_i, v_i ) (i=1,2,⋯,625). The data-generating process was inspired by that of Fotheringham et al. (2017). For each data point, four independent variables (g_1, g_2, x_1, z_1) were generated from a multivariate normal distribution with zero means and an identity variance-covariance matrix. The values of g_1 and g_2 were then replaced by their means at each location to simulate group-level, spatially related variables, while x_1 and z_1 were treated as sample-level variables. Samples at the same location share coefficient values.
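The data-generating process above can be sketched in Python. This is an illustrative reconstruction, not the authors' original script: the 25 × 25 grid layout and the round-robin assignment of samples to locations are assumptions.

```python
import random
from collections import defaultdict

random.seed(0)

N_LOCATIONS = 625      # assumed 25 x 25 grid of locations u_i = (u_i, v_i)
N_SAMPLES = 21_434     # total number of random samples

locations = [(u, v) for u in range(25) for v in range(25)]

# Four independent variables per sample: zero means and an identity
# variance-covariance matrix reduce to independent N(0, 1) draws.
samples = [
    {name: random.gauss(0.0, 1.0) for name in ("g1", "g2", "x1", "z1")}
    for _ in range(N_SAMPLES)
]

# Spread samples evenly across the locations (round robin, an assumption).
assignments = [locations[j % N_LOCATIONS] for j in range(N_SAMPLES)]

# Replace g1 and g2 with their per-location means to obtain
# group-level, spatially related variables.
groups = defaultdict(list)
for sample, loc in zip(samples, assignments):
    groups[loc].append(sample)
for members in groups.values():
    for name in ("g1", "g2"):
        mean = sum(m[name] for m in members) / len(members)
        for m in members:
            m[name] = mean
```

After this substitution, every sample at a given location carries the same g_1 and g_2 values, while x_1 and z_1 remain sample-level draws.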
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. It contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
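The two-stage design above can be sketched as follows. The actual sampling script is the R resource mentioned; this is a Python sketch over a hypothetical frame, in which the stratum names, EA identifiers, and household counts are invented for illustration.

```python
import random

random.seed(42)

HOUSEHOLDS_PER_EA = 25
TARGET_HOUSEHOLDS = 8_000
n_eas_total = TARGET_HOUSEHOLDS // HOUSEHOLDS_PER_EA  # 320 enumeration areas

# Hypothetical sampling frame: strata defined by geo_1 x urban/rural,
# each containing enumeration areas (EAs) with household counts.
frame = {
    ("prov1", "urban"): {"EA001": 180, "EA002": 140, "EA003": 210},
    ("prov1", "rural"): {"EA004": 90, "EA005": 110},
    ("prov2", "urban"): {"EA006": 160, "EA007": 150},
}

total_hh = sum(sum(eas.values()) for eas in frame.values())

sample = []
for stratum, eas in frame.items():
    # Stage 1: allocate EAs to the stratum proportionally to its size.
    stratum_hh = sum(eas.values())
    n_eas = max(1, round(n_eas_total * stratum_hh / total_hh))
    chosen_eas = random.sample(sorted(eas), min(n_eas, len(eas)))
    # Stage 2: select a fixed 25 households within each chosen EA.
    for ea in chosen_eas:
        hh_ids = [f"{ea}-{i:03d}" for i in range(eas[ea])]
        sample.extend(random.sample(hh_ids, HOUSEHOLDS_PER_EA))
```

With a full frame the allocation would be capped so that the selected EAs times 25 reach the 8,000-household target.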
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample size calculation per Cochrane review group; random review-number generator (used to help pick reviews at random)
https://www.marketresearchforecast.com/privacy-policy
The Test Data Generation Tools market is experiencing robust growth, driven by the increasing demand for high-quality software and the rising adoption of agile and DevOps methodologies. The market's expansion is fueled by several factors, including the need for realistic and representative test data to ensure thorough software testing, the growing complexity of applications, and the increasing pressure to accelerate software delivery cycles. The market is segmented by type (Random, Pathwise, Goal, Intelligent) and application (Large Enterprises, SMEs), each demonstrating unique growth trajectories. Intelligent test data generation, offering advanced capabilities like data masking and synthetic data creation, is gaining significant traction, while large enterprises are leading the adoption due to their higher testing volumes and budgets. Geographically, North America and Europe currently hold the largest market shares, but the Asia-Pacific region is expected to witness significant growth due to rapid digitalization and increasing software development activities. Competitive intensity is high, with a mix of established players like IBM and Informatica and emerging innovative companies continuously introducing advanced features and functionalities. The market's growth is, however, constrained by challenges such as the complexity of implementing and managing test data generation tools and the need for specialized expertise. Overall, the market is projected to maintain a healthy growth rate throughout the forecast period (2025-2033), driven by continuous technological advancements and evolving software testing requirements. While the precise CAGR isn't provided, assuming a conservative yet realistic CAGR of 15% based on industry trends and the factors mentioned above, the market is poised for significant expansion. 
This growth will be fueled by the increasing adoption of cloud-based solutions, improved data masking techniques for enhanced security and privacy, and the rise of AI-powered test data generation tools that automatically create comprehensive and realistic datasets. The competitive landscape will continue to evolve, with mergers and acquisitions likely shaping the market structure. Furthermore, the focus on data privacy regulations will influence the development and adoption of advanced data anonymization and synthetic data generation techniques. The market will see further segmentation as specialized tools catering to specific industry needs (e.g., financial services, healthcare) emerge. The long-term outlook for the Test Data Generation Tools market remains positive, driven by the relentless demand for higher software quality and faster development cycles.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recently, distribution element trees (DETs) were introduced as an accurate and computationally efficient method for density estimation. In this work, we demonstrate that the DET formulation promotes an easy and inexpensive way to generate random samples similar to a smooth bootstrap. These samples can be generated unconditionally, but also, without further complications, conditionally using available information about certain probability-space components. This article is accompanied by the R codes that were used to produce all simulation results. Supplementary material for this article is available online.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents), together with data from their execution in a state-of-the-art, physically accurate driving simulator called BeamNG.tech. Virtual roads consist of sequences of road points interpolated using cubic splines.
Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze them in the context of test regression. We focus on test selection and test prioritization, given their importance for developing high-quality software following the DevOps paradigms.
This dataset builds on top of our previous work in this area, including work on:
test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021);
test selection (SDC-Scissor and related tooling);
test prioritization (automated test case prioritization work for SDCs).
Dataset Overview
The TRAVEL dataset is available under the data folder and is organized as a set of experiment folders. Each of these folders is generated by running the test generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on the generated tests (generation_stats.csv), and the faults found (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json).
The following sections describe what each of those files contains.
Experiment Description
The experiment_description.csv contains the settings used to generate the data, including:
Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.
The size of the map. The size, in meters, of the square map that bounds the area in which the virtual roads develop.
The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated testing the BeamNG.AI and the end-to-end Dave2 systems.
The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms, for generating tests.
The speed limit. The maximum speed at which the driving agent under test can travel.
Out of Bound (OOB) tolerance. The test cases' oracle that defines the tolerable amount of the ego-car that can lie outside the lane boundaries. This parameter ranges between 0.0 and 1.0. In the former case, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; in the latter case, a test failure triggers only if the entire body of the ego-car falls outside the lane.
Experiment Statistics
The generation_stats.csv contains statistics about the test generation, including:
Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.
Test outcome. The test outcome reports the number of passed tests, failed tests, and tests in error. Passed and failed tests are defined by the OOB tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing still. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separate category.
The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.
Test Cases and Executions
Each test..json contains information about a test case and, if the test case is valid, the data observed during its execution as a driving simulation.
The data about the test case definition include:
The road points. The list of points in a 2D space that identifies the center of the virtual road, and their interpolation using cubic splines (interpolated_points)
The test ID. The unique identifier of the test in the experiment.
Validity flag and explanation. A flag that indicates whether the test is valid, and a brief message describing why an invalid test was rejected (e.g., the road contains sharp turns or self-intersects)
The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.
{
  "type": "object",
  "properties": {
    "id": { "type": "integer" },
    "is_valid": { "type": "boolean" },
    "validation_message": { "type": "string" },
    "road_points": { "type": "array", "items": { "$ref": "schemas/pair" } },
    "interpolated_points": { "type": "array", "items": { "$ref": "schemas/pair" } },
    "test_outcome": { "type": "string" },
    "description": { "type": "string" },
    "execution_data": { "type": "array", "items": { "$ref": "schemas/simulationdata" } }
  },
  "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ]
}
Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at a constant frequency and includes the absolute position, rotation, and velocity of the ego-car, its speed in km/h, and the control inputs from the driving agent (steering, throttle, and braking). Additionally, the execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much of the car is outside the lane).
The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.
{
  "$id": "schemas/simulationdata",
  "type": "object",
  "properties": {
    "timer": { "type": "number" },
    "pos": { "type": "array", "items": { "$ref": "schemas/triple" } },
    "vel": { "type": "array", "items": { "$ref": "schemas/triple" } },
    "vel_kmh": { "type": "number" },
    "steering": { "type": "number" },
    "brake": { "type": "number" },
    "throttle": { "type": "number" },
    "is_oob": { "type": "number" },
    "oob_percentage": { "type": "number" }
  },
  "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ]
}
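Under these schemas, a test file can be inspected with a few lines of Python. The inline record below is fabricated for illustration; real data live in the test..json files described above.

```python
import json

# Hypothetical test-case record following the two schemas above.
raw = """
{
  "id": 1,
  "is_valid": true,
  "validation_message": "",
  "road_points": [[10.0, 20.0], [30.0, 40.0]],
  "interpolated_points": [[10.0, 20.0], [20.0, 30.0], [30.0, 40.0]],
  "test_outcome": "FAIL",
  "execution_data": [
    {"timer": 0.1, "pos": [0.0, 0.0, 0.0], "vel": [1.0, 0.0, 0.0],
     "vel_kmh": 3.6, "steering": 0.0, "brake": 0.0, "throttle": 0.5,
     "is_oob": 0, "oob_percentage": 0.0}
  ]
}
"""

test = json.loads(raw)
if test["is_valid"]:
    # Largest OOB percentage observed over the whole simulation.
    max_oob = max(s["oob_percentage"] for s in test["execution_data"])
```

Invalid tests carry no execution_data, which is why the schema does not list it among the required properties.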
Dataset Content
The TRAVEL dataset is a living initiative, so its content is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition and data collected in the context of our recent work on test selection (the SDC-Scissor work and tool) and test prioritization (automated test case prioritization work for SDCs).
SBST CPS Tool Competition Data
The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).
| Name | Map Size (m × m) | Max Speed (km/h) | Budget (h) | OOB Tolerance (%) | Test Subject |
|------|------------------|------------------|------------|-------------------|--------------|
| DEFAULT | 200 × 200 | 120 | 5 (real time) | 0.95 | BeamNG.AI - 0.7 |
| SBST | 200 × 200 | 70 | 2 (real time) | 0.5 | BeamNG.AI - 0.7 |
Specifically, the TRAVEL dataset contains 8 repetitions of each of the above configurations for each test generator, totaling 64 experiments.
SDC Scissor
With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.
| Name | Map Size (m × m) | Max Speed (km/h) | Budget (h) | OOB Tolerance (%) | Test Subject |
|------|------------------|------------------|------------|-------------------|--------------|
| SDC-SCISSOR | 200 × 200 | 120 | 16 (real time) | 0.5 | BeamNG.AI - 1.5 |
The dataset contains 9 experiments with the above configuration. To generate your own data with SDC-Scissor, follow the instructions in its repository.
Dataset Statistics
Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators, grouped by experiment configuration. Some 25,845 test cases were generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 km/h) and defining a high OOB tolerance (i.e., 0.95); we also ran the test generators with a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 km/h) while setting the OOB tolerance to a lower value (i.e., 0.5). We also collected some 5,971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours, using Frenetic as the test generator and defining a more realistic OOB tolerance (i.e., 0.50).
Generating new Data
Generating new data, i.e., test cases, can be done using the SBST CPS Tool Competition pipeline and the driving simulator BeamNG.tech. Extensive instructions on how to install both are provided in the SBST CPS Tool Competition pipeline documentation.
Lecture for the course Biomedical Signal Processing Techniques in the master's programme at the School of Electrical Engineering, University of Belgrade.
Attribution-NonCommercial-NoDerivs 2.5 (CC BY-NC-ND 2.5)https://creativecommons.org/licenses/by-nc-nd/2.5/
License information was derived automatically
NADA (Not-A-Database) is an easy-to-use geometric shape data generator that allows users to define non-uniform multivariate parameter distributions to test novel methodologies. The full open-source package is provided at GIT:NA_DAtabase. See Technical Report for details on how to use the provided package.
This database includes 3 repositories:
Each image can be used for classification (shape/color) or regression (radius/area) tasks.
All datasets can be modified and adapted to the user's research question using the included open source data generator.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains turbine- and plant-level power outputs for 252,500 cases of diverse wind plant layouts operating under a wide range of yawing and atmospheric conditions. The power outputs were computed using the Gaussian wake model in NREL's FLOw Redirection and Induction in Steady State (FLORIS) model, version 2.3.0. The 252,500 cases include 500 unique wind plants generated randomly by a specialized Plant Layout Generator (PLayGen) that samples randomized realizations of wind plant layouts from one of four canonical configurations: (i) cluster, (ii) single string, (iii) multiple string, (iv) parallel string. Other wind plant layout parameters were also randomly sampled, including the number of turbines (25-200) and the mean turbine spacing (3D-10D, where D denotes the turbine rotor diameter). For each layout, 500 different sets of atmospheric conditions were randomly sampled. These include wind speed in 0-25 m/s, wind direction in 0 deg.-360 deg., and turbulence intensity chosen from low (6%), medium (8%), and high (10%). For each atmospheric inflow scenario, the individual turbine yaw angles were randomly sampled from a one-sided truncated Gaussian on the interval 0 deg.-30 deg. oriented relative to wind inflow direction.
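The sampling scheme for atmospheric conditions and yaw angles described above can be sketched as follows. This is an illustrative reconstruction; in particular, the standard deviation of the one-sided truncated Gaussian for the yaw angles is an assumption, as the dataset description does not state it.

```python
import random

random.seed(7)

def sample_conditions():
    """Sample one atmospheric scenario as described for the dataset."""
    return {
        "wind_speed": random.uniform(0.0, 25.0),       # m/s
        "wind_direction": random.uniform(0.0, 360.0),  # degrees
        "turbulence_intensity": random.choice([0.06, 0.08, 0.10]),
    }

def sample_yaw(sigma=10.0):
    """One-sided truncated Gaussian on [0, 30] degrees, relative to the
    inflow direction, via rejection sampling (sigma is assumed)."""
    while True:
        yaw = abs(random.gauss(0.0, sigma))  # one-sided: fold to positive
        if yaw <= 30.0:
            return yaw

# 500 scenarios per layout, one yaw angle per turbine per scenario.
conditions = [sample_conditions() for _ in range(500)]
yaws = [sample_yaw() for _ in range(100)]
```

Each of the 500 layouts would be paired with 500 such scenarios, yielding the 250,000 random cases in the dataset (plus the 2,500 yaw-optimized cases).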
This random data is supplemented with a collection of yaw-optimized samples, in which FLORIS was used to determine the turbine yaw angles that maximize power production for the entire plant. To generate these data, a subset of cases was selected (50 atmospheric conditions for each of 50 layouts, for a total of 2,500 additional cases) for which FLORIS was re-run with wake-steering control optimization. The IEA onshore reference turbine, which has a 130 m rotor diameter, a 110 m hub height, and a rated power capacity of 3.4 MW, was used as the turbine for all simulations.
The simulations were performed using NREL's Eagle high performance computing system in February 2021 as part of the Spatial Analysis for Wind Technology Development project funded by the U.S. Department of Energy Wind Energy Technologies Office. The data was collected, reformatted, and preprocessed for this OEDI submission in May 2023 under the Foundational AI for Wind Energy project funded by the U.S. Department of Energy Wind Energy Technologies Office. This dataset is intended to serve as a benchmark against which new artificial intelligence (AI) or machine learning (ML) tools may be tested. Baseline AI/ML methods for analyzing this dataset have been implemented, and a link to their repository containing those models has been provided.
The .h5 data file structure can be found in the GitHub repository under explore_wind_plant_data_h5.ipynb.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This package includes:
Two Stata .dta files with information on patents assigned by the United States Patent and Trademark Office between 1976 and 2014: a random sample of 250,000 US patents, and data on patents owned by Intellectual Ventures, RPX, and several other companies. The variables include, for example: grant date, application date, forward and backward citations, renewals, and claims.
Source codes and methods used in generating and analyzing the two data files.
A bachelor thesis with further information will be linked here.
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global Quantum Random Number Generator RNG market size is USD 555.9 million in 2024. It will expand at a compound annual growth rate (CAGR) of 72.60% from 2024 to 2031.
North America held the largest market share, more than 40% of global revenue, with a market size of USD 222.36 million in 2024, and will grow at a compound annual growth rate (CAGR) of 70.8% from 2024 to 2031.
Europe accounted for a market share of over 30% of global revenue, with a market size of USD 166.77 million.
Asia Pacific held a market share of around 23% of global revenue, with a market size of USD 127.86 million in 2024, and will grow at a CAGR of 74.6% from 2024 to 2031.
Latin America had a market share of more than 5% of global revenue, with a market size of USD 27.80 million in 2024, and will grow at a CAGR of 72.0% from 2024 to 2031.
Middle East and Africa had a market share of around 2% of global revenue, estimated at USD 11.12 million in 2024, and will grow at a CAGR of 72.3% from 2024 to 2031.
Cloud held the dominant segment in the Quantum Random Number Generator RNG market in 2024.
Market Dynamics of Quantum Random Number Generator RNG Market
Key Drivers for Quantum Random Number Generator RNG Market
Increasing need for random numbers in cryptographic and computing applications
The QRNG is an ideal random key generator since it generates entropy from intrinsic quantum-physical properties. Nowadays, applications demand a huge number of keys and randomization to achieve total security. These include key vaults, games, IoT devices, AI/ML, blockchains, simulations, and critical infrastructure. QRNGs serve these applications, in which trust in randomness is paramount. Furthermore, QRNGs are used in encryption for a wide range of applications, including cryptography, numerical simulation, gambling, and game design.
Growing adoption of quantum computing
The increasing use of quantum computing is boosting the market for Quantum Random Number Generators (RNG) as it creates a need for improved random number generation capabilities. The accurate abilities of quantum computing enable RNGs to produce truly random numbers, essential for secure communication and encryption. Advancements in quantum computing will lead to a higher demand for dependable RNGs, driving market expansion to meet the changing requirements of cybersecurity and data encryption.
Restraint Factor for the Quantum Random Number Generator RNG Market
High initial investment
A significant initial investment hinders the Quantum Random Number Generator (RNG) market, creating a barrier for new entrants and small companies looking to invest in RNG technology. The substantial upfront costs involved in the research, development, and deployment of quantum RNG solutions may discourage potential entrants from joining the market. This limitation impedes market growth by restricting innovation and competition, potentially constraining the market's expansion.
Impact of Covid-19 on the Quantum Random Number Generator RNG Market
The effect of COVID-19 on the Quantum Random Number Generator RNG market was mixed. Although the pandemic initially caused disruptions in supply chains and slowed certain trends, the increased focus on cybersecurity and data protection during remote work and digital interactions enhanced the need for secure communication solutions such as quantum RNGs. With a focus on safeguarding information, both organizations and governments fueled growth in the quantum RNG market despite pandemic-related obstacles.
Introduction of the Quantum Random Number Generator RNG Market
The Quantum Random Number Generator (QRNG) is a highly sophisticated engineering innovation that combines complex deep-tech fields such as semiconductors, optoelectronics, high-precision electronics, and quantum physics to achieve the highest level of randomness possible. QRNG has proven to be a critical enabling technology for quantum-level security in mobile devices, data centres, and medical implants. QRNGs provide a significant enhancement over ordinary random number generators (RNGs), which have been used for years in a variety of business applications. Several factors, including th...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection. The SQLMAP tool was used to perform the attacks.
The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset was collected to train the detection models (D1); the other was collected using different attacks than those used in training, in order to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
| Dataset | Aim | Samples | Benign-malicious traffic ratio |
|---------|-----|---------|--------------------------------|
| D1 | Training | 400,003 | 50% |
| D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator with the ipt_netflow sensor installed. The sensor is a Linux kernel module that hooks into Iptables, processes the packets, and converts them into NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1800 seconds (30 minutes).
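Assuming the standard ipt-netflow sysctl interface, such a configuration might look like the following sketch (the knob names should be verified against the installed module version):

```shell
# Export a flow after 15 s of inactivity or 30 min of activity,
# emitting NetFlow v5 records (values as used by DOROTHEA).
sysctl net.netflow.inactive_timeout=15
sysctl net.netflow.active_timeout=1800
sysctl net.netflow.protocol=5
```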
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. These tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks: on the one hand, it routes packets to the Internet; on the other hand, it forwards traffic to a NetFlow data generation node (packets received from the Internet are handled similarly).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
| Parameters | Description |
|------------|-------------|
| '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
| --level=5 | Increase the probability of a false positive identification |
| --risk=3 | Increase the probability of extracting data |
| --random-agent | Select the User-Agent randomly |
| --batch | Never ask for user input; use the default behavior |
| --answers="follow=Y" | Predefine answers to yes |
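Putting the parameters together, each attack node would issue an invocation along these lines (the target URL is a placeholder, not part of the dataset):

```shell
# Illustrative sqlmap invocation assembling the parameters above.
sqlmap -u "http://victim.example/form.php?id=1" \
  --banner --current-user --current-db --hostname --is-dba \
  --users --passwords --privileges --roles \
  --dbs --tables --columns --schema --count --dump --comments \
  --level=5 --risk=3 --random-agent --batch --answers="follow=Y"
```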
Every node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.
The MySQL server ran MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques apply various transformations to existing data samples to create new ones, including random rotations, translations, scaling, flips, and more. Augmentation helps increase the dataset size, introduce natural variations, and improve model performance by making the model more invariant to specific transformations.

The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name and date of birth. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train a neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data, which is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
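As a toy illustration of such augmentation, the sketch below applies random rotation, scaling, translation, and flipping to 2D landmark points rather than images; the parameter ranges are invented for the example and are not from the dataset.

```python
import math
import random

random.seed(1)

def augment(points, angle_deg, scale, dx, dy, flip_x=False):
    """Apply a flip, scaling, rotation, and translation to 2D points --
    the kinds of transformations used to derive new training samples
    from existing ones."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = []
    for x, y in points:
        if flip_x:                                           # horizontal flip
            x = -x
        x, y = x * scale, y * scale                          # scaling
        x, y = x * cos_t - y * sin_t, x * sin_t + y * cos_t  # rotation
        out.append((x + dx, y + dy))                         # translation
    return out

original = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Generate several randomly transformed variants of the same sample.
variants = [
    augment(original,
            angle_deg=random.uniform(-15, 15),
            scale=random.uniform(0.9, 1.1),
            dx=random.uniform(-0.1, 0.1),
            dy=random.uniform(-0.1, 0.1),
            flip_x=random.random() < 0.5)
    for _ in range(5)
]
```

Image libraries apply the same idea pixel-wise; the label (here, the point identities) is preserved while the input varies.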
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This publication documents the various datasets generated using the kac_drumset codebase. The aim of kac_drumset is to provide a robust framework for the generation and analysis of arbitrarily shaped drums. The source code for this project is available here: https://github.com/lewiswolf/kac_drumset.
Background
Arbitrarily shaped drums are a strange family of percussion instruments and a wholly meta-physical construction in this contemporary setting. These percussive instruments possess a number of interesting musical characteristics resulting from their particular geometric designs. As it currently stands, these instruments remain largely unexplored throughout musical practice, as they were originally devised as a collection of hypothetical mathematical objects. These datasets serve to sonify these objects so as to explore these conceptual constructions in the audio domain.
Usage
To use these datasets, first install kac_drumset:
pip install "git+https://github.com/lewiswolf/kac_drumset.git#egg=kac_drumset"
And then in python:
from kac_drumset import (
    # methods
    loadDataset,
    transformDataset,
    # classes
    TorchDataset,
)

dataset: TorchDataset = transformDataset(
    # load a dataset (any folder containing a metadata.json)
    loadDataset('absolute/path/to/data'),
    # alter the dataset representation, either as an end2end, fft or mel.
    {'output_type': 'end2end'},
)

# use the dataset
for i in range(dataset.__len__()):
    x, y = dataset.__getitem__(i)
    ...
For more details on using kac_drumset, see the project's documentation.
2000 Convex Polygonal Drums of Varying Size
Each sample in this dataset corresponds to a randomly generated convex polygon. The audio for each sample was generated using a two-dimensional physical model of a drum. Each sample is one second long and decays linearly.
Contained in this dataset are ten different sizes of drums - 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6 - each of which is a measure of the longest vertex of each drum in meters. There are 40 different drums sampled for each size. Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the vertices of each polygon, normalised to the unit interval, and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in metadata.json, alongside being made available audibly as a 24-bit .wav and graphically as a .png.
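Random convex polygons with vertices normalised to the unit interval can be produced in several ways; one common construction (an assumption here, not necessarily the one kac_drumset uses) samples random points in the unit square and keeps their convex hull:

```python
import random

def cross(o, a, b):
    """2D cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def random_convex_polygon(n_candidates=20, seed=1):
    """Sample points uniformly in the unit square and keep their convex
    hull, giving a random convex polygon with vertices in [0, 1]."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n_candidates)]
    return convex_hull(pts)

poly = random_convex_polygon()
```

Scaling the hull afterwards would give the different drum sizes described above.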
5000 Circular Drums of Varying Size
Each sample in this dataset corresponds to a randomly generated circular drum. The audio for each sample was generated using additive synthesis, inferred using a closed form solution to the two dimensional wave equation. Each sample is one second long and decays exponentially.
Contained in this dataset are 1000 different drums, each determined by a randomly generated size in the interval (0.1, 2.0) meters. Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the size of each drum and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in metadata.json, alongside being made available audibly as a 24-bit .wav and graphically as a .png.
5000 Rectangular Drums of Varying Dimension
Each sample in this dataset corresponds to a randomly generated rectangular drum. The audio for each sample was generated using additive synthesis, inferred using a closed form solution to the two dimensional wave equation. Each sample is one second long and decays exponentially.
Contained in this dataset are 1000 different drums, each determined by a randomly generated size in the interval (0.1, 2.0) meters and aspect ratio in (0.25, 4.0). Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the size and aspect ratio of each drum, and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in metadata.json, alongside being made available audibly as a 24-bit .wav and graphically as a .png.
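As a rough illustration of the additive-synthesis approach described above: the modal frequencies of an ideal rectangular membrane have the closed form f_mn = (c/2) * sqrt((m/Lx)^2 + (n/Ly)^2), and the sound is a sum of exponentially decaying sinusoids at those frequencies. The sketch below is a minimal version of that idea; the wave speed c and the decay rate are arbitrary assumptions, and this is not the kac_drumset implementation:

```python
import math

SAMPLE_RATE = 48000  # Hz, as in the dataset
DURATION = 1.0       # seconds

def rectangular_modes(width, aspect, c=200.0, n_modes=5):
    """Modal frequencies f_mn = (c/2) * sqrt((m/Lx)^2 + (n/Ly)^2) of an
    ideal rectangular membrane. c is an assumed wave speed in m/s."""
    lx, ly = width, width / aspect
    freqs = []
    for m in range(1, n_modes + 1):
        for n in range(1, n_modes + 1):
            freqs.append((c / 2) * math.sqrt((m / lx) ** 2 + (n / ly) ** 2))
    return sorted(freqs)

def synthesise(freqs, decay=5.0):
    """Additive synthesis: sum of sinusoids, one per mode, under a shared
    exponential decay envelope, normalised to stay within [-1, 1]."""
    n_samples = int(SAMPLE_RATE * DURATION)
    out = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        env = math.exp(-decay * t)
        s = sum(math.sin(2 * math.pi * f * t) for f in freqs)
        out.append(env * s / len(freqs))
    return out

audio = synthesise(rectangular_modes(0.5, 1.5))
```

A physically faithful model would also weight each mode by the strike location, which is what makes the five strikes per drum sound different.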
Reranking dataset built from only 179 annotated queries, with a random sampling technique used to increase the final sample count. This dataset is used to train the Rearank model in the paper REARANK: Reasoning Re-ranking Agent via Reinforcement Learning. Below is the data-generation code:
import re
import os
from datasets import Dataset, load_dataset
from random import randint, seed, choice
… See the full description on the dataset page: https://huggingface.co/datasets/le723z/rearank_12k.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Zhang et al. (https://link.springer.com/article/10.1140/epjb/e2017-80122-8) propose a temporal random network whose dynamics follow a Markov process, extending the static definition of a random graph with a fixed number of nodes n and edge probability p to a continuous-time network history. Defining lambda as the probability per time granule that a new edge appears and mu as the probability per time granule that an existing edge disappears, Zhang et al. show that the equilibrium probability of an edge is p = lambda / (lambda + mu). Our implementation, a Python package that we refer to as RandomDynamicGraph (https://github.com/ScanLab-ossi/DynamicRandomGraphs), generates large-scale dynamic random graphs according to the defined density. The package focuses on massive data generation; it uses efficient math calculations, writes to file instead of in-memory when datasets are too large, and supports multi-processing. Please note the datetime is arbitrary.
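The edge dynamics are easy to verify empirically. The sketch below (ours, not taken from the RandomDynamicGraph package) simulates a single edge as a two-state Markov chain and compares the fraction of time the edge is present with the equilibrium probability lambda / (lambda + mu):

```python
import random

def simulate_edge(lam, mu, steps, seed=0):
    """Simulate one edge as a two-state Markov chain: absent -> present
    with probability lam per time granule, present -> absent with
    probability mu. Returns the fraction of time the edge is present."""
    rng = random.Random(seed)
    present = False
    on = 0
    for _ in range(steps):
        if present:
            if rng.random() < mu:
                present = False
        else:
            if rng.random() < lam:
                present = True
        on += present
    return on / steps

lam, mu = 0.02, 0.05
empirical = simulate_edge(lam, mu, 200_000)
equilibrium = lam / (lam + mu)  # 0.2857...
```

With a long enough run, `empirical` settles close to `equilibrium`, consistent with the result quoted above.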
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
To evaluate land use and land cover (LULC) maps, an independent and representative test dataset is required. Here, a test dataset was generated via a stratified random sampling approach across all areas of Fiji not used to generate training data (i.e., all Tikinas that did not contain a training data point were valid for sampling to generate the test dataset). Following equation 13 in Olofsson et al. (2014), the sample size of the test dataset was 834. This was based on a desired standard error of the overall accuracy score of 0.01 and a user's accuracy of 0.75 for all classes. The strata for sampling test samples were the eight LULC classes: water, mangrove, bare soil, urban, agriculture, grassland, shrubland, and trees.
There are different strategies for allocating samples to strata for evaluating LULC maps, as discussed by Olofsson et al. (2014). Equal allocation of samples to strata ensures coverage of rarely occurring classes and minimises the standard error of estimators of user's accuracy. However, equal allocation does not optimise the standard error of the estimator of overall accuracy. Proportional allocation of samples to strata, based on the proportion of the strata in the overall dataset, can result in rarely occurring classes being underrepresented in the test dataset. Optimal allocation of samples to strata is challenging to implement when there are multiple evaluation objectives. Olofsson et al. (2014) recommend a "simple" allocation procedure where 50 to 100 samples are allocated to rare classes and proportional allocation is used to allocate samples to the remaining majority classes. The number of samples to allocate to rare classes can be determined by iterating over different allocations and computing estimated standard errors for performance metrics. Here, the 2021 all-Fiji LULC map, minus the Tikinas used for generating training samples, was used to estimate the proportional areal coverage of each LULC class. The LULC map from 2021 was used to permit comparison with other LULC products with a 2021 layer, notably the ESA WorldCover 10m v200 2021 product.
The 2021 LULC map was dominated by the tree class (74% of the area classified), and the remaining classes each had less than 10% coverage. Therefore, a "simple" allocation of 100 samples to each of the seven minority classes and 133 samples to the tree class was used. This ensured all the minority classes had sufficient coverage in the test set while balancing the requirement to minimise standard errors for the estimate of overall accuracy. The allocated test dataset points were randomly sampled within each stratum and were manually labelled using 2021 annual median RGB composites from Sentinel-2 and Planet NICFI and high-resolution Google Satellite Basemaps.
The Fiji LULC test data is available in GeoJSON format in the file fiji-lulc-test-data.geojson. Each point feature has two attributes: ref_class (the LULC class manually labelled and quality checked) and strata (the stratum the sampled point belongs to, derived from the 2021 all-Fiji LULC map). The following integers correspond to the ref_class and strata labels:
When evaluating LULC maps using test data derived from a stratified sample, the nature of the stratified sampling needs to be accounted for when estimating performance metrics such as overall accuracy, user's accuracy, and producer's accuracy. This is particularly so if the strata do not match the map classes (i.e., when comparing different LULC products). Stehman (2014) provides formulas for estimating performance metrics and their standard errors when using test data with a stratified sampling structure.
To support LULC accuracy assessment, a Python package has been developed which provides implementations of Stehman's (2014) formulas. The package can be installed via:
pip install lulc-validation
with documentation and examples here.
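For intuition, when the strata coincide with the map classes, the stratified estimator of overall accuracy reduces to a size-weighted mean of per-stratum accuracies. The sketch below is a minimal illustration with invented numbers; it is not the lulc-validation API:

```python
def stratified_overall_accuracy(strata_sizes, samples):
    """Stratified estimator of overall accuracy:
    OA = sum_h (N_h / N) * mean_h(y), where y = 1 if the reference label
    matches the map label, and strata_sizes[h] = N_h is the total number
    of pixels in stratum h (which is why the strata totals must be known).

    samples: dict mapping stratum -> list of 0/1 correctness indicators.
    """
    total = sum(strata_sizes.values())
    oa = 0.0
    for h, n_h in strata_sizes.items():
        y = samples[h]
        oa += (n_h / total) * (sum(y) / len(y))
    return oa

# Toy example with two strata of very different sizes: the large stratum
# dominates the estimate even though both contribute four test points.
sizes = {'tree': 7400, 'water': 100}
correct = {'tree': [1, 1, 1, 0], 'water': [1, 0, 0, 0]}
print(stratified_overall_accuracy(sizes, correct))
```

A naive unweighted mean of the eight test points would give 0.5 here, which is why ignoring the stratification biases the estimate.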
In order to compute performance metrics accounting for the stratified nature of the sample, the total number of points/pixels available to be sampled in each stratum must be known. For this dataset that is:
This dataset was generated with support from a Climate Change AI Innovation Grant.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of results between NSA-GA, NSA, and random testing on all programs using the float data type and different ranges.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Synthetic datasets (training/validation) for end-to-end Relation Extraction of relationships between Organisms and Natural Products. The datasets are provided for reproducibility purposes but can also be used to train new models. As in the corresponding article, three subtypes of synthetic datasets are provided:
- Diversity-synt: the seed literature references used in the generation process correspond to the top-500 extracted items per biological kingdom using the GME-sampler.
- Random-synt: 5 datasets of sizes equivalent to Diversity-synt, but using randomly sampled seed literature references.
- Extended-synt: a merge of Diversity-synt and the 5 Random-synt datasets.
All datasets were produced with Vicuna-13b-v1.3. Like the model, the produced synthetic data are subject to the license of the model used for generation; see the original LLaMA model card. LLaMA is licensed under the LLaMA License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
The STEP (Skills Toward Employment and Productivity) Measurement program is the first ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in the labor markets of low-income countries.
The uniquely-designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adults Skills instruments. Modules also gather information about family, health and language.
13 major metropolitan areas: Bogota, Medellin, Cali, Barranquilla, Bucaramanga, Cucuta, Cartagena, Pasto, Ibague, Pereira, Manizales, Monteria, and Villavicencio.
The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members aged 15 to 64 included. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.
The target population for the Colombia STEP survey is all non-institutionalized persons 15 to 64 years old (inclusive) living in private dwellings in urban areas of the country at the time of data collection. This includes all residents except foreign diplomats and non-nationals working for international organizations.
The following groups are excluded from the sample: - residents of institutions (prisons, hospitals, etc.) - residents of senior homes and hospices - residents of other group dwellings such as college dormitories, halfway homes, workers' quarters, etc. - persons living outside the country at the time of data collection.
Sample survey data [ssd]
A stratified 7-stage sample design was used in Colombia; the stratification variable is city-size category.
First Stage Sample The primary sample unit (PSU) is a metropolitan area. A sample of 9 metropolitan areas was selected from the 13 metropolitan areas on the sample frame. The metropolitan areas were grouped according to city-size; the five largest metropolitan areas are included in Stratum 1 and the remaining 8 metropolitan areas are included in Stratum 2. The five metropolitan areas in Stratum 1 were selected with certainty; in Stratum 2, four metropolitan areas were selected with probability proportional to size (PPS), where the measure of size was the number of persons aged 15 to 64 in a metropolitan area.
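Probability-proportional-to-size (PPS) selection, as used for Stratum 2, is commonly implemented with systematic sampling along a cumulated size line. The sketch below uses invented population figures for the eight Stratum 2 metropolitan areas and illustrates the method only; it is not the STEP team's procedure:

```python
import random

def pps_systematic(units, sizes, n, seed=0):
    """Systematic PPS sampling: lay the units on a line with segment
    lengths equal to their size measure, then take n equally spaced
    picks from a random start, selecting the unit each pick lands in."""
    total = sum(sizes)
    step = total / n
    start = random.Random(seed).uniform(0, step)
    picks = [start + i * step for i in range(n)]
    chosen, cum = [], 0.0
    it = iter(zip(units, sizes))
    unit, size = next(it)
    for p in picks:
        while cum + size < p:  # advance to the segment containing p
            cum += size
            unit, size = next(it)
        chosen.append(unit)
    return chosen

# Hypothetical counts of persons aged 15 to 64 (the size measure).
areas = ['Cucuta', 'Cartagena', 'Pasto', 'Ibague',
         'Pereira', 'Manizales', 'Monteria', 'Villavicencio']
pops = [450_000, 620_000, 280_000, 330_000,
        310_000, 260_000, 240_000, 290_000]
sample = pps_systematic(areas, pops, 4)
```

Each area's selection probability is proportional to its population, which is the defining property of PPS selection.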
Second Stage Sample The second stage sample unit is a Section. At the second stage of sample selection, a PPS sample of 267 Sections was selected from the sampled metropolitan areas; the measure of size was the number of persons aged 15 to 64 in a Section. The sample of 267 Sections consisted of 243 initial Sections and 24 reserve Sections to be used in the event of complete non-response at the Section level.
Third Stage Sample The third stage sample unit is a Block. Within each selected Section, a PPS sample of 4 blocks was selected; the measure of size was the number of persons aged 15 to 64 in a Block. Two sample Blocks were initially activated while the remaining two sample Blocks were reserved for use in cases where there was a refusal to cooperate at the Block level or cases where the block did not belong to the target population (e.g., parks, and commercial and industrial areas).
Fourth Stage Sample The fourth stage sample unit is a Block Segment. Regarding the Block segmentation strategy, the Colombia document 'FINAL SAMPLING PLAN (ARD-397)' states "According to the 2005 population and housing census conducted by DANE, the average number of dwellings per block in the 13 large cities or metropolitan areas was approximately 42 dwellings. Based on this finding, the defined protocol was to report those cases in which 80 or more dwellings were present in a given block in order to partition block using a random selection algorithm." At the fourth stage of sample selection, 1 Block Segment was selected in each selected Block using a simple random sample (SRS) method.
Fifth Stage Sample The fifth stage sample unit is a dwelling. At the fifth stage of sample selection, 5582 dwellings were selected from the sampled Blocks/Block Segments using a simple random sample (SRS) method. According to the Colombia document 'FINAL SAMPLING PLAN (ARD-397)', the selection of dwellings within a participant Block "was performed differentially amongst the different socioeconomic strata that the Colombian government uses for the generation of cross-subsidies for public utilities (in this case, the socioeconomic stratum used for the electricity bill was used). Given that it is known from previous survey implementations that refusal rates are highest amongst households of higher socioeconomic status, the number of dwellings to be selected increased with the socioeconomic stratum (1 being the poorest and 6 being the richest) that was most prevalent in a given block".
Sixth Stage Sample The sixth stage sample unit is a household. At the sixth stage of sample selection, one household was selected in each selected dwelling using an SRS method.
Seventh Stage Sample The seventh stage sample unit was an individual aged 15-64 (inclusive). The sampling objective was to select one individual with equal probability from each selected household.
Sampling methodologies are described for each country in two documents, provided as external resources: (i) the National Survey Design Planning Report (NSDPR) and (ii) the weighting documentation (available for all countries).
Face-to-face [f2f]
The STEP survey instruments include:
All countries adapted and translated both instruments following the STEP technical standards: two independent translators adapted and translated the STEP background questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator.
The survey instruments were piloted as part of the survey pre-test.
The background questionnaire covers such topics as respondents' demographic characteristics, dwelling characteristics, education and training, health, employment, job skill requirements, personality, behavior and preferences, language and family background.
The background questionnaire, the structure of the Reading Literacy Assessment and Reading Literacy Data Codebook are provided in the document "Colombia STEP Skills Measurement Survey Instruments", available in external resources.
STEP data management process:
1) Raw data is sent by the survey firm.
2) The World Bank (WB) STEP team runs data checks on the background questionnaire data. Educational Testing Services (ETS) runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3) The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4) The WB STEP team and ETS check if the data files are clean. This might require additional iterations with the survey firm.
5) Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6) ETS scales the Reading Literacy Assessment data.
7) The WB STEP team merges the background questionnaire data with the Reading Literacy Assessment data and computes derived variables.
Detailed information on data processing in STEP surveys is provided in "STEP Guidelines for Data Processing", available in external resources. The template do-file used by the STEP team to check raw background questionnaire data is provided as an external resource, too.
An overall response rate of 48% was achieved in the Colombia STEP Survey.
A weighting documentation was prepared for each participating country and provides some information on sampling errors. Please refer to the 'STEP Survey Weighting Procedures Summary' provided as an external resource.