Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes random numbers generated through various methods.

Method 1: shuf (https://www.mankier.com/1/shuf)

Commands used to generate the dataset files:

$ shuf -i 1-1000000000 -n1000000 -o random-shuf.txt
$ shuf -i 1-1000000000000 -n1000000 -o random-shuf-1-1000000000000.txt
$ jot -r 1000000 1 1000000000000 > random-jot-1-1000000000000.txt
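For readers without coreutils or BSD jot, a rough Python equivalent of the first shuf command is sketched below (an illustration only, not part of the original tooling; note that shuf -n emits distinct values, while jot -r samples with replacement):

import random

# Sample 1,000,000 distinct integers from 1..10^9
# (mirrors `shuf -i 1-1000000000 -n1000000 -o random-shuf.txt`)
with open("random-shuf.txt", "w") as f:
    for n in random.sample(range(1, 10**9 + 1), 1_000_000):
        f.write(f"{n}\n")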
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The performance of statistical methods is frequently evaluated by means of simulation studies. In the case of network meta-analysis of binary data, however, available data-generating models are restricted to either the inclusion of two-armed trials or the fixed-effect model. Based on data generation in the pairwise case, we propose a framework for the simulation of random-effects network meta-analyses including multi-arm trials with binary outcome. The only common data-generating model directly applicable to a random-effects network setting relies on strongly restrictive assumptions. To overcome these limitations, we modify this approach and derive a related simulation procedure using odds ratios as the effect measure. The performance of this procedure is evaluated with synthetic data and in an empirical example.
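As an illustration of the pairwise building block that such procedures extend, here is a minimal, hedged sketch of simulating two-arm trials under a random-effects model on the odds-ratio scale (parameter names and values are ours, not the paper's):

import numpy as np

rng = np.random.default_rng(1)

def simulate_trial(mu_logor, tau, p_control=0.3, n_per_arm=100):
    """Draw a trial-specific log odds ratio, then binomial arm-level event counts."""
    theta = rng.normal(mu_logor, tau)  # random effect on the log-OR scale
    logit_control = np.log(p_control / (1 - p_control))
    p_treat = 1.0 / (1.0 + np.exp(-(logit_control + theta)))
    return rng.binomial(n_per_arm, p_control), rng.binomial(n_per_arm, p_treat)

# 20 heterogeneous trials comparing one treatment against control
trials = [simulate_trial(mu_logor=0.5, tau=0.2) for _ in range(20)]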
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This publication documents the various datasets generated using the kac_drumset codebase. The aim of kac_drumset is to provide a robust framework for the generation and analysis of arbitrarily shaped drums. The source code for this project is available here: https://github.com/lewiswolf/kac_drumset.
Background
Arbitrarily shaped drums are a strange family of percussion instruments and a wholly metaphysical construction in this contemporary setting. These percussive instruments possess a number of interesting musical characteristics resulting from their particular geometric designs. As it currently stands, these instruments remain largely unexplored throughout musical practice, as they were originally devised as a collection of hypothetical mathematical objects. These datasets serve to sonify these objects so as to explore these conceptual constructions in the audio domain.
Usage
To use these datasets, first install kac_drumset:
pip install "git+https://github.com/lewiswolf/kac_drumset.git#egg=kac_drumset"
And then in python:
from kac_drumset import (
    # methods
    loadDataset,
    transformDataset,
    # classes
    TorchDataset,
)
dataset: TorchDataset = transformDataset(
    # load a dataset (any folder containing a metadata.json)
    loadDataset('absolute/path/to/data'),
    # alter the dataset representation, either as an end2end, fft or mel
    {'output_type': 'end2end'},
)
for i in range(len(dataset)):
    x, y = dataset[i]
    ...
For more details on using kac_drumset, see the project's documentation.
2000 Convex Polygonal Drums of Varying Size
Each sample in this dataset corresponds to a randomly generated convex polygon. The audio for each sample was generated using a two-dimensional physical model of a drum. Each sample is one second long and decays linearly.
Contained in this dataset are ten different sizes of drums - 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6 - each of which is a measure of the longest vertex of each drum in meters. There are 40 different drums sampled for each size. Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the vertices of each polygon, normalised to the unit interval, and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in the metadata.json, and is also made available audibly as a 24-bit .wav and graphically as a .png.
5000 Circular Drums of Varying Size
Each sample in this dataset corresponds to a randomly generated circular drum. The audio for each sample was generated using additive synthesis, inferred using a closed form solution to the two dimensional wave equation. Each sample is one second long and decays exponentially.
Contained in this dataset are 1000 different drums, each determined by a randomly generated size in the range (0.1, 2.0) meters. Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the size of each drum and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in the metadata.json, and is also made available audibly as a 24-bit .wav and graphically as a .png.
5000 Rectangular Drums of Varying Dimension
Each sample in this dataset corresponds to a randomly generated rectangular drum. The audio for each sample was generated using additive synthesis, inferred using a closed form solution to the two dimensional wave equation. Each sample is one second long and decays exponentially.
Contained in this dataset are 1000 different drums, each determined by a randomly generated size in the range (0.1, 2.0) meters and an aspect ratio in the range (0.25, 4.0). Each drum is sampled five times, first by being struck in the geometric centroid, and then by being struck four more times in random locations. This dataset is labelled with the size and aspect ratio of each drum, and the strike location of each sample.
The audio is sampled at 48 kHz, and the default representation is raw audio. Each sample is stored in the metadata.json, and is also made available audibly as a 24-bit .wav and graphically as a .png.
Fun Club Name Generator Dataset
This is a small, handcrafted dataset of random and fun club name ideas. The goal is to help people who are stuck naming something — whether it's a book club, a gaming group, a project, or just a Discord server between friends.
Why this?
A few friends and I spent hours trying to name a casual group — everything felt cringey, too serious, or already taken. We started writing down names that made us laugh, and eventually collected enough to… See the full description on the dataset page: https://huggingface.co/datasets/Laurenfromhere/fun-club-name-generator-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
producing 8 distinct datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Excel sheets in order: The sheet entitled “Hens Original Data” contains the results of an experiment conducted to study the response of laying hens during the initial phase of egg production when subjected to different intakes of dietary threonine. The sheet entitled “Simulated data & fitting values” contains the 10 simulated data sets that were generated using a standard random number generation procedure. The predicted values obtained by the new three-parameter and the conventional four-parameter logistic models also appear in this sheet. (XLSX)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Read this document to understand the minimum technical requirements for random number generators (RNGs) used in gaming-related equipment and systems.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
About
The dataset contains 1000 images, each showing a random shape (out of 17 possible shapes). This dataset was generated using the 3D Shapes Dataset Generator I've developed; feel free to use it.
Label
Column Name | Info
---|---
filename | Name of the image file
shape | Shape index
operation | Operation index
a, b, c, d, e, f, g, h, i, j, k, l | Dimensional parameters
hue, sat, val | HSV values of the color
rot_x, rot_y, rot_z | Euler angles
pos_x, pos_y, pos_z | Position vector
Each row describes the shape shown in one image of the dataset.
Seed

The seed value of the dataset is stored in a txt file and can be used to re-generate the dataset using the tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifacts for the paper titled "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?".
This artifact repository contains 9 compressed folders, as follows:
ID | File Name | Description
---|---|---
1 | syn_circa.zip | CIRCA10 and CIRCA50 datasets for Causal Discovery
2 | syn_rcd.zip | RCD10 and RCD50 datasets for Causal Discovery
3 | syn_causil.zip | CausIL10 and CausIL50 datasets for Causal Discovery
4 | rca_circa.zip | CIRCA10 and CIRCA50 datasets for RCA
5 | rca_rcd.zip | RCD10 and RCD50 datasets for RCA
6 | online-boutique.zip | Online Boutique dataset for RCA
7 | sock-shop-1.zip | Sock Shop 1 dataset for RCA
8 | sock-shop-2.zip | Sock Shop 2 dataset for RCA
9 | train-ticket.zip | Train Ticket dataset for RCA
Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).
Details about the generation of our datasets
We use three different synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node are generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.
2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.
3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator cannot inject faults.

To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seeds, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_circa, syn_rcd) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_circa, rca_rcd) are used to assess RCA methods.
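To make the first mechanism concrete, here is a hedged sketch of a CIRCA-style generator (random DAG, VAR-style data, a fault injected by inflating one node's noise term for two timestamps); it is an illustration under our own assumptions, not the actual CIRCA code:

import numpy as np

rng = np.random.default_rng(0)

def random_dag(n_nodes, n_edges):
    """Random weighted DAG: edges only go from lower to higher node index."""
    adj = np.zeros((n_nodes, n_nodes))
    candidates = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    for i, j in rng.permutation(candidates)[:n_edges]:
        adj[i, j] = rng.uniform(0.2, 0.8)
    return adj

def var_data(adj, T=500, fault_node=None, fault_at=250, noise=0.1, fault_noise=5.0):
    """VAR(1)-style series; the fault inflates one node's noise at two timestamps."""
    n = adj.shape[0]
    X = np.zeros((T, n))
    for t in range(1, T):
        sigma = np.full(n, noise)
        if fault_node is not None and t in (fault_at, fault_at + 1):
            sigma[fault_node] = fault_noise  # injected fault
        X[t] = X[t - 1] @ adj + rng.normal(0.0, sigma)
    return X

X = var_data(random_dag(10, 20), fault_node=3)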
We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 users concurrently. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the Figure below.
Code
The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.
References
As in our paper.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for DynaMath
[💻 Github] [🌐 Homepage] [📖 Preprint Paper]
Dataset Details
🔈 Notice
DynaMath is a dynamic benchmark with 501 seed question generators. This dataset is only a sample of 10 variants generated by DynaMath. We encourage you to use the dataset generator on our GitHub site to generate random datasets for testing.
🌟 About DynaMath
The rapid advancements in Vision-Language Models (VLMs) have shown significant potential in tackling… See the full description on the dataset page: https://huggingface.co/datasets/DynaMath/DynaMath_Sample.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains all data gathered from the experiment, from Wednesday, March 16, 2022 11:58:41.929 AM UTC+0 (1647431921929) until Sunday, April 3, 2022 1:08:35.353 PM UTC+0 (1648991315353). The experiment was executed during physical presence within the Arctic Circle in Tromsø, Norway (69° 40' 53.117'' N, 18° 58' 36.027'' E, at 35 m elevation above sea level). The dataset was gathered with a prototype [1] based on the CREDO Android application [2]. The main research goal is to use Ultra High Energy Cosmic Rays (UHECR) as an entropy source for a Random Bit Generator (RBG).
The associated publication will probably have the title "Accessing Cosmic Radiation as an Entropy Source for a Non-Deterministic Random Number Generator"
In order to reproduce the results, the SQLite3 database "mrng_arctic_experiment_2022.db" is needed. To get the visual representations of the detections, use "image_decoding_and_codesnippets.py" to generate the cleaned (414 detections / ~15 MB) or the uncleaned (5567 detections / ~195 MB) dataset. The compressed folder "raw_data_incl_space_weather.7z" contains all raw data as gathered with the MRNG prototype, unprocessed, uncleaned, and unmerged.
[1] https://github.com/StefanKutschera/mrng-prototype, visited on 27.03.2023
[2] https://github.com/credo-science/credo-detector-android, visited on 27.03.2023
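As a minimal starting point for working with the published database (a sketch; the table layout is not documented in this description, so we only introspect it):

import sqlite3

con = sqlite3.connect("mrng_arctic_experiment_2022.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # inspect before querying, since the schema is not described here
con.close()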
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Healthcare Dataset
Overview
This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.
Data Generation
The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains the results of applying the NIST Statistical Test Suite to accelerometer data processed for random number generator seeding. The NIST Statistical Test Suite can be downloaded from: http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html. The format of the output is explained in http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf.
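For intuition about what the suite reports, below is a minimal sketch of the suite's first test, the Frequency (Monobit) test from SP 800-22; the official NIST C implementation linked above is authoritative:

import math

def monobit_p_value(bits):
    """bits: iterable of 0/1 values. Returns the SP 800-22 monobit p-value."""
    bits = list(bits)
    s = sum(2 * b - 1 for b in bits)        # map 0/1 to -1/+1 and sum
    s_obs = abs(s) / math.sqrt(len(bits))
    return math.erfc(s_obs / math.sqrt(2))  # p >= 0.01 is deemed random

print(monobit_p_value([1, 0, 1, 1, 0, 1, 0, 0] * 128))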
Underpinning data for manuscript entitled "Generation of random numbers by measuring on a silicon-on-insulator chip phase fluctuations from a laser diode"
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization.

Data augmentation techniques apply various transformations to existing data samples to create new ones, including random rotations, translations, scaling, flips, and more. Augmentation helps increase the dataset size, introduce natural variations, and improve model performance by making it more invariant to specific transformations. A sketch of such a transform pipeline is given below.

The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth, etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train a neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
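The sketch below uses torchvision as one possible toolkit (the dataset authors' actual pipeline is not specified here):

from torchvision import transforms

# Random rotations, translations, scaling, and flips, as listed above
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # apply to a PIL image to get a new sample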
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The random number generation algorithm used within the consensus mechanism of blockchain systems may be vulnerable to membership inference attacks, which can reveal features of the algorithm or patterns in the random numbers it generates. To address this issue, the research group proposed a knowledge-distillation-based defense scheme to secure the random number generation algorithm (a sketch of the general distillation technique follows below). The group used a membership inference attack defense strategy dataset, comprising 5 training batches and 1 test batch, to evaluate the proposed scheme; the performance of the defense strategy is assessed by analyzing how machine learning models perform after being subjected to membership inference attacks on this dataset.

Collection plan: the test dataset folder is "Membership Inference Attack Resistance Strategy Dataset/cifar-10-batches-py". CIFAR-10 is a small dataset for recognizing ubiquitous objects, available at http://www.cs.toronto.edu/~kriz/cifar.html. It contains 10 categories of 32 × 32 RGB color images, with 6000 images per category: 50000 training images and 10000 test images in total.

Time and location: this dataset is test data collected by the research unit (Peking University) during 2021.

Equipment: data collection was processed in the following environment. Hardware: general computing platforms such as Intel and ARM. System: Windows 11 and Ubuntu 20.04.
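The description does not publish the scheme itself; as a reference point, a standard knowledge-distillation loss (the general technique named above) looks like this sketch:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL term (temperature-softened) plus hard-label CE term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as in Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard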
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents), together with data from their execution in a state-of-the-art, physically accurate driving simulator called BeamNG.tech. Virtual roads consist of sequences of road points interpolated using cubic splines.
Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze them in the context of test regression. We focus on test selection and test prioritization, given their importance for developing high-quality software following the DevOps paradigms.
This dataset builds on top of our previous work in this area, including work on
test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021),
test selection (SDC-Scissor and its related tool), and
test prioritization (automated test case prioritization for SDCs).
Dataset Overview
The TRAVEL dataset is available under the data folder and is organized as a set of experiments folders. Each of these folders is generated by running the test-generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on generated tests (generation_stats.csv) and found faults (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json).
The following sections describe what each of those files contains.
Experiment Description
The experiment_description.csv contains the settings used to generate the data, including:
Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.
The size of the map. The size of the squared map defines the boundaries, in meters, inside which the virtual roads develop.
The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated testing the BeamNG.AI and the end-to-end Dave2 systems.
The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms, for generating tests.
The speed limit. The maximum speed at which the driving agent under test can travel.
Out of Bound (OOB) tolerance. The test cases' oracle, which defines the tolerable fraction of the ego-car that can lie outside the lane boundaries (a minimal sketch of this oracle follows below). This parameter ranges between 0.0 and 1.0. In the former case, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; in the latter case, a test failure triggers only if the entire body of the ego-car falls outside the lane.
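A minimal sketch of this oracle (names and exact boundary semantics are ours, not the pipeline's):

def test_fails(oob_percentage: float, oob_tolerance: float) -> bool:
    """Fail when the fraction of the ego-car outside the lane exceeds the tolerance."""
    return oob_percentage > oob_tolerance

assert test_fails(0.96, 0.95) and not test_fails(0.30, 0.95)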
Experiment Statistics
The generation_stats.csv contains statistics about the test generation, including:
Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.
Test outcome. The test outcome contains the number of passed tests, failed tests, and tests in error. Passed and failed tests are defined by the OOB tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separate category.
The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.
Test Cases and Executions
Each test..json contains information about a test case and, if the test case is valid, the data observed during its execution as driving simulation.
The data about the test case definition include:
The road points. The list of points in a 2D space that identifies the center of the virtual road, and their interpolation using cubic splines (interpolated_points)
The test ID. The unique identifier of the test in the experiment.
Validity flag and explanation. A flag that indicates whether the test is valid or not, and a brief message describing why the test is not considered valid (e.g., the road contains sharp turns or the road self intersects)
The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.
{ "type": "object", "properties": { "id": { "type": "integer" }, "is_valid": { "type": "boolean" }, "validation_message": { "type": "string" }, "road_points": { §\label{line:road-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "interpolated_points": { §\label{line:interpolated-points}§ "type": "array", "items": { "$ref": "schemas/pair" }, }, "test_outcome": { "type": "string" }, §\label{line:test-outcome}§ "description": { "type": "string" }, "execution_data": { "type": "array", "items": { "$ref" : "schemas/simulationdata" } } }, "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ] }
Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at constant frequency and includes absolute position, rotation, and velocity of the ego-car, its speed in Km/h, and control inputs from the driving agent (steering, throttle, and braking). Additionally, execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much the car is outside the lane).
The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.
{ "$id": "schemas/simulationdata", "type": "object", "properties": { "timer" : { "type": "number" }, "pos" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel" : { "type": "array", "items":{ "$ref" : "schemas/triple" } } "vel_kmh" : { "type": "number" }, "steering" : { "type": "number" }, "brake" : { "type": "number" }, "throttle" : { "type": "number" }, "is_oob" : { "type": "number" }, "oob_percentage" : { "type": "number" } §\label{line:oob-percentage}§ }, "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ] }
Dataset Content
The TRAVEL dataset is a lively initiative so the content of the dataset is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition, and data collected in the context of our recent work on test selection (SDC-Scissor work and tool) and test prioritization (automated test cases prioritization work for SDCs).
SBST CPS Tool Competition Data
The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).
Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject
---|---|---|---|---|---
DEFAULT | 200 × 200 | 120 | 5 (real time) | 0.95 | BeamNG.AI - 0.7
SBST | 200 × 200 | 70 | 2 (real time) | 0.5 | BeamNG.AI - 0.7
Specifically, the TRAVEL dataset contains 8 repetitions for each of the above configurations for each test generator totaling 64 experiments.
SDC Scissor
With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.
Name | Map Size (m x m) | Max Speed (Km/h) | Budget (h) | OOB Tolerance (%) | Test Subject
---|---|---|---|---|---
SDC-SCISSOR | 200 × 200 | 120 | 16 (real time) | 0.5 | BeamNG.AI - 1.5
The dataset contains 9 experiments with the above configuration. For generating your own data with SDC-Scissor follow the instructions in its repository.
Dataset Statistics
Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators, grouped by experiment configuration. Some 25,845 test cases were generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 Km/h) and defining a high OOB tolerance (i.e., 0.95), and we also ran the test generators using a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 Km/h) while setting the OOB tolerance to a lower value (i.e., 0.85). We also collected some 5,971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours using Frenetic as a test generator and defining a more realistic OOB tolerance (i.e., 0.50).
Generating new Data
Generating new data, i.e., test cases, can be done using the SBST CPS Tool Competition pipeline and the driving simulator BeamNG.tech.
Extensive instructions on how to install both pieces of software are reported inside the SBST CPS Tool Competition pipeline documentation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
French Last Names from Death Records (1970-2024)
This dataset contains French last names extracted from death records provided by INSEE (the French National Institute of Statistics and Economic Studies), covering the period from 1970 to September 2024.
Dataset Description
Random name generator demo
Go to https://sctg-development.github.io/french-names-extractor/
Data Source
The data is sourced from INSEE's death records database. It includes last names… See the full description on the dataset page: https://huggingface.co/datasets/eltorio/french_last_names_insee_2024.
Code and experiment results for a synthetic knowledge graph generator. The generator receives a set of rules, each with an expected body support and confidence, and returns a knowledge graph that approximately matches the rules according to their body support and confidence. This code was developed during the Bachelor thesis by Gabriel Glaser, Generating Random Knowledge Graphs from Rules, University of Stuttgart, 2024. Handle 11682/15486.
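A hedged sketch of the core idea (our illustration, not the thesis' algorithm): for a rule body(X, Y) => head(X, Y) with target body support s and confidence c, emit s body facts and let a c-fraction of them also satisfy the head.

import random

def generate_rule_facts(body_pred, head_pred, support, confidence, seed=0):
    rng = random.Random(seed)
    triples = []
    for i in range(support):
        subj, obj = f"e{i}", f"e{i + support}"
        triples.append((subj, body_pred, obj))      # body instantiation
        if rng.random() < confidence:
            triples.append((subj, head_pred, obj))  # head holds as well
    return triples

kg = generate_rule_facts("worksAt", "employedBy", support=100, confidence=0.8)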
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
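A hedged Python sketch of this two-stage design (the authoritative implementation is the distributed R script; stratum sizes here are proxied by EA counts, and each EA is assumed to contain at least 25 households):

import random

def draw_sample(eas_by_stratum, n_households=8000, hh_per_ea=25, seed=42):
    """eas_by_stratum: {stratum: {ea_id: [household ids]}}."""
    rng = random.Random(seed)
    n_eas_total = n_households // hh_per_ea  # 320 EAs overall
    total_eas = sum(len(eas) for eas in eas_by_stratum.values())
    sample = []
    for eas in eas_by_stratum.values():
        # Stage 1: allocate EAs to the stratum proportionally, then select them
        n_eas = round(n_eas_total * len(eas) / total_eas)
        for ea_id in rng.sample(sorted(eas), min(n_eas, len(eas))):
            # Stage 2: 25 households at random within each selected EA
            sample.extend(rng.sample(eas[ea_id], hh_per_ea))
    return sample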
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.