These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount in a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract
We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description
“CWVS_LMC.txt”: This code is delivered to the user as a .txt file containing R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the code in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user as a .txt file containing R statistical software code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Optional Information (complete as necessary)
Required R packages:
• For running “CWVS_LMC.txt”:
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use
Reproducibility (Mandatory)
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study.

How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”

Format: Below is the replication procedure for the attached dataset, for the portion of the analyses that uses a simulated dataset.

Data
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount in a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
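The analysis code accompanying this dataset is written in R, but the week-wise standardization described above is easy to illustrate on its own. The following is a minimal Python sketch of that standardization (not the provided R code); the array shape and variable names are assumptions for illustration only.

```python
import numpy as np

def standardize_exposures(z_raw):
    """Standardize an (individuals x weeks) exposure matrix week by week.

    For each week (column), subtract that week's median exposure and divide by
    that week's interquartile range (IQR), mirroring the standardization
    described for the provided dataset.
    """
    z_raw = np.asarray(z_raw, dtype=float)
    medians = np.median(z_raw, axis=0)                  # per-week medians
    q75, q25 = np.percentile(z_raw, [75, 25], axis=0)
    iqr = q75 - q25                                     # per-week IQRs
    return (z_raw - medians) / iqr

# Toy example: 5 simulated individuals, 3 weeks of pregnancy exposures
z_example = np.random.default_rng(0).gamma(shape=2.0, scale=5.0, size=(5, 3))
print(standardize_exposures(z_example))
```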
I wanted to run data analysis and machine learning on a large dataset to build my data science skills but I felt out of touch with the various datasets available so I thought... how about I try and build my own dataset?
I wondered what data should be in the dataset and settled on online digital game purchases, since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store; this is what I was aiming to replicate.
I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:
- Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, at any time across the 24hr clock
- Geographically: purchases were set up to come from any postcode in the UK - in total this is over 1,000,000 active postcodes
- Purchases: the list of game titles available for purchase is 24
- Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159
To generate the dataset, I built a function in Python. This function, when called with the number of rows you want in your dataset, will generate the dataset. For example, calling function(1000) will provide you with a dataset with 1000 rows.
Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.
Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
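The full generator lives in the GitHub repository mentioned above; purely as a rough illustration of the idea, a stripped-down sketch of such a function might look like the following. The column names and value lists here are placeholders, not the actual titles, banks, or postcodes used in the dataset.

```python
import random
from datetime import datetime, timedelta

import pandas as pd

# Placeholder value lists; the real dataset draws from 24 game titles,
# 159 registered UK banks, and over 1,000,000 active UK postcodes.
GAME_TITLES = ["Game A", "Game B", "Game C"]
STORES = ["PlayStation Store", "Xbox Microsoft Store"]
BANKS = ["Bank A", "Bank B", "Bank C"]
POSTCODES = ["SW1A 1AA", "M1 1AE", "LS1 4DY"]

def generate_purchases(n_rows, start="2021-01-01", end="2022-03-31", seed=42):
    """Generate a synthetic dataset of n_rows digital game purchases."""
    rng = random.Random(seed)
    start_dt = datetime.fromisoformat(start)
    span_seconds = int((datetime.fromisoformat(end) - start_dt).total_seconds())

    rows = []
    for _ in range(n_rows):
        purchase_time = start_dt + timedelta(seconds=rng.randrange(span_seconds))
        rows.append({
            "datetime": purchase_time,
            "store": rng.choice(STORES),
            "game_title": rng.choice(GAME_TITLES),
            "postcode": rng.choice(POSTCODES),
            "bank": rng.choice(BANKS),
        })
    return pd.DataFrame(rows)

# Calling the function with 1000 gives a 1000-row dataset, as described above
print(generate_purchases(1000).head())
```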
Yes, as stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved. Feel free to check out the backlog.
One example of this is how, in various columns, the distribution of the data is uniform, when in fact, for the dataset to be entirely random, this should not be the case. An example of this issue is the Time column. These issues will be resolved in a later update.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here.

Dataset Description
ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, captured in different environments and lighting conditions and at different camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.
Folder configuration
The dataset consists of 3 folders:
* JSON Data: Contains all the generated JSON files.
* RGB Images: Contains the generated RGB images.
* Semantic Segmentation Images: Contains the generated semantic segmentation images.
Essential Terminology
* Annotation: Recorded data describing a single capture.
* Capture: One completed rendering process of a Unity sensor which stores the rendered result to data files (e.g., PNG, JPG, etc.).
* Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
* Ego coordinate system: Coordinates with respect to the ego.
* Global coordinate system: Coordinates with respect to the global origin in Unity.
* Sensor: Device that captures the dataset (in this instance the sensor is a camera).
* Sensor coordinate system: Coordinates with respect to the sensor.
* Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
* UUID: Universally Unique Identifier. It is a unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data
The dataset includes four types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
* id: Integer identifier of the annotation's definition.
* name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
* description: Description of the annotation's specifications.
* format: Format of the file containing the annotation specifications (e.g., json, PNG).
* spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:
* label_id: Integer identifier of a label.
* label_name: String identifier of a label.

KeypointLabeler:
* template_id: Keypoint template UUID.
* template_name: Name of the keypoint template.
* key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  * label: Joint label.
  * index: Joint index.
  * color: RGBA values of the keypoint.
  * color_code: Hex color code of the keypoint.
* skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  * label1: Label of the first joint.
  * label2: Label of the second joint.
  * joint1: Index of the first joint.
  * joint2: Index of the second joint.
  * color: RGBA values of the connection.
  * color_code: Hex color code of the connection.

SemanticSegmentationLabeler:
* label_name: String identifier of a label.
* pixel_value: RGBA values of the label.
* color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:
* id: UUID of the capture.
* sequence_id: UUID of the sequence.
* step: Index of the capture within a sequence.
* timestamp: Timestamp (in ms) since the beginning of a sequence.
* sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  * sensor_id: Sensor UUID.
  * ego_id: Ego UUID.
  * modality: Modality of the sensor (e.g., camera, radar).
  * translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  * rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  * camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  * projection: Projection type used by the camera (e.g., orthographic, perspective).
* ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  * ego_id: Ego UUID.
  * translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  * rotation: Quaternion variable containing the ego's orientation.
  * velocity: 3D vector containing the ego's velocity (in meters per second).
  * acceleration: 3D vector containing the ego's acceleration (in m/s²).
* format: Format of the file captured by the sensor (e.g., PNG, JPG).
* annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  * id: Annotation UUID.
  * annotation_definition: Integer identifier of the annotation's definition.
  * filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  * values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:
* label_id: Integer identifier of a label.
* label_name: String identifier of a label.
* instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
* x: Position of the 2D bounding box on the X axis.
* y: Position of the 2D bounding box on the Y axis.
* width: Width of the 2D bounding box.
* height: Height of the 2D bounding box.

BoundingBox3DLabeler:
* label_id: Integer identifier of a label.
* label_name: String identifier of a label.
* instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
* translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
* size: 3D vector containing the size of the 3D bounding box (in meters).
* rotation: Quaternion variable containing the orientation of the 3D bounding box.
* velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
* acceleration: 3D vector containing the acceleration of the 3D bounding box (in m/s²).

KeypointLabeler:
* label_id: Integer identifier of a label.
* instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
* template_id: UUID of the keypoint template.
* pose: Pose label for that particular capture.
* keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry's contents are as follows:
  * index: Index of the keypoint in the keypoint template file.
  * x: Pixel coordinate of the keypoint on the X axis.
  * y: Pixel coordinate of the keypoint on the Y axis.
  * state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
egos.json: Contains collections of key-value pairs for each ego. These include:
* id: UUID of the ego.
* description: Description of the ego.

sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:
* id: UUID of the sensor.
* ego_id: UUID of the ego on which the sensor is attached.
* modality: Modality of the sensor (e.g., camera, radar, sonar).
* description: Description of the sensor (e.g., camera, radar).
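To give a feel for how these JSON files fit together, here is a short, hedged Python sketch that loads the annotation definitions and walks one captures file to pull out 2D bounding boxes. It assumes the standard Unity Perception layout in which the arrays sit under top-level annotation_definitions and captures keys; the captures file name (captures_000.json) stands in for the captures_xyz.json files described above, and the exact annotation name strings may differ.

```python
import json
from pathlib import Path

json_dir = Path("JSON Data")  # folder name as described above

# Map annotation definition ids to their names (keypoints, bounding box, ...)
with open(json_dir / "annotation_definitions.json") as f:
    definitions = {d["id"]: d["name"] for d in json.load(f)["annotation_definitions"]}

# Walk one captures file and collect the 2D bounding boxes of every capture
with open(json_dir / "captures_000.json") as f:
    captures = json.load(f)["captures"]

for capture in captures:
    for annotation in capture["annotations"]:
        name = definitions.get(annotation["annotation_definition"], "unknown")
        if name == "bounding box":  # BoundingBox2DLabeler output
            for box in annotation.get("values", []):
                print(capture["id"], box["label_name"],
                      box["x"], box["y"], box["width"], box["height"])
```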
Image names
The RGB and semantic segmentation images share the same image naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
* e denotes the id of the environment.
* h denotes the id of the person.
* l denotes the id of the lighting condition.
* d denotes the camera distance at which the image was captured.
* r denotes the camera angle at which the image was captured.
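A tiny helper for decoding this naming convention (a sketch that simply assumes the underscores separate exactly the five fields listed above):

```python
from pathlib import Path

def parse_image_name(path):
    """Split an ActiveHuman image name of the form e_h_l_d_r.jpg into its fields."""
    stem = Path(path).stem
    if stem.startswith("Semantic_"):      # semantic segmentation counterpart
        stem = stem[len("Semantic_"):]
    environment, human, lighting, distance, angle = stem.split("_")
    return {
        "environment": environment,
        "human": human,
        "lighting": lighting,
        "camera_distance": distance,
        "camera_angle": angle,
    }

print(parse_image_name("3_12_2_2.5_130.jpg"))  # hypothetical example file name
```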
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Rule-based Synthetic Data for Japanese GEC.

Dataset Contents: This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:

Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence, while the second is the corresponding correction. These paired sentences are derived from data scraped from the keyword-lookup site
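Since each line holds an erroneous sentence and its correction separated by a tab, loading the corpus is straightforward; a minimal Python sketch:

```python
def load_parallel_corpus(path="synthesized_data.tsv"):
    """Yield (erroneous_sentence, corrected_sentence) pairs from the TSV corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            erroneous, corrected = line.split("\t", 1)
            yield erroneous, corrected

# Print the first three sentence pairs
for i, (source, target) in enumerate(load_parallel_corpus()):
    if i >= 3:
        break
    print(source, "->", target)
```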
https://ega-archive.org/dacs/EGAC00001000514
Please note: This synthetic data set (with cohort “participants” / ”subjects” marked with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or results. The purpose of this dataset is to aid development of technical implementations for cohort data discovery, harmonization, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the CINECA project (funding: EC H2020 grant 825775) is acknowledged. For any questions please contact isuru@ebi.ac.uk or cthomas@ebi.ac.uk
This dataset (CINECA_synthetic_cohort_EUROPE_UK1) consists of 2521 samples which have genetic data based on 1000 Genomes data (https://www.nature.com/articles/nature15393), and synthetic subject attributes and phenotypic data derived from UKBiobank (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779). These data were initially derived using the TOFU tool (https://github.com/spiros/tofu), which generates random values based on the UKBiobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated based on the distribution of values reported by the UK Biobank showcase, and date/time values were random. Additionally, we split the phenotypes and attributes into 4 main classes - general, cancer, diabetes mellitus, and cardiac. We assigned the general attributes to all the samples, and the cardiac / diabetes mellitus / cancer attributes to a proportion of the total samples. Once the initial set of phenotypes and attributes was generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight data, and age at death was calculated from date of death and date of birth. These data were then loaded to the development instance of BioSamples (https://www.ebi.ac.uk/biosamples/), which accessioned each of the samples. The genetic data are derived from the 1000 Genomes Phase 3 release (https://www.internationalgenome.org/category/phase-3/). The genotype data consist of a single joint-call VCF file with genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via plink for these samples and genotypes. The genotype data have had a variety of errors introduced to mimic real data and as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling and low call rates for a randomly chosen subset of sample data, as well as deviations from Hardy-Weinberg equilibrium and low call rates for a random selection of variants. Additionally, 40 samples have raw genetic data available in the form of both bam and cram files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. The genetic data were then linked to the synthetic data in BioSamples, and submitted to EGA.
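As a small illustration of the consistency step described above (dependent attributes recomputed from independently generated ones), here is a hedged Python sketch; the column names are assumptions for illustration and are not the actual UKBiobank-derived field codes used in the dataset.

```python
import pandas as pd

# Hypothetical subject attributes; the real fields follow the UKBiobank data dictionary
subjects = pd.DataFrame({
    "height_cm": [172.0, 165.0, 180.0],
    "weight_kg": [70.0, 82.0, 95.0],
    "date_of_birth": pd.to_datetime(["1950-04-02", "1962-11-20", "1948-07-15"]),
    "date_of_death": pd.to_datetime(["2015-01-10", None, "2010-03-05"]),
})

# BMI recomputed from height and weight, as in the consistency check described above
subjects["bmi"] = subjects["weight_kg"] / (subjects["height_cm"] / 100.0) ** 2

# Age at death derived from date of birth and date of death (missing where still alive)
subjects["age_at_death"] = (
    (subjects["date_of_death"] - subjects["date_of_birth"]).dt.days // 365.25
)

print(subjects)
```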
We include a description of the data sets in the metadata, as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

Format:

Abstract
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount in a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (including data dictionary)
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
We compare the percentages present in the true labels of the real data and in the predicted labels. Analogously, we measure the ratio of samples with a positive label present in the generated synthetic data and in the predicted labels, for datasets generated using distinct synthesizer techniques. Predictions(R) represents the ratio of positive prediction labels in an experiment where a model trained on synthetic data was evaluated on real data, and Predictions(S) the ratio of positive prediction labels in an experiment where a model trained on synthetic data was evaluated on synthetic data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset focuses on person-level fields typical of customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule. Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2
* New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
* Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
* Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
* Improved data validation and formatting for address, email, and date fields.
* Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases)
1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2. Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
3. Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
4. Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format
The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration
The included Python script can be used to fully regenerate the dataset and supports:
* Addition of new duplication rules
* Regional, linguistic, or domain-specific variations
* Volume scaling for large-scale testing scenarios

Files Included
* spider_dataset_v2_6_20251027_022215.csv
* spider_dataset_v2_6_20251027_022215.json
* spider_readme_v2.md
* SPIDER_generation_script_v2.py
* SupportingDocuments/ folder containing:
  * benchmark_comparison_script.py – script used to derive the F1 score.
  * Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
  * ssa_firstnames.csv – Social Security Administration names dataset.
  * simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
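A short, hedged sketch of how the documented linkage fields can be used to build ground-truth matching pairs for benchmarking; it relies only on the columns named above (is_duplicate_of, duplication_rule, cluster_id) and the CSV file listed under Files Included.

```python
import itertools

import pandas as pd

df = pd.read_csv("spider_dataset_v2_6_20251027_022215.csv")

# Duplicate records point back to their base record via is_duplicate_of and
# record the transformation that produced them in duplication_rule
duplicates = df[df["is_duplicate_of"].notna()]
print(duplicates["duplication_rule"].value_counts())

# cluster_id (new in v2) groups each base record with all of its duplicates,
# so every pair of rows inside one cluster is a ground-truth match
true_pairs = []
for _, group in df.groupby("cluster_id"):
    for i, j in itertools.combinations(group.index, 2):
        true_pairs.append((i, j))

print(f"{len(true_pairs)} ground-truth matching record pairs")
```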
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic Passages for MS MARCO Queries
This dataset has been generated as part of a master's thesis. The goal was to investigate how Large Language Models can be used to generate synthetic data for the MS MARCO queries and how they would compare to the original MS MARCO dataset. The code for the data generation as well as the experiments can be found in this repository. The data generation follows an approach where queries are sampled from the MS MARCO training set and then used to… See the full description on the dataset page: https://huggingface.co/datasets/fletcher3/synthetic-passages-msmarco-queries.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Record linkage is a complex process and is utilized to some extent in nearly every organization that works with modern human data records. People create methods for linking records on a case-by-case basis. Some may use basic matching between record 1 and record 2, as seen below,

```python
if (r1.FirstName == r2.FirstName) and (r1.LastName == r2.LastName):
    out.match = True
else:
    out.match = False
```

while others may choose to create more complex decision trees or even machine learning approaches to record linkage.
When people approach record linkage via machine learning (ML), they can match on a variety of fields, typically dependent on the forms used to collect data. While these ML-utilized fields can vary on an organization-to-organization level, there are several fields that appear more frequently than others. They are as follows:
* First Name
* Middle Name
* Last Name
* Date of Birth
* Sex at Birth
* Race
* SSN (maybe)
* Address house
* Address zip
* Address city
* Address county
* Address state
* Phone
* Email
* Date Data Submitted
By comparing two records on all of these fields, ML record linkage models use complex logic to make the "Yes" or "No" decision on whether 2 records reflect the same individual. Record linkage can become difficult when individuals change addresses, adopt a new last name, erroneously fill out data, or have information that closely resembles another individual's (ex: twins).
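To make the ML framing above concrete, here is a hedged sketch that turns a pair of records into a field-by-field comparison vector a classifier could consume. The field names loosely follow the list above and the similarity measure is deliberately simple; real pipelines typically use stronger string metrics (e.g., Jaro-Winkler) and richer features.

```python
from difflib import SequenceMatcher

FIELDS = ["FirstName", "MiddleName", "LastName", "DateOfBirth", "SexAtBirth",
          "AddressZip", "AddressState", "Phone", "Email"]

def field_similarity(a, b):
    """Crude similarity in [0, 1]; missing values are treated as non-informative."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def comparison_vector(r1, r2):
    """Field-wise similarity features for a candidate record pair."""
    return [field_similarity(r1.get(f), r2.get(f)) for f in FIELDS]

r1 = {"FirstName": "Jon", "LastName": "Smith", "DateOfBirth": "1985-02-11",
      "AddressZip": "98101", "AddressState": "WA"}
r2 = {"FirstName": "John", "LastName": "Smith", "DateOfBirth": "1985-02-11",
      "AddressZip": "98101", "AddressState": "WA"}

features = comparison_vector(r1, r2)
print(dict(zip(FIELDS, features)))
# A trained classifier (decision tree, gradient boosting, ...) would map this
# feature vector to the final "Yes"/"No" match decision.
```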
As described above, record linkage can have many complex elements to it.
Consider a situation where you are manually reviewing 2 records.
These two records only contain basic information on the individuals and you are tasked to decide if Record #1 and Record #2 belong to the same person.
| Record # | First Name | Last Name | Sex | DOB | Address | Address ZIP | Address State | Date Received |
|---|---|---|---|---|---|---|---|---|
| 1 | Wanda | Smith | F | 1992-09-13 | 1768 Walker Rd. Unit 209 | 99301 | WA | 2015-03-01 |
| 2 | Wanda | Turner | F | 1992-09-13 | 4545 Pennsylvania Ct. | 98682 | WA | 2021-06-30 |
At a glance, these records are significantly different and you should therefore mark them as different persons. For the purposes of record linkage manual review, you probably made the correct decision. After all, for record linkage, most models prefer False Negatives to False Positives.
When groups validate record linkage models, they often turn to manually reviewed record comparisons as their "gold standard". There are two separate goals for record linkage that I would like us to consider:
1. Creating a model that simulates a human's decision-making process
2. Creating a model that seeks a deeper record equality "Truth"
I believe many groups aim for and are content with accomplishing goal #1. That approach is inarguably useful. However, I believe that it can be harmful because of the biases it introduces. For example, it is biased against people who adopt new last names upon marriages/civil unions (more often "Female" Sex at Birth). Models that are biased against non-American names can also produce high validation marks, but are flawed nonetheless. Consider the 2 records displayed earlier. There is a real chance that Wanda adopted a new last name and moved during the 6 years between when the two records were collected.
Without relevant documentation (birth, marriage, ... , housing records), we have no way of knowing whether or not "Wanda Smith" is the same person as "Wanda Turner". It follows that treating manual review as a "gold standard" fails to completely support goal #2.
We hope to create a simulated society that can be used as absolute truth. The simulated society will be built to reflect the population of Washington State. It will have a relational-database-type structure with tables containing all relevant supporting structures for record linkage, such as:
* birth records
* partnership (marriage) records
* moving records
We hope to create a society with representative and diverse names, representative demographic breakdowns, and representative geographic population densities. The structure of the database will allow for "Time Travel" queries that allow a user to capture all data from a specific year in time.
By creating a simulated society, we will have absolute truth in determining whether record1 = record2. This approach will give us an opportunity to assess record linkage models considering goal #2.
After we wrap up this work, we will work on processes/functions for simulating human error in record filling, as well as functions to help support bias recognition when using our dataset as a test set.
PJ Gibson - peter.gibson@doh.wa.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied between 10/1 and 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates a real-world binary classification problem where the goal is to predict whether a person is fit (is_fit = 1) or not fit (is_fit = 0) based on various health and lifestyle features.
The dataset contains 2000 samples with a mixture of numerical and categorical features, some of which include noisy, inconsistent, or missing values to reflect real-life data challenges. This design enables users, especially beginners, to practice data preprocessing, feature engineering, and building classification models such as neural networks.
Features have both linear and non-linear relationships with the target variable. Some features have complex interactions and the target is generated using a sigmoid-like function with added noise, making it a challenging but realistic task. The dataset also includes mixed data types (e.g., the "smokes" column contains both numeric and string values) and some outliers are present.
This dataset is ideal for users wanting to improve skills in cleaning messy data, encoding categorical variables, handling missing values, detecting outliers, and training classification models including neural networks.
| Column Name | Description |
|---|---|
| age | Age of the individual in years (integer) |
| height_cm | Height in centimeters (integer) |
| weight_kg | Weight in kilograms (integer, contains some outliers) |
| heart_rate | Resting heart rate in beats per minute (float) |
| blood_pressure | Systolic blood pressure in mmHg (float) |
| sleep_hours | Average hours of sleep per day (float, may contain NaNs) |
| nutrition_quality | Daily nutrition quality score between 0 and 10 (float) |
| activity_index | Physical activity level score between 1 and 5 (float) |
| smokes | Smoking status (mixed types: 0, 1, "yes", "no") |
| gender | Gender of individual, either 'M' or 'F' |
| is_fit | Target variable: 1 if the person is fit, 0 otherwise |
This dataset intentionally includes several data quality issues to simulate real-world scenarios: missing values (e.g., in sleep_hours), mixed data types (the smokes column), and outliers (e.g., in weight_kg), as described above.
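A minimal cleaning sketch for the issues just mentioned (mixed types in smokes, missing sleep_hours, outliers in weight_kg); the file name is an assumption, not part of the dataset description.

```python
import pandas as pd

df = pd.read_csv("fitness_dataset.csv")  # assumed file name

# Normalise the mixed-type smokes column (0, 1, "yes", "no") to 0/1
df["smokes"] = (
    df["smokes"].astype(str).str.strip().str.lower()
      .map({"1": 1, "1.0": 1, "yes": 1, "0": 0, "0.0": 0, "no": 0})
)

# Impute missing sleep_hours with the median
df["sleep_hours"] = df["sleep_hours"].fillna(df["sleep_hours"].median())

# Flag weight outliers with a simple IQR rule
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
df["weight_outlier"] = (df["weight_kg"] < q1 - 1.5 * iqr) | (df["weight_kg"] > q3 + 1.5 * iqr)

# One-hot encode gender and split features / target for model training
X = pd.get_dummies(df.drop(columns=["is_fit"]), columns=["gender"], drop_first=True)
y = df["is_fit"]
print(X.shape, y.mean())
```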
Due to the synthetic nature and intentional noise, expect:
* Baseline accuracy: ~60% (majority class)
* Good models: 75-85% accuracy
* Excellent models: 85-90% accuracy
The dataset is designed to be challenging but achievable, making it perfect for learning and experimentation.
This dataset is provided under the CC0 Public Domain license, making it suitable for educational and research purposes without restrictions.
This is a synthetic dataset created for educational purposes. It does not contain real personal health information and is designed to help users practice data science skills in a safe, privacy-compliant environment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Surrogate optimisation holds great promise for building energy optimisation studies because it aims to replace lengthy building energy simulations within an optimisation step with expendable local surrogate models that can quickly predict simulation results. To be useful for such a purpose, it should be possible to quickly train precise surrogate models from a small number of simulation results (10–100) obtained from appropriately sampled points in the desired part of the design space. Two sampling methods and two machine learning models are compared here. Latin hypercube sampling (LHS), widely accepted in the building energy community, is compared to an exploratory Monte Carlo-based sequential design method, mc-intersite-proj-th (MIPT). Artificial neural networks (ANN), also widely accepted in the building energy community, are compared to gradient-boosted tree ensembles (XGBoost), the model of choice in many machine learning competitions. In order to get a better understanding of the behaviour of these two sampling methods and two machine learning models, we compare their predictions against a large set of generated synthetic data. For this purpose, a simple case study of an office cell model with a single window and a fixed overhang, whose main input parameters are overhang depth and height, while climate type, presence of obstacles, orientation and heating and cooling set points are additional input parameters, was extensively simulated with EnergyPlus, to form a large underlying dataset of 729,000 simulation results. Expendable local surrogate models for predicting simulated heating, cooling and lighting loads and equivalent primary energy needs of the office cell were trained using both LHS and MIPT and both ANN and XGBoost for several main hyperparameter choices. Results show that XGBoost models are more precise than ANN models, and that for both machine learning models, the use of MIPT sampling leads to more precise surrogates than LHS.
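As a rough illustration of the workflow compared in this study (sample design points, run the expensive simulation, fit a surrogate), here is a hedged Python sketch using Latin hypercube sampling and a gradient-boosted tree model. The toy objective function stands in for an EnergyPlus run and is not the office-cell model from the paper.

```python
import numpy as np
from scipy.stats import qmc
from xgboost import XGBRegressor

def expensive_simulation(x):
    """Toy stand-in for an EnergyPlus run; x = (overhang depth, overhang height)."""
    depth, height = x
    return np.sin(3 * depth) * np.cos(2 * height) + 0.1 * depth * height

# Latin hypercube sample of 50 design points in the unit-square design space
sampler = qmc.LatinHypercube(d=2, seed=0)
X_train = sampler.random(n=50)
y_train = np.array([expensive_simulation(x) for x in X_train])

# Fit a gradient-boosted tree surrogate (XGBoost) on the simulated results
surrogate = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
surrogate.fit(X_train, y_train)

# Check surrogate precision on unseen points in the design space
X_test = np.random.default_rng(1).random((1000, 2))
y_true = np.array([expensive_simulation(x) for x in X_test])
rmse = np.sqrt(np.mean((surrogate.predict(X_test) - y_true) ** 2))
print(f"Surrogate RMSE on held-out points: {rmse:.4f}")
```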
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
FLUXSynID: A Synthetic Face Dataset with Document and Live Images

FLUXSynID is a high-resolution synthetic identity dataset containing 14,889 unique synthetic identities, each represented through a document-style image and three live capture variants. Identities are generated using the FLUX.1 [dev] diffusion model, guided by user-defined identity attributes such as gender, age, region of origin, and various other identity features. The dataset was created to support biometric research, including face recognition and morphing attack detection.

File Structure
Each identity has a dedicated folder (named as a 12-digit hex string, e.g., 000e23cdce23) containing the following 5 files:
* 000e23cdce23_f.json — metadata including sampled identity attributes, prompt, generation seed, etc. (_f = female; _m = male; _nb = non-binary)
* 000e23cdce23_f_doc.png — document-style frontal image
* 000e23cdce23_f_live_0_e_d1.jpg — live image generated with LivePortrait (_e = expression and pose)
* 000e23cdce23_f_live_0_a_d1.jpg — live image via Arc2Face (_a = arc2face)
* 000e23cdce23_f_live_0_p_d1.jpg — live image via PuLID (_p = pulid)

All document and LivePortrait/PuLID images are 1024×1024. Arc2Face images are 512×512 due to original model constraints.

Attribute Sampling and Prompting
The attributes/ directory contains all information about how identity attributes were sampled:
* A set of .txt files (e.g., ages.txt, eye_shape.txt, body_type.txt) — each lists the possible values for one attribute class, along with their respective sampling probabilities.
* file_probabilities.json — defines the inclusion probability for each attribute class (i.e., how likely a class such as "eye shape" is to be included in a given prompt).
* attribute_clashes.json — specifies rules for resolving semantically conflicting attributes. Each clash defines a primary attribute (to be kept) and secondary attributes (to be discarded when the clash occurs).

Prompts are generated automatically using the Qwen2.5 large language model, based on the selected attributes, and used to condition FLUX.1 [dev] during image generation.

Live Image Generation
Each synthetic identity has three live image-style variants:
* LivePortrait: expression/pose changes via keypoint-based retargeting
* Arc2Face: natural variation using identity embeddings (no prompt required)
* PuLID: identity-aware generation using prompt, embedding, and edge-conditioning with a customized FLUX.1 [dev] diffusion model

These approaches provide both controlled and naturalistic identity-consistent variation.

Filtering and Quality Control
Included are 9 supplementary text files listing filtered subsets of identities. For instance, the file similarity_filtering_adaface_thr_0.333987832069397_fmr_0.0001.txt contains identities retained after filtering out overly similar faces using the AdaFace FRS under the specified threshold and false match rate (FMR).

Usage and Licensing
This dataset is licensed under the Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0) license. You are free to use, share, and adapt the dataset for non-commercial purposes, provided that appropriate credit is given. The images in this dataset were generated using the FLUX.1 [dev] model by Black Forest Labs, which is made available under their Non-Commercial License. While this dataset does not include or distribute the model or its weights, the images were produced using that model. Users are responsible for ensuring that their use of the images complies with the FLUX.1 [dev] license, including any restrictions it imposes.
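A small, hedged Python sketch for walking the identity folders described above and pairing each document image with its live-capture variants; the dataset root path is an assumption.

```python
import json
from pathlib import Path

dataset_root = Path("FLUXSynID")  # assumed location of the identity folders

for identity_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    meta_path = next(identity_dir.glob("*.json"), None)    # e.g. 000e23cdce23_f.json
    doc_path = next(identity_dir.glob("*_doc.png"), None)  # document-style image
    if meta_path is None or doc_path is None:
        continue

    with open(meta_path) as f:
        attributes = json.load(f)  # sampled attributes, prompt, generation seed, ...

    # Live variants: _e = LivePortrait, _a = Arc2Face, _p = PuLID
    live_paths = {
        suffix: next(identity_dir.glob(f"*_live_0_{suffix}_d1.jpg"), None)
        for suffix in ("e", "a", "p")
    }
    print(identity_dir.name, doc_path.name,
          [p.name for p in live_paths.values() if p is not None])
```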
Acknowledgments The FLUXSynID dataset was developed under the EINSTEIN project. The EINSTEIN project is funded by the European Union (EU) under G.A. no. 101121280 and UKRI Funding Service under IFS reference 10093453. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect the views of the EU/Executive Agency or UKRI. Neither the EU nor the granting authority nor UKRI can be held responsible for them.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Focused ion beam (FIB) tomography is a destructive technique used to collect three-dimensional (3D) structural information at a resolution of a few nanometers. For FIB tomography, a material sample is degraded by layer-wise milling. After each layer, the current surface is imaged by a scanning electron microscope (SEM), providing a consecutive series of cross-sections of the three-dimensional material sample. Especially for nanoporous materials, the reconstruction of the 3D microstructure of the material, from the information collected during FIB tomography, is impaired by the so-called shine-through effect. This effect prevents a unique mapping between voxel intensity values and material phase (e.g., solid or void). It often substantially reduces the accuracy of conventional methods for image segmentation. Here we demonstrate how machine learning can be used to tackle this problem. A bottleneck in doing so is the availability of sufficient training data. To overcome this problem, we present a novel approach to generate synthetic training data in the form of FIB-SEM images generated by Monte Carlo simulations. Based on this approach, we compare the performance of different machine learning architectures for segmenting FIB tomography data of nanoporous materials. We demonstrate that two-dimensional (2D) convolutional neural network (CNN) architectures processing a group of adjacent slices as input data as well as 3D CNN perform best and can enhance the segmentation performance significantly.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The data set corresponds to the two examples in the paper "History matching production data and uncertainty assessment with an efficient TSVD parameterization algorithm". Files in Example 1 contain 200 realizations of a subsurface reservoir of dimension 30x30x3, where each layer has a different covariance matrix. Each realization includes an isotropic horizontal log-permeability field, a vertical log-permeability field, and a porosity field. The true permeability and porosity fields and simulation parameters (e.g., well locations and controls) are also included. The goal is to generate synthetic observed data from the true model, and then apply the algorithm to update/calibrate each of the realizations such that they match production data to within some tolerance. Example 2 contains 1500 realizations of a 28x30x3 model. This example applies ensemble-based regularization to estimate the covariance matrix within the model calibration/estimation framework.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed itself is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root cause channel.
Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).
Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect for human eyes (i.e., there are very large spikes or oscillations), and hence also detectable for most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light-and-switch pair: the light being on or off is nominal, as is the switch, but the switch being on while the light is off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test how useful context is for detecting anomalies in this simulation.
Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users can add any type of noise, at any amplitude, on top of the provided series. This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
No missing data. You can drop whichever values you want to assess the impact of missing data on your detector, relative to a clean baseline.
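A minimal loading sketch, in Python, for the nominal/anomalous split described above. The file name and column layout are assumptions for illustration (consult the dataset's own documentation for the actual schema); the sketch also shows the optional noise injection mentioned in the pure-signal property.

import numpy as np
import pandas as pd

# Hypothetical file name; the dataset ships as both CSV and Parquet.
df = pd.read_parquet("cats_data.parquet")

# The first 1 million timestamps are documented as purely nominal: use them to
# learn "normal" behaviour (semi-supervised / novelty-detection setting).
train = df.iloc[:1_000_000]
evaluate = df.iloc[1_000_000:]  # remaining 4 million rows contain the 200 anomalous segments

# The signals are provided noise-free; optionally add Gaussian noise of a chosen
# amplitude to study robustness. Column selection assumes a "timestamp" column
# plus the 17 signal columns, which may differ from the real header names.
noise_amplitude = 0.01
signal_cols = [c for c in df.columns if c != "timestamp"]
noisy_eval = evaluate.copy()
noisy_eval[signal_cols] += np.random.default_rng(0).normal(
    0.0, noise_amplitude, size=noisy_eval[signal_cols].shape
)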
Change Log
Version 2
Metadata: we include a metadata.csv (a loading sketch follows this change log) with information about:
Anomaly categories
Root cause channel (signal in which the anomaly is first visible)
Affected channels (signals into which the anomaly might propagate through coupled system dynamics)
Removal of anomaly overlaps: version 1 contained anomalies that overlapped with each other, resulting in only 190 distinct anomalous segments. There are no longer any anomaly overlaps.
Two data files: CSV and parquet for convenience.
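A minimal sketch, in Python, for inspecting the metadata described above. The file name matches the change log, but the column names used for grouping are assumptions and may differ from the actual header.

import pandas as pd

meta = pd.read_csv("metadata.csv")
print(meta.columns.tolist())  # check the real column names first

# Example use: count anomalous segments per category and per root-cause channel,
# e.g., to stratify benchmark scores. Column names below are assumed.
print(meta.groupby("category").size())
print(meta.groupby("root_cause_channel").size())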
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
License/Data Use Agreement: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, the dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summary. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thereby enabling potential improvements to real-time decision-making processes in clinical settings. Additionally, the dataset is accompanied by code that facilitates question-and-answer pair generation from any medical or non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6,000 input tokens, owing to hardware constraints. Because the question-and-answer pairs are generated by a large language model, they may contain underlying biases or lack diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the use of more powerful large language models.
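As a rough illustration of this kind of prompt-based generation, and not the authors' actual pipeline, the Python sketch below asks an instruction-tuned Llama 2 model, via the Hugging Face transformers library, to produce question-and-answer pairs from a single note. The model ID, prompt wording, and generation settings are assumptions.

from transformers import pipeline

# Gated model: requires approved access on Hugging Face and the accelerate package
# for device_map="auto". The model choice here is an assumption, not the authors' setup.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-13b-chat-hf",
    device_map="auto",
)

def generate_qa_pairs(discharge_summary: str, n_pairs: int = 5) -> str:
    """Ask the model for Q&A pairs grounded only in the given note."""
    prompt = (
        f"Read the following discharge summary and write {n_pairs} "
        "question-and-answer pairs that can be answered using only the "
        "information in the text.\n\n"
        f"Discharge summary:\n{discharge_summary}\n\nQ&A pairs:\n"
    )
    out = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    return out[0]["generated_text"][len(prompt):]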
According to our latest research, the global synthetic benchmarking for AI market size has reached USD 1.42 billion in 2024. The market is expected to witness robust growth, registering a compound annual growth rate (CAGR) of 24.6% during the forecast period. By 2033, the global synthetic benchmarking for AI market is projected to attain a value of USD 11.5 billion. This accelerated growth is largely driven by the increasing complexity of AI models, the critical demand for standardized performance evaluation, and the rapid adoption of AI across diverse industries.
The primary growth factor fueling the synthetic benchmarking for AI market is the exponential rise in AI adoption across sectors such as healthcare, finance, automotive, and manufacturing. As organizations deploy increasingly complex AI models, the need for reliable, repeatable, and scalable benchmarking solutions has become paramount. Traditional benchmarking methods often fall short in capturing the nuanced performance metrics required by modern AI workloads, leading to a surge in demand for synthetic benchmarking tools. These tools allow enterprises to simulate diverse scenarios, stress-test AI systems, and ensure optimal performance before deployment, thereby mitigating risks and maximizing return on investment.
Another significant driver is the evolution of AI hardware and software ecosystems. The proliferation of specialized AI chips, edge AI devices, and cloud-based AI platforms has created a fragmented landscape where performance can vary dramatically depending on deployment context. Synthetic benchmarking offers a standardized approach to measure and compare the efficiency, throughput, and scalability of AI solutions across heterogeneous environments. This capability is especially critical for enterprises seeking to make informed procurement decisions, optimize resource allocation, and maintain competitive advantage in a rapidly changing technological landscape.
Furthermore, regulatory and security considerations are propelling market growth. With increasing scrutiny on AI transparency, fairness, and security, organizations are leveraging synthetic benchmarking to validate model robustness, detect vulnerabilities, and ensure compliance with industry standards. Synthetic benchmarks can simulate adversarial attacks, test data privacy mechanisms, and assess the resilience of AI systems under various threat scenarios. This proactive approach not only enhances trust in AI deployments but also reduces the likelihood of costly failures or compliance breaches, thus fostering greater adoption of benchmarking solutions.
From a regional perspective, North America currently dominates the synthetic benchmarking for AI market, accounting for over 38% of the global revenue in 2024. This leadership is attributed to the region's early adoption of AI, strong presence of leading technology providers, and significant investments in research and development. Asia Pacific is emerging as the fastest-growing region, fueled by rapid digital transformation, government initiatives to promote AI innovation, and the expansion of AI applications in industries such as manufacturing and healthcare. Europe follows closely, leveraging its robust regulatory framework and focus on ethical AI development. Collectively, these regional dynamics underscore the global nature of the synthetic benchmarking for AI market and its critical role in enabling the next wave of AI-driven transformation.
Within the synthetic benchmarking for AI market, the component segment is categorized into software, hardware, and services. Software solutions form the backbone of synthetic benchmarking, providing the algorithms, frameworks, and user interfaces necessary for simulating workloads, generating synthetic data, and analyzing performance metrics. The demand for sophisticated software tools has surged as AI models grow in complexity and organizations seek granular …
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a synthetic representation of user behavior on a fictional dating app. It contains 50,000 records with 19 features capturing demographic details, app usage patterns, swipe tendencies, and match outcomes. The data was generated programmatically to simulate realistic user interactions, making it ideal for exploratory data analysis (EDA), machine learning modeling (e.g., predicting match outcomes), or studying user behavior trends in online dating platforms.
Key features include gender, sexual orientation, location type, income bracket, education level, user interests, app usage time, swipe ratios, likes received, mutual matches, and match outcomes (e.g., "Mutual Match," "Ghosted," "Catfished"). The dataset is designed to be diverse and balanced, with categorical, numerical, and labeled variables for various analytical purposes.
This dataset can be used for:
Exploratory Data Analysis (EDA): Investigate correlations between demographics, app usage, and match success.
Machine Learning: Build models to predict match outcomes or user engagement levels (a minimal modeling sketch follows this list).
Social Studies: Analyze trends in dating app behavior across different demographics.
Feature Engineering Practice: Experiment with transforming categorical and numerical data.
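As a rough illustration of the machine-learning use case, the Python sketch below fits a baseline classifier to predict the match outcome. The file name and the label column name are assumptions, and the preprocessing simply one-hot encodes whatever categorical columns are present.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file and label names; check the dataset's actual schema first.
df = pd.read_csv("dating_app_behavior.csv")
target = "match_outcome"

X, y = df.drop(columns=[target]), df[target]
categorical = X.select_dtypes(include="object").columns.tolist()
numerical = X.select_dtypes(exclude="object").columns.tolist()

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numerical),
    ])),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")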