Creating a robust employee dataset for data analysis and visualization involves several key fields that capture different aspects of an employee's information. Here's a list of fields you might consider including:

Employee ID: A unique identifier for each employee.
Name: First name and last name of the employee.
Gender: Male, female, non-binary, etc.
Date of Birth: Birthdate of the employee.
Email Address: Contact email of the employee.
Phone Number: Contact number of the employee.
Address: Home or work address of the employee.
Department: The department the employee belongs to (e.g., HR, Marketing, Engineering).
Job Title: The specific job title of the employee.
Manager ID: ID of the employee's manager.
Hire Date: Date when the employee was hired.
Salary: Employee's salary or compensation.
Employment Status: Full-time, part-time, contractor, etc.
Employee Type: Regular, temporary, contract, etc.
Education Level: Highest level of education attained by the employee.
Certifications: Any relevant certifications the employee holds.
Skills: Specific skills or expertise possessed by the employee.
Performance Ratings: Ratings or evaluations of employee performance.
Work Experience: Previous work experience of the employee.
Benefits Enrollment: Information on benefits chosen by the employee (e.g., healthcare plan, retirement plan).
Work Location: Physical location where the employee works.
Work Hours: Regular working hours or shifts of the employee.
Employee Status: Active, on leave, terminated, etc.
Emergency Contact: Contact information of the employee's emergency contact person.
Employee Satisfaction Survey Responses: Data from employee satisfaction surveys, if applicable.
Code Url: https://github.com/intellisenseCodez/faker-data-generator
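For illustration, a minimal sketch of generating a few of these fields with the Faker library is shown below. The field selection, value ranges, and output filename are assumptions for this example, not taken from the linked repository.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_employees = 100  # illustrative size

departments = ["HR", "Marketing", "Engineering", "Finance"]  # assumed categories
statuses = ["Full-time", "Part-time", "Contractor"]

rows = []
for emp_id in range(1, num_employees + 1):
    rows.append({
        "Employee ID": emp_id,
        "Name": fake.name(),
        "Date of Birth": fake.date_of_birth(minimum_age=21, maximum_age=65),
        "Email Address": fake.email(),
        "Department": random.choice(departments),
        "Job Title": fake.job(),
        "Hire Date": fake.date_between(start_date="-10y", end_date="today"),
        "Salary": round(random.uniform(30_000, 150_000), 2),
        "Employment Status": random.choice(statuses),
    })

df = pd.DataFrame(rows)
df.to_csv("employee_dataset.csv", index=False)
```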
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
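As a quick illustration, the format guarantees above could be checked with the Hugging Face datasets library roughly as follows. The dataset identifier is a placeholder, not the actual Hub name.

```python
from datasets import load_dataset

# Placeholder identifier -- substitute the actual Hub name of this dataset.
ds = load_dataset("your-username/stackoverflow-chat-dutch", split="train")

for example in ds:
    text = example["input"]
    # Each item starts with the fixed Dutch opening sentence and contains
    # both speaker identifiers at least once, as described above.
    assert text.startswith("Het gesprek tussen de mens en de AI-assistent.")
    assert "[|AI|]" in text and "[|Human|]" in text
```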
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
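The original translation script is not reproduced here, but a minimal sketch of such a call with the OpenAI Python client could look like the following. This uses the current client syntax (the original work likely used an earlier client version), and the helper function is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "to the requirements that are given to you."
)

# The full template quoted above; truncated here for brevity.
CONVERSATION_TRANSLATION_PROMPT = "You are asked to translate a conversation ... {src_lang} ... {tgt_lang} ..."

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```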
Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
Description: This dataset is created solely for the purpose of practice and learning. It contains entirely fake and fabricated information, including names, phone numbers, emails, cities, ages, and other attributes. None of the information in this dataset corresponds to real individuals or entities. It serves as a resource for those who are learning data manipulation, analysis, and machine learning techniques. Please note that the data is completely fictional and should not be treated as representing any real-world scenarios or individuals.
Attributes:
- phone_number: Fake phone numbers in various formats.
- name: Fictitious names generated for practice purposes.
- email: Imaginary email addresses created for the dataset.
- city: Made-up city names to simulate geographical diversity.
- age: Randomly generated ages for practice analysis.
- sex: Simulated gender values (Male, Female).
- married_status: Synthetic marital status information.
- job: Fictional job titles for practicing data analysis.
- income: Fake income values for learning data manipulation.
- religion: Pretend religious affiliations for practice.
- nationality: Simulated nationalities for practice purposes.
Please be aware that this dataset is not based on real data and should be used exclusively for educational purposes.
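A brief, hedged sketch of how a table with these attributes could be produced with Faker plus random choices; the category lists and numeric ranges below are illustrative assumptions, not the values used to build this dataset.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
n = 1000  # illustrative size

records = [{
    "phone_number": fake.phone_number(),
    "name": fake.name(),
    "email": fake.email(),
    "city": fake.city(),
    "age": random.randint(18, 80),
    "sex": random.choice(["Male", "Female"]),
    "married_status": random.choice(["Single", "Married", "Divorced", "Widowed"]),
    "job": fake.job(),
    "income": round(random.uniform(15_000, 200_000), 2),
    "religion": random.choice(["Christianity", "Islam", "Hinduism", "Buddhism", "None"]),
    "nationality": fake.country(),
} for _ in range(n)]

df = pd.DataFrame(records)
```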
In situations where data is not readily available but needed, you'll have to resort to building up the data yourself. There are many methods you can use to acquire this data, from web scraping to APIs. But sometimes you'll end up needing to create fake or "dummy" data. Dummy data is useful when you know the exact features you'll be using and the data types involved, but you just don't have the data itself.
Features Description
Reference - https://towardsdatascience.com/build-a-your-own-custom-dataset-using-python-9296540a0178
Dummy Dataset for AutoTrain Benchmark
This dataset contains dummy data that's needed to create AutoTrain projects for benchmarks like RAFT. See here for more details.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file; "Simulated_Dataset.RData".

Metadata (data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize and plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
• "CWVS_LMC.txt": This code is delivered as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, it can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
• "Results_Summary.txt": This code is also delivered as a .txt file containing R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, it can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
• For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study. How to use the information:
• Load the "Simulated_Dataset.RData" workspace
• Run the code contained in "CWVS_LMC.txt"
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

Below is the replication procedure for the portion of the analyses using a simulated dataset.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics, Oxford University Press, Oxford, UK, 1-30 (2019).
https://spdx.org/licenses/CC0-1.0.html
Social networks are tied to population dynamics; interactions are driven by population density and demographic structure, while social relationships can be key determinants of survival and reproductive success. However, difficulties integrating models used in demography and network analysis have limited research at this interface. We introduce the R package genNetDem for simulating integrated network-demographic datasets. It can be used to create longitudinal social networks and/or capture-recapture datasets with known properties. It incorporates the ability to generate populations and their social networks, generate grouping events using these networks, simulate social network effects on individual survival, and flexibly sample these longitudinal datasets of social associations. By generating co-capture data with known statistical relationships, it provides functionality for methodological research. We demonstrate its use with case studies testing how imputation and sampling design influence the success of adding network traits to conventional Cormack-Jolly-Seber (CJS) models. We show that incorporating social network effects in CJS models generates qualitatively accurate results, but with downward-biased parameter estimates when network position influences survival. Biases are greater when fewer interactions are sampled or fewer individuals are observed in each interaction. While our results indicate the potential of incorporating social effects within demographic models, they show that imputing missing network measures alone is insufficient to accurately estimate social effects on survival, pointing to the importance of incorporating network imputation approaches. genNetDem provides a flexible tool to aid these methodological advancements and help researchers test other sampling considerations in social network studies.

Methods: The dataset and code stored here are for Case Studies 1 and 2 in the paper. Datasets were generated using simulations in R. Here we provide 1) the R code used for the simulations; 2) the simulation outputs (as .RDS files); and 3) the R code to analyse simulation outputs and generate the tables and figures in the paper.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques involve applying various transformations to existing data samples to create new ones. These transformations include: random rotations, translations, scaling, flips, and more. Augmentation helps in increasing the dataset size, introducing natural variations, and improving model performance by making it more invariant to specific transformations. The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
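As a rough illustration of the augmentation step described above, a pipeline of random rotations, translations, scaling, flips, and colour jitter could be assembled with torchvision; the specific transforms and parameter values below are assumptions, not the settings used for this dataset.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: random rotations, translations,
# scaling, flips, and mild colour jitter applied to document-style images.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Usage (assuming `img` is a PIL image of a generated passport):
# augmented_tensor = augment(img)
```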
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Bhavesh Padharia
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Finance dataset with fake information such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. It can be used for educational purposes as well as for testing.
The script below generates a dataset with fake transaction information: transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. Adjust the num_rows variable to specify the number of rows you want in your dataset. Finally, the dataset is saved to a CSV file named finance_dataset.csv. You can modify the fields or add additional fields according to your requirements.
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 15000  # number of synthetic transactions to generate

data = {
    'Transaction_ID': [fake.uuid4() for _ in range(num_rows)],
    'Date': [fake.date_time_this_year() for _ in range(num_rows)],
    'Amount': [round(random.uniform(10, 10000), 2) for _ in range(num_rows)],
    'Currency': [fake.currency_code() for _ in range(num_rows)],
    'Description': [fake.bs() for _ in range(num_rows)],
    'Category': [random.choice(['Food', 'Transport', 'Shopping', 'Entertainment', 'Utilities']) for _ in range(num_rows)],
    'Merchant': [fake.company() for _ in range(num_rows)],
    'Customer': [fake.name() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Country': [fake.country() for _ in range(num_rows)],
}

df = pd.DataFrame(data)
df.to_csv('finance_dataset.csv', index=False)
df.head()
```
We include a description of the data sets in the metadata as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. The R code is available online at: https://github.com/warrenjl/SpGPCW

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (data dictionary):
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics, Oxford University Press, Oxford, UK, 1-30 (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
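The actual synthesis scripts live in the GitHub repo noted above. Purely as an illustration of the second step, the sketch below shows how a Cox model fitted on real follow-up data could be used to simulate yearly death events for synthetic subjects, using the Python lifelines package; the column names, yearly grid, and helper function are assumptions (the original work used R).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# `real` holds numeric covariates plus follow-up time in years ("time") and a
# death indicator ("death"); `synthetic` holds the same covariate columns.
def simulate_deaths(real: pd.DataFrame, synthetic: pd.DataFrame,
                    n_years: int = 10, seed: int = 1) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cph = CoxPHFitter()
    cph.fit(real, duration_col="time", event_col="death")

    covariates = [c for c in real.columns if c not in ("time", "death")]
    times = np.arange(0, n_years + 1)
    # Survival curve for each synthetic subject on a yearly grid.
    surv = cph.predict_survival_function(synthetic[covariates], times=times)

    out = synthetic.copy()
    out["death_year"] = np.nan
    for j, subject in enumerate(surv.columns):
        s = surv[subject].to_numpy()
        for year in range(1, n_years + 1):
            # Conditional probability of dying this year, given survival so far.
            p = 1.0 - s[year] / s[year - 1]
            if rng.random() < p:
                out.loc[out.index[j], "death_year"] = year
                break
    return out
```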
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Artificial metagenomic reads from mock communities. Data from https://www.imicrobe.us/#/projects/265. The raw sequence data from the staggered mock community (SRP115095 in project accession PRJNA397434) were used to develop simulated metagenomes. We used the known staggered mock community abundance profile to generate a simulated metagenome using GemSim (454 sequencing) with 0.5, 1, 5, and 10 million reads based on the relative abundance of organisms in the staggered mock community. We developed four simulated metagenomes to test the effect of changing the dominant organism abundance and genetic composition, including: 10 million reads from the staggered mock community (mock 1), the mock community with alterations in a few abundant species (mock 2), the mock community with many alterations in abundant species (mock 3), and mock 3 with additional sequences from archaea to further alter the genetic composition (mock 4). The same community profiles were used to generate a paired-end Illumina dataset (100 million reads), using GemSim (Illumina v4 error model). Finally, using SimLord, the community profiles were used to generate simulated third-generation sequencing datasets (Pacific Biosciences SMRT sequencing - 1 million reads). SimLord default parameters were used to generate those simulated datasets.
https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all information is true. Some of it is actually malicious and harmful, and it makes it harder for us to trust any piece of information we come across. Not only that, bad actors are now able to use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there have been discussions of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text, sourced from various news articles on the web as well as text generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
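A hedged sketch of reproducing such a class-balanced three-way split with scikit-learn; the file name, column names, and split fractions are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed columns: `text` (the article/snippet) and `label` (0 = real, 1 = generated).
df = pd.read_csv("fake_news.csv")  # placeholder filename

# Carve out a test set first, then split the remainder into train/validation,
# stratifying on the label so both classes stay balanced in every subset.
train_val, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)
train, val = train_test_split(train_val, test_size=0.1, stratify=train_val["label"], random_state=42)

print(train["label"].value_counts(normalize=True))
print(val["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```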
This dataset was published on AI Crowd in a so-called KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd, with an aim to make it really easy for anyone to get started with the world of AI. This AI Blitz⚡challenge was an exclusive challenge just for the students and the faculty of the Kalinga Institute of Industrial Technology.
Data from: Warm-up and metrics of song performance: a commentary on Vazquez-Cardona et al.
https://doi.org/10.5061/dryad.fxpnvx112
Interactive plots that re-generate random data in Figure 2 of Cardoso (2024) "Warm-up and metrics of song performance: a commentary on Vazquez-Cardona et al." Behavioral Ecology.
Each figure plot is in a separate spreadsheet. Press F9 to re-generate the simulated data.
In each spreadsheet:
. Column A indicates the bin number for the X axis data, as per the explanations in the legend of Figure 2.
. Column B contains the formula for X axis data: random numbers from a uniform distribution within each bin.
. Column C contains the formula for Y axis data: absolute value of random numbers from a normal distribution with mean 0 and standard deviation 1.
. In panels C and D, these random numbers for Y are divided or multiplied by 10+X, as explained in the legend of Figure 2.
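The same simulated values can be regenerated outside the spreadsheet; for example, a short NumPy sketch following the column descriptions above (the number of bins and the number of points per bin are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

n_per_bin = 50                 # assumed number of points per bin
bin_edges = np.arange(0, 11)   # assumed ten unit-width bins on the X axis

x, y_c, y_d = [], [], []
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    # Column B: X values uniformly distributed within each bin.
    xb = rng.uniform(lo, hi, n_per_bin)
    # Column C: absolute values of standard normal draws (mean 0, sd 1).
    yb = np.abs(rng.normal(0.0, 1.0, n_per_bin))
    x.append(xb)
    # Panels C and D: the same draws divided or multiplied by (10 + X).
    y_c.append(yb / (10 + xb))
    y_d.append(yb * (10 + xb))

x, y_c, y_d = map(np.concatenate, (x, y_c, y_d))
```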
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This provides a brief overview of the database. Further details are provided at: https://peterhcharlton.github.io/pwdb/ppwdb.html
Background: The shape of the arterial pulse wave (PW) is a rich source of information on cardiovascular (CV) health, since it is influenced by both the heart and the vasculature. Consequently, many algorithms have been proposed to estimate clinical parameters from PWs. However, it is difficult and costly to acquire comprehensive datasets with which to assess their performance. We are aiming to address this difficulty by creating a database of simulated PWs under a range of CV conditions, representative of a healthy population. The database provided here is an initial version which has already been used to gain some novel insights into haemodynamics.
Methods: Baseline PWs were simulated using 1D computational modelling. CV model parameters were varied across normal healthy ranges to simulate a sample of subjects for each age decade from 25 to 75 years. The model was extended to simulate photoplethysmographic (PPG) PWs at common measurement sites, in addition to the pressure (ABP), flow rate (Q), flow velocity (U) and diameter (D) PWs produced by the model.
Validation: The database was verified by comparing simulated PWs with in vivo PWs. Good agreement was observed, with age-related changes in blood pressure and wave morphology well reproduced.
Conclusion: This database is a valuable resource for development and pre-clinical assessment of PW analysis algorithms. It is particularly useful because it contains several types of PWs at multiple measurement sites, and the exact CV conditions which generated each PW are known.
Future work: However, there are two limitations: (i) the database does not exhibit the wide variation in cardiovascular properties observed across a population sample; and (ii) the methods used to model changes with age have been improved since creating this initial version. Therefore, we are currently creating a more comprehensive database which addresses these limitations.
Accompanying Presentation: This database was originally presented at the BioMedEng18 Conference. The presentation describing the methods for creating the database, and providing an introduction to the database, is available at: https://www.youtube.com/watch?v=X8aPZFs8c08 . The accompanying abstract is available here.
Accompanying Manual: Further information on how to use the PWDB datasets, including this preliminary dataset, is provided in the user manual. Further details on the contents of the dataset files are available here.
Citation: When using this dataset please cite this publication:
Charlton P.H. et al. Modelling arterial pulse wave propagation during healthy ageing, In World Congress of Biomechanics 2018, Dublin, Ireland, 2018.
Version History:
v.1.0: Originally uploaded to PhysioNet. This is the version which was used in the accompanying presentation.
v.2.0: The initial upload to this DOI. The database was curated using the PWDB Algorithms v.0.1.1. It differs slightly from the originally reported version in that: (i) the augmentation pressure and index were calculated at the aortic root rather than the carotid artery.
Text adapted from: Charlton P.H. et al., 'A database for the development of pulse wave analysis algorithms', BioMedEng18, London, 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Mock_data is a dataset for object detection tasks - it contains Handguns and Rifles annotations for 4,183 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
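A minimal sketch of pulling the dataset programmatically with the Roboflow Python package; the API key, workspace/project identifiers, version number, and export format below are placeholders.

```python
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                             # your Roboflow API key
project = rf.workspace("your-workspace").project("mock_data")     # placeholder identifiers
dataset = project.version(1).download("coco")                     # placeholder version and export format
print(dataset.location)  # local folder containing images and annotations
```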
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.gnu.org/licenses/agpl.txt
This repository contains all the data files for a simulated exome-sequencing study of 150 families, ascertained to contain at least four members affected with lymphoid cancer. Please note that previous versions of this repository omitted a key file linking the genotypes of individuals to their family and individual IDs; this file, geno_key.txt, is now included. All other files remain the same as in previous versions.
The simulated data can be found in the files section below. The files are:
All the scripts used to generate these data can be found in the GitHub repository archived at https://zenodo.org/records/12694914
We have also uploaded one intermediate .Rdata file, Chromwide.Rdata, to save the user substantial time when running the associated RMarkdown script for the simulation. We recommend loading Chromwide.Rdata into your R workspace rather than generating it from scratch.
According to the latest research, the OpenAPI Mocking Tool market reached a size of USD 420.7 million in 2024. With a strong compound annual growth rate (CAGR) of 13.2% projected from 2025 to 2033, the market is forecast to reach USD 1,179.5 million by 2033. The surge in demand for rapid API development, enhanced testing capabilities, and the increasing adoption of microservices architectures are among the primary growth drivers for this market. Enterprises across several verticals are accelerating their digital transformation journeys, further propelling the adoption of OpenAPI mocking tools globally.
One of the most significant growth factors for the OpenAPI Mocking Tool market is the escalating adoption of APIs across diverse industries. As organizations strive to build scalable, modular, and interoperable software systems, the need for effective API management and testing becomes imperative. OpenAPI mocking tools enable developers to simulate API behaviors before the actual backend is built, thereby reducing development cycles and improving collaboration between frontend and backend teams. This capability is particularly crucial in agile and DevOps environments, where rapid iteration and continuous integration are standard. The ability to quickly create mock servers and endpoints accelerates prototyping and testing, leading to faster time-to-market for digital products and services. This trend is further amplified by the increasing complexity of software ecosystems, where seamless integration and interoperability are non-negotiable.
Another key driver fueling the OpenAPI Mocking Tool market is the rise of microservices and cloud-native architectures. As enterprises migrate from monolithic systems to distributed microservices, the need for robust API testing and simulation grows exponentially. OpenAPI mocking tools provide essential support for these architectures by allowing teams to decouple service development, enabling parallel workstreams and reducing bottlenecks. The shift towards cloud-native development also means that organizations require tools that are compatible with both on-premises and cloud environments, further boosting the demand for flexible and scalable mocking solutions. Additionally, the proliferation of DevOps practices has underscored the importance of automated testing and continuous delivery, both of which are facilitated by advanced API mocking capabilities.
Furthermore, the increasing regulatory scrutiny around data privacy and security is prompting organizations to adopt OpenAPI mocking tools for safer and more compliant software testing. By simulating APIs and using synthetic data, companies can thoroughly test their applications without exposing sensitive information, thereby mitigating compliance risks. This is particularly relevant in industries such as BFSI, healthcare, and government, where data protection regulations are stringent. The growing emphasis on quality assurance and risk management is thus driving the integration of OpenAPI mocking tools into the broader software development lifecycle. Vendors are responding by offering enhanced features such as automated validation, detailed reporting, and seamless integration with CI/CD pipelines, further consolidating the market’s growth trajectory.
From a regional perspective, North America continues to dominate the OpenAPI Mocking Tool market, accounting for the largest share in 2024. The presence of leading technology companies, early adoption of cloud and DevOps practices, and a mature API ecosystem are key factors contributing to the region’s leadership. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digitalization, a burgeoning startup ecosystem, and increasing investments in IT infrastructure. Europe also holds a significant share, propelled by stringent regulatory requirements and a strong focus on software quality. Meanwhile, Latin America and the Middle East & Africa are emerging markets with untapped potential, as organizations in these regions accelerate their digital transformation initiatives. The regional dynamics underscore the global nature of the OpenAPI Mocking Tool market, with opportunities and challenges varying across geographies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is supplemental data and code for: T. Proctor et al., Detecting and tracking drift in quantum information processors, Nat. Comm. 11, 5396 (2020).
Please direct any questions to Timothy Proctor (tjproct@sandia.gov).
This folder contains all the data and the analysis code to generate the results presented in that paper. The core data analysis routines use PyGSTi, which can be found at https://github.com/pyGSTio/pyGSTi.
The analysis was run using pyGSTi commit 7c6ddd1de209b795ea39bfb69d010b687e812d07. This code does not work on the latest full release of pyGSTi (0.9.9). It is anticipated that it will work with the next full release of pyGSTi (0.9.10).
Below is a basic guide to navigating this SI:
Time-resolved Ramsey tomography on experimental data.
Directory: ramsey/experiment
This folder contains the data and analysis code for the time-resolved Ramsey experiment, the results of which are presented in Figure 1 of the paper. The folder contains a single Jupyter notebook, which runs all of the data analysis.
Time-resolved randomized benchmarking (RB) on simulated data.
Directory: rb/simulation
This folder contains the data and analysis code for the simulation of time-resolved RB, the results of which are presented in Figure 2 of the paper. The folder contains a single Jupyter notebook, which runs all of the data analysis on the simulated data, and which can be used to run new simulations with the same noise model.
Time-resolved gate set tomography (GST) on simulated data.
Directory: gst/simulation
This contains the data and analysis code for the simulation of time-resolved GST, the results of which are presented in Figure 2 of the paper. The raw simulated data is contained in the "data" folder. All the code is contained in the "analysis" folder. This contains the following code files:
create_simulated_data.py : this generates the simulated data. This was run using MPI on 20 cores.
drift.ipynb : this contains the general circuit-agnostic drift analysis.
trgst_fit.py : this contains the TR-GST model-fitting code. This was run using MPI on 20 cores.
tdmodel.py : encodes the general time-dependent model that the data is fit to.
Time-resolved gate set tomography (GST) on experimental data.
Directory: gst/experiments
This folder contains the data and analysis code for the two time-resolved GST experiments, the results of which are presented in Figure 3 of the paper. The raw data is contained in the two folders "data/1" and "data/2", corresponding to the first and second experiment, respectively. All analysis code is contained in the "analysis" folder. This contains the following code files:
drift.ipynb : this contains the general circuit-agnostic drift analysis.
gst.ipynb : this contains the standard GST analysis, used to inform the TR-GST analysis.
trgst_fit.py : this contains the TR-GST model-fitting code. This was run using MPI on 20 cores.
trgst_plotting.ipynb : this contains code that analyzes the results of the TR-GST fit.
tdmodel.py : encodes the general time-dependent model that the data is fit to.