Creating a robust employee dataset for data analysis and visualization involves several key fields that capture different aspects of an employee's information. Here's a list of fields you might consider including:

Employee ID: A unique identifier for each employee.
Name: First name and last name of the employee.
Gender: Male, female, non-binary, etc.
Date of Birth: Birthdate of the employee.
Email Address: Contact email of the employee.
Phone Number: Contact number of the employee.
Address: Home or work address of the employee.
Department: The department the employee belongs to (e.g., HR, Marketing, Engineering).
Job Title: The specific job title of the employee.
Manager ID: ID of the employee's manager.
Hire Date: Date when the employee was hired.
Salary: Employee's salary or compensation.
Employment Status: Full-time, part-time, contractor, etc.
Employee Type: Regular, temporary, contract, etc.
Education Level: Highest level of education attained by the employee.
Certifications: Any relevant certifications the employee holds.
Skills: Specific skills or expertise possessed by the employee.
Performance Ratings: Ratings or evaluations of employee performance.
Work Experience: Previous work experience of the employee.
Benefits Enrollment: Information on benefits chosen by the employee (e.g., healthcare plan, retirement plan).
Work Location: Physical location where the employee works.
Work Hours: Regular working hours or shifts of the employee.
Employee Status: Active, on leave, terminated, etc.
Emergency Contact: Contact information of the employee's emergency contact person.
Employee Satisfaction Survey Responses: Data from employee satisfaction surveys, if applicable.
Code Url: https://github.com/intellisenseCodez/faker-data-generator
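For illustration, a minimal sketch of generating a few of these fields with the Faker library is shown below. The field selection, value ranges, and output filename are assumptions for this example, not taken from the linked repository.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_employees = 100  # illustrative size

departments = ["HR", "Marketing", "Engineering", "Finance"]  # assumed categories
statuses = ["Full-time", "Part-time", "Contractor"]

rows = []
for emp_id in range(1, num_employees + 1):
    rows.append({
        "Employee ID": emp_id,
        "Name": fake.name(),
        "Date of Birth": fake.date_of_birth(minimum_age=21, maximum_age=65),
        "Email Address": fake.email(),
        "Department": random.choice(departments),
        "Job Title": fake.job(),
        "Hire Date": fake.date_between(start_date="-10y", end_date="today"),
        "Salary": round(random.uniform(30_000, 150_000), 2),
        "Employment Status": random.choice(statuses),
    })

df = pd.DataFrame(rows)
df.to_csv("employee_dataset.csv", index=False)
```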
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
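As a quick illustration, the format guarantees above could be checked with the Hugging Face datasets library roughly as follows. The dataset identifier is a placeholder, not the actual Hub name.

```python
from datasets import load_dataset

# Placeholder identifier -- substitute the actual Hub name of this dataset.
ds = load_dataset("your-username/stackoverflow-chat-dutch", split="train")

for example in ds:
    text = example["input"]
    # Each item starts with the fixed Dutch opening sentence and contains
    # both speaker identifiers at least once, as described above.
    assert text.startswith("Het gesprek tussen de mens en de AI-assistent.")
    assert "[|AI|]" in text and "[|Human|]" in text
```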
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
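The original translation script is not reproduced here, but a minimal sketch of such a call with the OpenAI Python client could look like the following. This uses the current client syntax (the original work likely used an earlier client version), and the helper function is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "to the requirements that are given to you."
)

# The full template quoted above; truncated here for brevity.
CONVERSATION_TRANSLATION_PROMPT = "You are asked to translate a conversation ... {src_lang} ... {tgt_lang} ..."

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```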
Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
Description: This dataset is created solely for the purpose of practice and learning. It contains entirely fake and fabricated information, including names, phone numbers, emails, cities, ages, and other attributes. None of the information in this dataset corresponds to real individuals or entities. It serves as a resource for those who are learning data manipulation, analysis, and machine learning techniques. Please note that the data is completely fictional and should not be treated as representing any real-world scenarios or individuals.
Attributes:
- phone_number: Fake phone numbers in various formats.
- name: Fictitious names generated for practice purposes.
- email: Imaginary email addresses created for the dataset.
- city: Made-up city names to simulate geographical diversity.
- age: Randomly generated ages for practice analysis.
- sex: Simulated gender values (Male, Female).
- married_status: Synthetic marital status information.
- job: Fictional job titles for practicing data analysis.
- income: Fake income values for learning data manipulation.
- religion: Pretend religious affiliations for practice.
- nationality: Simulated nationalities for practice purposes.
Please be aware that this dataset is not based on real data and should be used exclusively for educational purposes.
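A brief, hedged sketch of how a table with these attributes could be produced with Faker plus random choices; the category lists and numeric ranges below are illustrative assumptions, not the values used to build this dataset.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
n = 1000  # illustrative size

records = [{
    "phone_number": fake.phone_number(),
    "name": fake.name(),
    "email": fake.email(),
    "city": fake.city(),
    "age": random.randint(18, 80),
    "sex": random.choice(["Male", "Female"]),
    "married_status": random.choice(["Single", "Married", "Divorced", "Widowed"]),
    "job": fake.job(),
    "income": round(random.uniform(15_000, 200_000), 2),
    "religion": random.choice(["Christianity", "Islam", "Hinduism", "Buddhism", "None"]),
    "nationality": fake.country(),
} for _ in range(n)]

df = pd.DataFrame(records)
```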
In situations where data is not readily available but needed, you'll have to resort to building up the data yourself. There are many methods you can use to acquire this data, from web scraping to APIs. But sometimes you'll end up needing to create fake or "dummy" data. Dummy data is useful when you know the exact features you'll be using and the data types involved, but you just don't have the data itself.
Features Description
Reference - https://towardsdatascience.com/build-a-your-own-custom-dataset-using-python-9296540a0178
Dummy Dataset for AutoTrain Benchmark
This dataset contains dummy data that's needed to create AutoTrain projects for benchmarks like RAFT. See here for more details.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file; "Simulated_Dataset.RData".

Metadata (data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize and plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
• "CWVS_LMC.txt": This code is delivered as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, it can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
• "Results_Summary.txt": This code is also delivered as a .txt file containing R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, it can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
• For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study. How to use the information:
• Load the "Simulated_Dataset.RData" workspace
• Run the code contained in "CWVS_LMC.txt"
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

Below is the replication procedure for the portion of the analyses using a simulated dataset.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics, Oxford University Press, Oxford, UK, 1-30 (2019).
https://spdx.org/licenses/CC0-1.0.html
Social networks are tied to population dynamics; interactions are driven by population density and demographic structure, while social relationships can be key determinants of survival and reproductive success. However, difficulties integrating models used in demography and network analysis have limited research at this interface. We introduce the R package genNetDem for simulating integrated network-demographic datasets. It can be used to create longitudinal social networks and/or capture-recapture datasets with known properties. It incorporates the ability to generate populations and their social networks, generate grouping events using these networks, simulate social network effects on individual survival, and flexibly sample these longitudinal datasets of social associations. By generating co-capture data with known statistical relationships, it provides functionality for methodological research. We demonstrate its use with case studies testing how imputation and sampling design influence the success of adding network traits to conventional Cormack-Jolly-Seber (CJS) models. We show that incorporating social network effects in CJS models generates qualitatively accurate results, but with downward-biased parameter estimates when network position influences survival. Biases are greater when fewer interactions are sampled or fewer individuals are observed in each interaction. While our results indicate the potential of incorporating social effects within demographic models, they show that imputing missing network measures alone is insufficient to accurately estimate social effects on survival, pointing to the importance of incorporating network imputation approaches. genNetDem provides a flexible tool to aid these methodological advancements and help researchers test other sampling considerations in social network studies.

Methods: The dataset and code stored here are for Case Studies 1 and 2 in the paper. Datasets were generated using simulations in R. Here we provide 1) the R code used for the simulations; 2) the simulation outputs (as .RDS files); and 3) the R code to analyse simulation outputs and generate the tables and figures in the paper.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques involve applying various transformations to existing data samples to create new ones. These transformations include: random rotations, translations, scaling, flips, and more. Augmentation helps in increasing the dataset size, introducing natural variations, and improving model performance by making it more invariant to specific transformations. The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
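As a rough illustration of the augmentation step described above, a pipeline of random rotations, translations, scaling, flips, and colour jitter could be assembled with torchvision; the specific transforms and parameter values below are assumptions, not the settings used for this dataset.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: random rotations, translations,
# scaling, flips, and mild colour jitter applied to document-style images.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Usage (assuming `img` is a PIL image of a generated passport):
# augmented_tensor = augment(img)
```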
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Bhavesh Padharia
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Finance dataset with fake information such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. It can be used for educational purposes as well as for testing.
The script below generates a dataset with fake transaction information: transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. Adjust the num_rows variable to specify the number of rows you want in your dataset. Finally, the dataset is saved to a CSV file named finance_dataset.csv. You can modify the fields or add additional fields according to your requirements.
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 15000  # number of synthetic transactions to generate

data = {
    'Transaction_ID': [fake.uuid4() for _ in range(num_rows)],
    'Date': [fake.date_time_this_year() for _ in range(num_rows)],
    'Amount': [round(random.uniform(10, 10000), 2) for _ in range(num_rows)],
    'Currency': [fake.currency_code() for _ in range(num_rows)],
    'Description': [fake.bs() for _ in range(num_rows)],
    'Category': [random.choice(['Food', 'Transport', 'Shopping', 'Entertainment', 'Utilities']) for _ in range(num_rows)],
    'Merchant': [fake.company() for _ in range(num_rows)],
    'Customer': [fake.name() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Country': [fake.country() for _ in range(num_rows)],
}

df = pd.DataFrame(data)
df.to_csv('finance_dataset.csv', index=False)
df.head()
```
We include a description of the data sets in the metadata as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. The R code is available online at: https://github.com/warrenjl/SpGPCW

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (data dictionary):
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics, Oxford University Press, Oxford, UK, 1-30 (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
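The actual synthesis scripts live in the GitHub repo noted above. Purely as an illustration of the second step, the sketch below shows how a Cox model fitted on real follow-up data could be used to simulate yearly death events for synthetic subjects, using the Python lifelines package; the column names, yearly grid, and helper function are assumptions (the original work used R).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# `real` holds numeric covariates plus follow-up time in years ("time") and a
# death indicator ("death"); `synthetic` holds the same covariate columns.
def simulate_deaths(real: pd.DataFrame, synthetic: pd.DataFrame,
                    n_years: int = 10, seed: int = 1) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cph = CoxPHFitter()
    cph.fit(real, duration_col="time", event_col="death")

    covariates = [c for c in real.columns if c not in ("time", "death")]
    times = np.arange(0, n_years + 1)
    # Survival curve for each synthetic subject on a yearly grid.
    surv = cph.predict_survival_function(synthetic[covariates], times=times)

    out = synthetic.copy()
    out["death_year"] = np.nan
    for j, subject in enumerate(surv.columns):
        s = surv[subject].to_numpy()
        for year in range(1, n_years + 1):
            # Conditional probability of dying this year, given survival so far.
            p = 1.0 - s[year] / s[year - 1]
            if rng.random() < p:
                out.loc[out.index[j], "death_year"] = year
                break
    return out
```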
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Artificial metagenomic reads from mock communities. Data from https://www.imicrobe.us/#/projects/265. The raw sequence data from the staggered mock community (SRP115095 in project accession PRJNA397434) were used to develop simulated metagenomes. We used the known staggered mock community abundance profile to generate a simulated metagenome using GemSim (454 sequencing) with 0.5, 1, 5, and 10 million reads based on the relative abundance of organisms in the staggered mock community. We developed four simulated metagenomes to test the effect of changing the dominant organism abundance and genetic composition, including: 10 million reads from the staggered mock community (mock 1), the mock community with alterations in a few abundant species (mock 2), the mock community with many alterations in abundant species (mock 3), and mock 3 with additional sequences from archaea to further alter the genetic composition (mock 4). The same community profiles were used to generate a paired-end Illumina dataset (100 million reads), using GemSim (Illumina v4 error model). Finally, using SimLord, the community profiles were used to generate simulated third-generation sequencing datasets (Pacific Biosciences SMRT sequencing - 1 million reads). SimLord default parameters were used to generate those simulated datasets.
https://creativecommons.org/publicdomain/zero/1.0/
Today, we are producing more information than ever before, but not all information is true. Some of it is actually malicious and harmful, and it makes it harder for us to trust any piece of information we come across. Not only that, bad actors are now able to use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there have been discussions of how it could be misused for generating misleading news articles, automating the production of abusive or fake content for social media, and automating the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text, sourced from various news articles on the web as well as text generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
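A hedged sketch of reproducing such a class-balanced three-way split with scikit-learn; the file name, column names, and split fractions are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed columns: `text` (the article/snippet) and `label` (0 = real, 1 = generated).
df = pd.read_csv("fake_news.csv")  # placeholder filename

# Carve out a test set first, then split the remainder into train/validation,
# stratifying on the label so both classes stay balanced in every subset.
train_val, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)
train, val = train_test_split(train_val, test_size=0.1, stratify=train_val["label"], random_state=42)

print(train["label"].value_counts(normalize=True))
print(val["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```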
This dataset was published on AI Crowd in a so-called KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd, with an aim to make it really easy for anyone to get started with the world of AI. This AI Blitz⚡challenge was an exclusive challenge just for the students and the faculty of the Kalinga Institute of Industrial Technology.
Data from: Warm-up and metrics of song performance: a commentary on Vazquez-Cardona et al.
https://doi.org/10.5061/dryad.fxpnvx112
Interactive plots that re-generate random data in Figure 2 of Cardoso (2024) "Warm-up and metrics of song performance: a commentary on Vazquez-Cardona et al." Behavioral Ecology.
Each figure plot is in a separate spreadsheet. Press F9 to re-generate the simulated data.
In each spreadsheet:
. Column A indicates the bin number for the X axis data, as per the explanations in the legend of Figure 2.
. Column B contains the formula for X axis data: random numbers from a uniform distribution within each bin.
. Column C contains the formula for Y axis data: absolute value of random numbers from a normal distribution with mean 0 and standard deviation 1.
. In panels C and D, these random numbers for Y are divided or multiplied by 10+X, as explained in the legend of Figure 2.
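The same simulated values can be regenerated outside the spreadsheet; for example, a short NumPy sketch following the column descriptions above (the number of bins and the number of points per bin are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

n_per_bin = 50                 # assumed number of points per bin
bin_edges = np.arange(0, 11)   # assumed ten unit-width bins on the X axis

x, y_c, y_d = [], [], []
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    # Column B: X values uniformly distributed within each bin.
    xb = rng.uniform(lo, hi, n_per_bin)
    # Column C: absolute values of standard normal draws (mean 0, sd 1).
    yb = np.abs(rng.normal(0.0, 1.0, n_per_bin))
    x.append(xb)
    # Panels C and D: the same draws divided or multiplied by (10 + X).
    y_c.append(yb / (10 + xb))
    y_d.append(yb * (10 + xb))

x, y_c, y_d = map(np.concatenate, (x, y_c, y_d))
```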
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This provides a brief overview of the database. Further details are provided at: https://peterhcharlton.github.io/pwdb/ppwdb.html
Background: The shape of the arterial pulse wave (PW) is a rich source of information on cardiovascular (CV) health, since it is influenced by both the heart and the vasculature. Consequently, many algorithms have been proposed to estimate clinical parameters from PWs. However, it is difficult and costly to acquire comprehensive datasets with which to assess their performance. We are aiming to address this difficulty by creating a database of simulated PWs under a range of CV conditions, representative of a healthy population. The database provided here is an initial version which has already been used to gain some novel insights into haemodynamics.
Methods: Baseline PWs were simulated using 1D computational modelling. CV model parameters were varied across normal healthy ranges to simulate a sample of subjects for each age decade from 25 to 75 years. The model was extended to simulate photoplethysmographic (PPG) PWs at common measurement sites, in addition to the pressure (ABP), flow rate (Q), flow velocity (U) and diameter (D) PWs produced by the model.
Validation: The database was verified by comparing simulated PWs with in vivo PWs. Good agreement was observed, with age-related changes in blood pressure and wave morphology well reproduced.
Conclusion: This database is a valuable resource for development and pre-clinical assessment of PW analysis algorithms. It is particularly useful because it contains several types of PWs at multiple measurement sites, and the exact CV conditions which generated each PW are known.
Future work: However, there are two limitations: (i) the database does not exhibit the wide variation in cardiovascular properties observed across a population sample; and (ii) the methods used to model changes with age have been improved since creating this initial version. Therefore, we are currently creating a more comprehensive database which addresses these limitations.
Accompanying Presentation: This database was originally presented at the BioMedEng18 Conference. The presentation describing the methods for creating the database, and providing an introduction to the database, is available at: https://www.youtube.com/watch?v=X8aPZFs8c08 . The accompanying abstract is available here.
Accompanying Manual: Further information on how to use the PWDB datasets, including this preliminary dataset, is provided in the user manual. Further details on the contents of the dataset files are available here.
Citation: When using this dataset please cite this publication:
Charlton P.H. et al. Modelling arterial pulse wave propagation during healthy ageing, In World Congress of Biomechanics 2018, Dublin, Ireland, 2018.
Version History:
v.1.0: Originally uploaded to PhysioNet. This is the version which was used in the accompanying presentation.
v.2.0: The initial upload to this DOI. The database was curated using the PWDB Algorithms v.0.1.1. It differs slightly from the originally reported version in that: (i) the augmentation pressure and index were calculated at the aortic root rather than the carotid artery.
Text adapted from: Charlton P.H. et al., 'A database for the development of pulse wave analysis algorithms', BioMedEng18, London, 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Mock_data is a dataset for object detection tasks - it contains Handguns and Rifles annotations for 4,183 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
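A minimal sketch of pulling the dataset programmatically with the Roboflow Python package; the API key, workspace/project identifiers, version number, and export format below are placeholders.

```python
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                             # your Roboflow API key
project = rf.workspace("your-workspace").project("mock_data")     # placeholder identifiers
dataset = project.version(1).download("coco")                     # placeholder version and export format
print(dataset.location)  # local folder containing images and annotations
```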
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.gnu.org/licenses/agpl.txt
This repository contains all the data files for a simulated exome-sequencing study of 150 families, ascertained to contain at least four members affected with lymphoid cancer. Please note that previous versions of this repository omitted a key file linking the genotypes of individuals to their family and individual IDs; this file, geno_key.txt, is now included. All other files remain the same as in previous versions.
The simulated data can be found in the files section below. The files are:
All the scripts used to generate these data can be found in the GitHub repository archived at https://zenodo.org/records/12694914
We have also uploaded one intermediate .Rdata file, Chromwide.Rdata, to save the user substantial time when running the associated RMarkdown script for the simulation. We recommend loading Chromwide.Rdata into your R workspace rather than generating it from scratch.
According to the latest research, the OpenAPI Mocking Tool market reached a size of USD 420.7 million in 2024. With a strong compound annual growth rate (CAGR) of 13.2% projected from 2025 to 2033, the market is forecast to reach USD 1,179.5 million by 2033. The surge in demand for rapid API development, enhanced testing capabilities, and the increasing adoption of microservices architectures are among the primary growth drivers for this market. Enterprises across several verticals are accelerating their digital transformation journeys, further propelling the adoption of OpenAPI mocking tools globally.
One of the most significant growth factors for the OpenAPI Mocking Tool market is the escalating adoption of APIs across diverse industries. As organizations strive to build scalable, modular, and interoperable software systems, the need for effective API management and testing becomes imperative. OpenAPI mocking tools enable developers to simulate API behaviors before the actual backend is built, thereby reducing development cycles and improving collaboration between frontend and backend teams. This capability is particularly crucial in agile and DevOps environments, where rapid iteration and continuous integration are standard. The ability to quickly create mock servers and endpoints accelerates prototyping and testing, leading to faster time-to-market for digital products and services. This trend is further amplified by the increasing complexity of software ecosystems, where seamless integration and interoperability are non-negotiable.
Another key driver fueling the OpenAPI Mocking Tool market is the rise of microservices and cloud-native architectures. As enterprises migrate from monolithic systems to distributed microservices, the need for robust API testing and simulation grows exponentially. OpenAPI mocking tools provide essential support for these architectures by allowing teams to decouple service development, enabling parallel workstreams and reducing bottlenecks. The shift towards cloud-native development also means that organizations require tools that are compatible with both on-premises and cloud environments, further boosting the demand for flexible and scalable mocking solutions. Additionally, the proliferation of DevOps practices has underscored the importance of automated testing and continuous delivery, both of which are facilitated by advanced API mocking capabilities.
Furthermore, the increasing regulatory scrutiny around data privacy and security is prompting organizations to adopt OpenAPI mocking tools for safer and more compliant software testing. By simulating APIs and using synthetic data, companies can thoroughly test their applications without exposing sensitive information, thereby mitigating compliance risks. This is particularly relevant in industries such as BFSI, healthcare, and government, where data protection regulations are stringent. The growing emphasis on quality assurance and risk management is thus driving the integration of OpenAPI mocking tools into the broader software development lifecycle. Vendors are responding by offering enhanced features such as automated validation, detailed reporting, and seamless integration with CI/CD pipelines, further consolidating the market’s growth trajectory.
From a regional perspective, North America continues to dominate the OpenAPI Mocking Tool market, accounting for the largest share in 2024. The presence of leading technology companies, early adoption of cloud and DevOps practices, and a mature API ecosystem are key factors contributing to the region’s leadership. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digitalization, a burgeoning startup ecosystem, and increasing investments in IT infrastructure. Europe also holds a significant share, propelled by stringent regulatory requirements and a strong focus on software quality. Meanwhile, Latin America and the Middle East & Africa are emerging markets with untapped potential, as organizations in these regions accelerate their digital transformation initiatives. The regional dynamics underscore the global nature of the OpenAPI Mocking Tool market, with opportunities and challenges varying across geographies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is supplemental data and code for: T. Proctor et al., Detecting and tracking drift in quantum information processors, Nat. Comm. 11, 5396 (2020).
Please direct any questions to Timothy Proctor (tjproct@sandia.gov).
This folder contains all the data and the analysis code to generate the results presented in that paper. The core data analysis routines use PyGSTi, which can be found at https://github.com/pyGSTio/pyGSTi.
The analysis was run using pyGSTi commit 7c6ddd1de209b795ea39bfb69d010b687e812d07. This code does not work on the latest full release of pyGSTi (0.9.9). It is anticipated that it will work with the next full release of pyGSTi (0.9.10).
Below is a basic guide to navigating this SI:
Time-resolved Ramsey tomography on experimental data.
Directory: ramsey/experiment
This folder contains the data and analysis code for the time-resolved Ramsey experiment, the results of which are presented in Figure 1 of the paper. The folder contains a single Jupyter notebook, which runs all of the data analysis.
Time-resolved randomized benchmarking (RB) on simulated data.
Directory: rb/simulation
This folder contains the data and analysis code for the simulation of time-resolved RB, the results of which are presented in Figure 2 of the paper. The folder contains a single Jupyter notebook, which runs all of the data analysis on the simulated data, and which can be used to run new simulations with the same noise model.
Time-resolved gate set tomography (GST) on simulated data.
Directory: gst/simulation
This contains the data and analysis code for the simulation of time-resolved GST, the results of which are presented in Figure 2 of the paper. The raw simulated data is contained in the "data" folder. All the code is contained in the "analysis" folder. This contains the following code files:
create_simulated_data.py : this generates the simulated data. This was run using MPI on 20 cores.
drift.ipynb : this contains the general circuit-agnostic drift analysis.
trgst_fit.py : this contains the TR-GST model-fitting code. This was run using MPI on 20 cores.
tdmodel.py : encodes the general time-dependent model that the data is fit to.
Time-resolved gate set tomography (GST) on experimental data.
Directory: gst/experiments
This folder contains the data and analysis code for the two time-resolved GST experiments, the results of which are presented in Figure 3 of the paper. The raw data is contained in the two folders "data/1" and "data/2", corresponding to the first and second experiment, respectively. All analysis code is contained in the "analysis" folder. This contains the following code files:
drift.ipynb : this contains the general circuit-agnostic drift analysis.
gst.ipynb : this contains the standard GST analysis, used to inform the TR-GST analysis.
trgst_fit.py : this contains the TR-GST model-fitting code. This was run using MPI on 20 cores.
trgst_plotting.ipynb : this contains code that analyzes the results of the TR-GST fit.
tdmodel.py : encodes the general time-dependent model that the data is fit to.