100+ datasets found
  1. Fake Dataset for Practice

    • kaggle.com
    zip
    Updated Aug 21, 2023
    Cite
    Shuvo Kumar Basak-4004 (2023). Fake Dataset for Practice [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/fake-dataset-for-practice
    Explore at:
zip (1515599 bytes)
    Dataset updated
    Aug 21, 2023
    Authors
    Shuvo Kumar Basak-4004
    Description

This dataset is created solely for practice and learning. It contains entirely fake and fabricated information, including names, phone numbers, emails, cities, ages, and other attributes. None of the information in this dataset corresponds to real individuals or entities. It serves as a resource for those who are learning data manipulation, analysis, and machine learning techniques. Please note that the data is completely fictional and should not be treated as representing any real-world scenarios or individuals.

Attributes:
• phone_number: Fake phone numbers in various formats.
• name: Fictitious names generated for practice purposes.
• email: Imaginary email addresses created for the dataset.
• city: Made-up city names to simulate geographical diversity.
• age: Randomly generated ages for practice analysis.
• sex: Simulated gender values (Male, Female).
• married_status: Synthetic marital status information.
• job: Fictional job titles for practicing data analysis.
• income: Fake income values for learning data manipulation.
• religion: Pretend religious affiliations for practice.
• nationality: Simulated nationalities for practice purposes.

    Please be aware that this dataset is not based on real data and should be used exclusively for educational purposes.
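For readers who want to produce a comparable practice table themselves, here is a minimal sketch using the third-party Python package Faker; the library choice and the category lists (e.g. for married_status and religion) are illustrative assumptions, not how this dataset was actually built.

```python
import random
import pandas as pd
from faker import Faker

fake = Faker()

def make_row():
    return {
        "phone_number": fake.phone_number(),
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "age": random.randint(18, 90),
        "sex": random.choice(["Male", "Female"]),
        "married_status": random.choice(["Single", "Married", "Divorced"]),
        "job": fake.job(),
        "income": round(random.uniform(10_000, 150_000), 2),
        # Placeholder categories; the real dataset's values are not documented here.
        "religion": random.choice(["Religion A", "Religion B", "Religion C"]),
        "nationality": fake.country(),
    }

df = pd.DataFrame([make_row() for _ in range(1_000)])
print(df.head())
```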

2. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. It contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
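A rough Python illustration of that two-stage draw follows; the official sample was produced by the R script distributed with the dataset, and the column names used here (ea_id, geo_1, urban) are assumptions rather than the actual schema.

```python
import numpy as np
import pandas as pd

HH_PER_EA = 25
TOTAL_HH = 8_000
N_EAS = TOTAL_HH // HH_PER_EA  # 320 enumeration areas in total

def draw_sample(households: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    """Two-stage draw; assumes columns geo_1, urban, ea_id exist and every
    stratum/EA is large enough (rounding of the allocation is simplified)."""
    rng = np.random.default_rng(seed)
    # Allocate EAs to strata proportional to stratum size (geo_1 x urban/rural).
    strata = households.groupby(["geo_1", "urban"]).size()
    alloc = (strata / strata.sum() * N_EAS).round().astype(int)
    picked = []
    for (geo, urb), n_ea in alloc.items():
        stratum = households[(households.geo_1 == geo) & (households.urban == urb)]
        # Stage 1: randomly select n_ea enumeration areas within the stratum.
        eas = rng.choice(stratum.ea_id.unique(), size=n_ea, replace=False)
        for ea in eas:
            # Stage 2: randomly select 25 households within each selected EA.
            hh = stratum[stratum.ea_id == ea]
            picked.append(hh.sample(n=HH_PER_EA,
                                    random_state=int(rng.integers(1 << 31))))
    return pd.concat(picked, ignore_index=True)
```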

    Mode of data collection

    other

    Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  3. Sample Leads Dataset

    • kaggle.com
    zip
    Updated Jun 24, 2022
    Cite
    ThatSean (2022). Sample Leads Dataset [Dataset]. https://www.kaggle.com/datasets/thatsean/sample-leads-dataset
    Explore at:
zip (22640 bytes)
    Dataset updated
    Jun 24, 2022
    Authors
    ThatSean
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

This dataset is based on the Sample Leads Dataset and is intended to allow simple filtering by lead source. I modified it to support an upcoming Towards Data Science article walking through the process; the link will be shared once published.
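As an illustration of the filtering the dataset is meant for, a minimal pandas sketch follows; the file and column names (sample_leads.csv, Lead Source) are assumptions about the schema, not confirmed by the listing.

```python
import pandas as pd

leads = pd.read_csv("sample_leads.csv")  # hypothetical file name
# Assumed column name and value; adjust to the actual schema.
organic = leads[leads["Lead Source"] == "Organic Search"]
by_source = leads.groupby("Lead Source").size().sort_values(ascending=False)
print(by_source)
```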

  4. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
    Description

These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file, "Simulated_Dataset.RData".

Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract
We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

"CWVS_LMC.txt": This code is delivered as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the code can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.

"Results_Summary.txt": This code is also delivered as a .txt file containing R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages
• For running "CWVS_LMC.txt":
• msm: Sampling from the truncated normal distribution
• mnormt: Sampling from the multivariate normal distribution
• BayesLogit: Sampling from the Polya-Gamma distribution
• For running "Results_Summary.txt":
• plotrix: Plotting the posterior means and credible intervals

Instructions for Use
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the simulated datasets generated under setting E4 of the presented simulation study.

How to use the information:
• Load the "Simulated_Dataset.RData" workspace
• Run the code contained in "CWVS_LMC.txt"
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
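The intended workflow is the R code described above. As a convenience for Python users, the hedged sketch below inspects the distributed workspace with the third-party pyreadr package; pyreadr handles data frames and some, but not all, R object types, so the non-tabular objects may not load.

```python
import pyreadr  # third-party; reads RData/Rds files into pandas objects

result = pyreadr.read_r("Simulated_Dataset.RData")  # file name from the metadata
print(list(result.keys()))  # expected objects: y, x, z, n, m, p, alpha_true
for name, obj in result.items():
    print(name, getattr(obj, "shape", type(obj)))
```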

5. Sample Dataset - HR Subject Areas

    • data.niaid.nih.gov
    Updated Jan 18, 2023
    Cite
    Weber, Marc (2023). Sample Dataset - HR Subject Areas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7447111
    Explore at:
    Dataset updated
    Jan 18, 2023
    Authors
    Weber, Marc
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".

    Lucerne University of Applied Sciences and Arts

    Master of Science in Applied Information and Data Science (MScIDS)

    Autumn Semester 2022

    Change log Version 1.1:

    The following SQL scripts were added:

Index  Type              Name
1      View              pg.dictionary_table
2      View              pg.dictionary_column
3      View              pg.dictionary_relation
4      View              pg.accesslayer_table
5      View              pg.accesslayer_column
6      View              pg.accesslayer_relation
7      View              pg.accesslayer_fact_candidate
8      Stored Procedure  pg.get_fact_candidate
9      Stored Procedure  pg.get_dimension_candidate
10     Stored Procedure  pg.get_columns

Scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. Scripts for the data warehouse objects of the sample data warehouse are restricted and cannot be shared.
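As a sketch of how the added dictionary views might be queried from outside SQL Server, the snippet below uses Python's pyodbc; the connection string is a placeholder and only the view name is taken from the table above.

```python
import pyodbc

# Placeholder connection details; adjust server, database, and driver.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=dwh;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM pg.dictionary_table;")  # view from the change log
for row in cursor.fetchall():
    print(row)
conn.close()
```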

6. Power BI Sample Dataset

    • cubig.ai
    zip
    Updated May 29, 2025
    Cite
    CUBIG (2025). Power BI Sample Dataset [Dataset]. https://cubig.ai/store/products/389/power-bi-sample-dataset
    Explore at:
zip
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    CUBIG
    License

https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

1) Data Introduction
• The Power BI Sample Data is a financial sample dataset provided for Power BI practice and data-visualization exercises. It includes a variety of financial metrics and transaction information, including sales, profits, and expenses.

2) Data Utilization
(1) Power BI Sample Data has characteristics that:
• The dataset consists of numerical and categorical variables such as transaction date, region, product category, sales, profit, and cost, optimized for aggregation, analysis, and visualization.
(2) Power BI Sample Data can be used to:
• Sales and Profit Analysis: Analyze sales and profit data by region, product, and period to understand business performance and trends.
• Power BI Dashboard Practice: Use the financial metrics and transaction data to design and practice dashboards, reports, and visualization charts directly in Power BI.
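The same region/product/period aggregations can be prototyped outside Power BI; the pandas sketch below assumes illustrative column names (Date, Region, Product, Sales, Profit) that may not match the file exactly.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual schema.
df = pd.read_csv("financial_sample.csv", parse_dates=["Date"])
monthly = (
    df.assign(Month=df["Date"].dt.to_period("M"))
      .groupby(["Month", "Region", "Product"], as_index=False)[["Sales", "Profit"]]
      .sum()
)
monthly["Margin"] = monthly["Profit"] / monthly["Sales"]  # profit margin
print(monthly.head())
```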

  7. SISTER: Experimental Workflows, Product Generation Environment, and Sample...

    • catalog.data.gov
    • s.cnmilf.com
• +2 more
    Updated Sep 18, 2025
    + more versions
    Cite
    ORNL_DAAC (2025). SISTER: Experimental Workflows, Product Generation Environment, and Sample Data, V004 [Dataset]. https://catalog.data.gov/dataset/sister-experimental-workflows-product-generation-environment-and-sample-data-v004-4dcd0
    Explore at:
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Oak Ridge National Laboratory Distributed Active Archive Center
    Description

    The Space-based Imaging Spectroscopy and Thermal pathfindER (SISTER) activity originated in support of the NASA Earth System Observatory's Surface Biology and Geology (SBG) mission to develop prototype workflows with community algorithms and generate prototype data products envisioned for SBG. SISTER focused on developing a data system that is open, portable, scalable, standards-compliant, and reproducible. This collection contains EXPERIMENTAL workflows and sample data products, including (a) the Common Workflow Language (CWL) process file and a Jupyter Notebook that run the entire SISTER workflow capable of generating experimental sample data products spanning terrestrial ecosystems, inland and coastal aquatic ecosystems, and snow, (b) the archived algorithm steps (as OGC Application Packages) used to generate products at each step of the workflow, (c) a small number of experimental sample data products produced by the workflow which are based on the Airborne Visible/Infrared Imaging Spectrometer-Classic (AVIRIS or AVIRIS-CL) instrument, and (d) instructions for reproducing the sample products included in this dataset. DISCLAIMER: This collection contains experimental workflows, experimental community algorithms, and experimental sample data products to demonstrate the capabilities of an end-to-end processing system. The experimental sample data products provided have not been fully validated and are not intended for scientific use. The community algorithms provided are placeholders which can be replaced by any user's algorithms for their own science and application interests. These algorithms should not in any capacity be considered the algorithms that will be implemented in the upcoming Surface Biology and Geology mission.

  8. Synthetic Market Mix Modelling Dummy Dataset

    • kaggle.com
    zip
    Updated Dec 10, 2023
    Cite
    Farid Hamid (2023). Synthetic Market Mix Modelling Dummy Dataset [Dataset]. https://www.kaggle.com/datasets/anonymousfrog95/synthetic-market-mix-modelling-dummy-dataset
    Explore at:
zip (49821 bytes)
    Dataset updated
    Dec 10, 2023
    Authors
    Farid Hamid
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Farid Hamid

    Released under Apache 2.0


9. coldocs-fin-sample

    • huggingface.co
    Updated Oct 30, 2024
    Cite
    Nirant Kasliwal (2024). coldocs-fin-sample [Dataset]. https://huggingface.co/datasets/nirantk/coldocs-fin-sample
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 30, 2024
    Authors
    Nirant Kasliwal
    Description

    Dataset Card for nirantk/coldocs-fin-sample

      Dataset Description
    

    This dataset contains images converted from PDFs using the PDFs to Page Images Converter Space.

Number of images: 200
Number of PDFs processed: 1
Sample size per PDF: 100
Created on: 2024-10-30 01:19:36

Dataset Creation

Source Data

    The images in this dataset were generated from user-uploaded PDF files.

      Processing Steps
    

    PDF files were uploaded to the PDFs to Page Images… See the full description on the dataset page: https://huggingface.co/datasets/nirantk/coldocs-fin-sample.
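The conversion step described above can be approximated locally. The sketch below uses the third-party pdf2image package (which requires the poppler utilities), not the Hugging Face Space that actually produced this dataset.

```python
from pdf2image import convert_from_path  # requires poppler to be installed

pages = convert_from_path("document.pdf", dpi=150)  # hypothetical input file
for i, page in enumerate(pages[:100]):  # the listing's sample size per PDF
    page.save(f"page_{i:03d}.png", "PNG")
```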

  10. Sample Dataset for DataFrame Styling

    • kaggle.com
    zip
    Updated Jun 11, 2022
    Cite
    Leonie (2022). Sample Dataset for DataFrame Styling [Dataset]. https://www.kaggle.com/datasets/iamleonie/sample-dataset-for-dataframe-styling
    Explore at:
zip (257 bytes)
    Dataset updated
    Jun 11, 2022
    Authors
    Leonie
    Description

    Dataset

    This dataset was created by Leonie

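Since the dataset exists to demonstrate DataFrame styling, a short sketch of the pandas Styler API follows; the file name and columns are illustrative assumptions, not the dataset's actual contents.

```python
import pandas as pd

df = pd.read_csv("sample_dataset.csv")         # hypothetical file name
num_cols = df.select_dtypes("number").columns  # style numeric columns only

styled = (
    df.style
      .background_gradient(cmap="Blues", subset=num_cols)  # shade cells by value
      .highlight_max(color="lightgreen", subset=num_cols)  # flag column maxima
      .format(precision=2)                                 # round display only
)
styled.to_html("styled.html")  # render to a file; displays inline in a notebook
```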

11. x_fake_profile_detection

    • huggingface.co
    Updated May 2, 2025
    Cite
    Weronika Dracewicz (2025). x_fake_profile_detection [Dataset]. https://huggingface.co/datasets/drveronika/x_fake_profile_detection
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 2, 2025
    Authors
    Weronika Dracewicz
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset: Detecting Fake Accounts on Social Media Portals—The X Portal Case Study

    This dataset was created as part of the study focused on detecting fake accounts on the X Portal (formerly known as Twitter). The primary aim of the study was to classify social media accounts using image data and machine learning techniques, offering a novel approach to identifying fake accounts. The dataset includes generated accounts, which were used to train and test a Convolutional Neural Network… See the full description on the dataset page: https://huggingface.co/datasets/drveronika/x_fake_profile_detection.
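As a rough illustration of the image-classification setup the study describes, below is a minimal PyTorch CNN for a two-class (real/fake) decision; the architecture and 224x224 input size are assumptions, not the authors' published model.

```python
import torch
import torch.nn as nn

class ProfileCNN(nn.Module):
    """Tiny two-class CNN; assumes 3x224x224 inputs."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 2),  # 224 -> 112 -> 56 after two pools
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = ProfileCNN()
logits = model(torch.randn(1, 3, 224, 224))  # one dummy RGB account image
print(logits.shape)  # torch.Size([1, 2]) -> real vs. fake scores
```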

  12. Sample Purchasing / Supply Chain Data

    • catalog.data.gov
    • gimi9.com
• +2 more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Sample Purchasing / Supply Chain Data [Dataset]. https://catalog.data.gov/dataset/sample-purchasing-supply-chain-data
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Sample purchasing data containing information on suppliers, the products they provide, and the projects those products are used for. Data created or adapted from publicly available sources.

  13. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
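The contrast the authors draw between plain K-fold CV and nested CV can be reproduced in a few lines of scikit-learn; the sketch below is illustrative (synthetic data, linear SVM), not the paper's exact protocol.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Small, high-dimensional sample, mimicking the regime the paper studies.
X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           random_state=0)
param_grid = {"C": [0.1, 1, 10]}
inner = KFold(5, shuffle=True, random_state=0)
outer = KFold(5, shuffle=True, random_state=1)

# Non-nested: the same folds both tune C and report accuracy (optimistic).
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=inner).fit(X, y)
print("non-nested CV accuracy:", search.best_score_)

# Nested: tuning happens inside each outer training fold (near-unbiased).
nested = cross_val_score(
    GridSearchCV(SVC(kernel="linear"), param_grid, cv=inner), X, y, cv=outer)
print("nested CV accuracy:", nested.mean())
```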

14. generated-usa-passeports-dataset

    • huggingface.co
    Updated Jul 15, 2023
    Cite
    Unique Data (2023). generated-usa-passeports-dataset [Dataset]. https://huggingface.co/datasets/UniqueData/generated-usa-passeports-dataset
    Explore at:
    Dataset updated
    Jul 15, 2023
    Authors
    Unique Data
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization. Data augmentation techniques apply various transformations to existing data samples to create new ones, including random rotations, translations, scaling, flips, and more. Augmentation helps increase the dataset size, introduces natural variations, and improves model performance by making it more invariant to specific transformations.

The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth, etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
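A minimal sketch of the augmentations named above (random rotations, translations, scaling, flips), using torchvision; this is an illustration, not the pipeline used to build this dataset.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05),
                            scale=(0.9, 1.1)),  # rotate / translate / scale
    transforms.RandomHorizontalFlip(p=0.5),     # flips
    transforms.ToTensor(),
])
# Usage: tensor = augment(pil_image), where pil_image is one passport scan.
```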

15. Data and code for: Generation and applications of simulated datasets to integrate social network and demographic analyses

    • data.niaid.nih.gov
    • datadryad.org
• +1 more
    zip
    Updated Mar 10, 2023
    Cite
    Matthew Silk; Olivier Gimenez (2023). Data and code for: Generation and applications of simulated datasets to integrate social network and demographic analyses [Dataset]. http://doi.org/10.5061/dryad.m0cfxpp7s
    Explore at:
zip
    Dataset updated
    Mar 10, 2023
    Dataset provided by
    Centre d'Écologie Fonctionnelle et Évolutive
    Authors
    Matthew Silk; Olivier Gimenez
    License

https://spdx.org/licenses/CC0-1.0.html

    Description

Social networks are tied to population dynamics; interactions are driven by population density and demographic structure, while social relationships can be key determinants of survival and reproductive success. However, difficulties integrating models used in demography and network analysis have limited research at this interface. We introduce the R package genNetDem for simulating integrated network-demographic datasets. It can be used to create longitudinal social networks and/or capture-recapture datasets with known properties. It incorporates the ability to generate populations and their social networks, generate grouping events using these networks, simulate social network effects on individual survival, and flexibly sample these longitudinal datasets of social associations. By generating co-capture data with known statistical relationships it provides functionality for methodological research. We demonstrate its use with case studies testing how imputation and sampling design influence the success of adding network traits to conventional Cormack-Jolly-Seber (CJS) models. We show that incorporating social network effects in CJS models generates qualitatively accurate results, but with downward-biased parameter estimates when network position influences survival. Biases are greater when fewer interactions are sampled or fewer individuals are observed in each interaction. While our results indicate the potential of incorporating social effects within demographic models, they show that imputing missing network measures alone is insufficient to accurately estimate social effects on survival, pointing to the importance of incorporating network imputation approaches. genNetDem provides a flexible tool to aid these methodological advancements and help researchers test other sampling considerations in social network studies.

Methods

The dataset and code stored here are for Case Studies 1 and 2 in the paper. Datasets were generated using simulations in R. Here we provide: 1) the R code used for the simulations; 2) the simulation outputs (as .RDS files); and 3) the R code to analyse simulation outputs and generate the tables and figures in the paper.

16. FAD: A Chinese Dataset for Fake Audio Detection

    • data.niaid.nih.gov
    • dataon.kisti.re.kr
• +1 more
    Updated Jul 9, 2023
    Cite
    Haoxin Ma; Jiangyan Yi (2023). FAD: A Chinese Dataset for Fake Audio Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6623226
    Explore at:
    Dataset updated
    Jul 9, 2023
    Dataset provided by
    Institute of Automation, Chinese Academy of Sciences
    Authors
    Haoxin Ma; Jiangyan Yi
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Fake audio detection is a growing concern, and some relevant datasets have been designed for research. But there is no standard public Chinese dataset under additive noise conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for noise addition at five different signal-to-noise ratios (SNRs). The FAD dataset can be used not only for fake audio detection, but also for detecting the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods that generalize well remain challenging. The FAD dataset is publicly available. The source code of the baselines is available on GitHub: https://github.com/ADDchallenge/FAD

The FAD dataset is designed to evaluate methods for fake audio detection and fake algorithm recognition, and other relevant studies. To better study the robustness of the methods under noisy conditions when applied in real life, we construct a corresponding noisy dataset. The total FAD dataset consists of two versions: a clean version and a noisy version. Both versions are divided into disjoint training, development and test sets in the same way. There is no speaker overlap across these three subsets. Each test set is further divided into seen and unseen test sets. Unseen test sets can evaluate the generalization of the methods to unknown types. It is worth mentioning that both the real audio and the fake audio in the unseen test set are unknown to the model. For the noisy speech part, we select three noise databases for simulation. Additive noises are added to each audio clip in the clean dataset at 5 different SNRs. The additive noises of the unseen test set and the remaining subsets come from different noise databases. In each version of the FAD dataset, there are 138,400 utterances in the training set, 14,400 in the development set, 42,000 in the seen test set, and 21,000 in the unseen test set. More detailed statistics are given in Table 2.

Clean Real Audios Collection
To eliminate the interference of irrelevant factors, we collect clean real audio from two sources: 5 open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.

Clean Fake Audios Generation
We select 11 representative speech synthesis methods to generate the fake audio, plus one method that produces partially fake audio.

Noisy Audios Simulation
Noisy audio aims to quantify the robustness of the methods under noisy conditions. To simulate real-life scenarios, we artificially sample noise signals and add them to the clean audio at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. Additive noises are selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.
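Mixing noise at a fixed SNR reduces to scaling the noise power against the speech power; the NumPy sketch below shows the standard calculation and is an illustration, not the dataset's actual generation script.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)  # loop/crop the noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. one noisy copy per SNR used in FAD: 0, 5, 10, 15, 20 dB
```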

This dataset is licensed under a CC BY-NC-ND 4.0 license. You can cite the data using the following BibTeX entry:

@inproceedings{ma2022fad,
  title={FAD: A Chinese Dataset for Fake Audio Detection},
  author={Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Le Xu, Ruibo Fu},
  booktitle={Submitted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
  year={2022},
}

17. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Data
    Authors
    Fabiano Dalpiaz; Fabiano Dalpiaz
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A collection of 22 data sets of 50+ requirements each, expressed as user stories.

The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub repository and underpins a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

  18. Dataset for: Sample size estimation for case-crossover studies

    • wiley.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Sai Dharmarajan; Joo Yeon Lee; Rima Izem (2023). Dataset for: Sample size estimation for case-crossover studies [Dataset]. http://doi.org/10.6084/m9.figshare.7228559.v1
    Explore at:
docx
    Dataset updated
    May 31, 2023
    Dataset provided by
Wiley (https://www.wiley.com/)
    Authors
    Sai Dharmarajan; Joo Yeon Lee; Rima Izem
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

Case-crossover study designs are observational studies used to assess post-market safety of medical products (e.g. vaccines or drugs). As a case-crossover study is self-controlled, its advantages include better control for confounding, because the design controls for any time-invariant measured and unmeasured confounding, and potentially greater feasibility, as only data from those experiencing an event (the cases) are required. However, self-matching also introduces correlation between case and control periods within a subject or matched unit. To estimate sample size in a case-crossover study, investigators currently use Dupont's formula (Biometrics 1988; 43:1157-1168), which was originally developed for a matched case-control study. This formula is relevant as it takes into account correlation in exposure between controls and cases, which is expected to be high in self-controlled studies. However, in our study, we show that Dupont's formula and other currently used methods to determine sample size for case-crossover studies may be inadequate. Specifically, these formulae tend to underestimate the true required sample size, determined through simulations, for a range of values in the parameter space. We present mathematical derivations to explain where some currently used methods fail and propose two new sample size estimation methods that provide a more accurate estimate of the true required sample size.
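The abstract's simulation-determined "true required sample size" can be approximated with a small Monte Carlo loop. The sketch below simulates correlated binary exposures in the case and control windows of a 1:1 case-crossover design, applies an exact McNemar test to the discordant pairs, and grows n until a target power is reached. All parameter values are illustrative, and this is not the authors' proposed estimator.

```python
import numpy as np
from scipy.stats import binom

def power(n, p_case, p_ctrl, rho, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo power of an exact McNemar test on n matched pairs."""
    rng = np.random.default_rng(seed)
    # Joint exposure distribution for (case window, control window) with the
    # requested marginals and within-subject correlation rho. Assumes the
    # parameters yield a valid joint distribution (all four cells >= 0).
    p11 = p_case * p_ctrl + rho * np.sqrt(
        p_case * (1 - p_case) * p_ctrl * (1 - p_ctrl))
    p10, p01 = p_case - p11, p_ctrl - p11
    p00 = 1.0 - p11 - p10 - p01
    hits = 0
    for _ in range(reps):
        cells = rng.multinomial(n, [p11, p10, p01, p00])
        b, c = cells[1], cells[2]  # discordant pairs drive the analysis
        if b + c == 0:
            continue
        # Under H0, b ~ Binomial(b + c, 0.5); exact two-sided p-value.
        p = 2 * min(binom.cdf(b, b + c, 0.5), binom.sf(b - 1, b + c, 0.5))
        hits += min(p, 1.0) < alpha
    return hits / reps

n = 100
while power(n, p_case=0.25, p_ctrl=0.20, rho=0.4) < 0.80:
    n += 50  # grow the sample until the target power is reached
print("estimated required sample size:", n)
```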

  19. Mock Nonprofit Fundraising Data

    • kaggle.com
    zip
    Updated Mar 2, 2025
    Cite
    Grant Stancliff (2025). Mock Nonprofit Fundraising Data [Dataset]. https://www.kaggle.com/datasets/grantstancliff/mock-nonprofit-fundraising-data
    Explore at:
zip (1188885 bytes)
    Dataset updated
    Mar 2, 2025
    Authors
    Grant Stancliff
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Grant Stancliff

    Released under CC0: Public Domain


  20. Project Portfolio Dataset

    • figshare.com
    Updated Oct 30, 2021
    Cite
    Brett Thiele; Michael Ryan; Alireza Abbasi (2021). Project Portfolio Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12998822.v3
    Explore at:
application/vnd.ms-office
    Dataset updated
    Oct 30, 2021
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Brett Thiele; Michael Ryan; Alireza Abbasi
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset comprises a portfolio of six projects, each having a PMB (performance measurement baseline) consisting of a budget derived from a standardised first-principles, bottom-up estimation technique utilising a homogeneous set of resources, both consumable and non-consumable, which are inter-related in a highly complex, multi-dimensional manner with appropriate correlation between quantity, productivity rates and cost rates. The dataset also includes detailed time-phased actual costs and progress over the life of each project, as well as the time-phased values of revenue claimed for each project. The collection and attribution of 12,139 actual costs and the measurement of progress over a period of just under 109 consecutive weeks is consistent and standardised across all six projects, providing a unique differentiator to other datasets.

The data was collected over a continuous 109-week period and is valid for research in portfolio, program or project management, or control of multiple resource-constrained projects, where projects are managed collectively, generally in a standardised manner to support a strategic business aim. The latest revision of the data includes updated information for all eight projects in the form of individual spreadsheets for each project. The updates include:

• All new spreadsheets have the prefix EDM. All original datasets remain unchanged.

• Earned Duration information has been added to each project. The ability to create an EDM baseline and record actual EDM data or progress is simplified using the spreadsheets. Tasks, start and completion dates are simply copied into the spreadsheet, and the data for the EDM baseline, TPD and TED s-curves is created. The spreadsheets also include various mechanisms for recording or estimating work progress.

• Actual start and finish dates have been added to each task for each project. The previous dataset only recorded the progress of each task at the completion of a reporting month. This update provides greater granularity for start and finish dates for each task.
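For readers unfamiliar with the EDM quantities mentioned above, here is a hedged sketch that computes TPD and TED s-curves from baseline task dates and reported progress, and reads off ED(t) by inverting the TPD curve; the linear-progress assumption and function shapes are illustrative, not the spreadsheets' exact mechanism.

```python
import numpy as np

def planned_fraction(t, start, finish):
    # Assumed linear planned progress between baseline start and finish.
    return np.clip((t - start) / np.maximum(finish - start, 1e-9), 0.0, 1.0)

def edm_curves(t_grid, starts, finishes, progress_at):
    """TPD/TED s-curves and ED(t) for tasks with baseline dates (in weeks).

    progress_at(t) must return each task's actual fraction complete at t.
    """
    dur = finishes - starts                       # planned task durations
    tpd = np.array([(dur * planned_fraction(t, starts, finishes)).sum()
                    for t in t_grid])             # Total Planned Duration
    ted = np.array([(dur * progress_at(t)).sum()  # Total Earned Duration
                    for t in t_grid])
    # ED(t): baseline time at which planned duration equals earned duration.
    ed = np.interp(ted, tpd, t_grid)
    return tpd, ted, ed
```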
