100+ datasets found
  1. Fake News data set

    • kaggle.com
    zip
    Updated Dec 17, 2021
    Cite
    Bjørn-Jostein (2021). Fake News data set [Dataset]. https://www.kaggle.com/datasets/bjoernjostein/fake-news-data-set
    Available download formats: zip (56446259 bytes)
    Dataset updated
    Dec 17, 2021
    Authors
    Bjørn-Jostein
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Today, we are producing more information than ever before, but not all of it is true; some is actually malicious and harmful, and that makes it harder for us to trust any piece of information we come across. On top of that, bad actors can now use language modelling tools like OpenAI's GPT-2 to generate fake news too. Ever since its initial release, there have been discussions of how it could be misused to generate misleading news articles, automate the production of abusive or fake content for social media, and automate the creation of spam and phishing content.

    How do we figure out what is true and what is fake? Can we do something about it?

    Content

    The dataset consists of around 387,000 pieces of text, sourced from various news articles on the web as well as texts generated by OpenAI's GPT-2 language model.

    The dataset is split into train, validation and test such that each of the sets has an equal split of the two classes.
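
    As a starting point for the classification task, a minimal baseline might look like the sketch below: TF-IDF features with logistic regression. The file names ("train.csv", "valid.csv") and column names ("text", "label") are assumptions, so check the actual Kaggle files before running.

      # Hedged baseline sketch: TF-IDF bag-of-words + logistic regression.
      # File and column names are assumptions about the download's layout.
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score

      train = pd.read_csv("train.csv")   # hypothetical file name
      valid = pd.read_csv("valid.csv")   # hypothetical file name

      vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
      X_train = vec.fit_transform(train["text"])
      X_valid = vec.transform(valid["text"])

      clf = LogisticRegression(max_iter=1000)
      clf.fit(X_train, train["label"])
      print("validation accuracy:",
            accuracy_score(valid["label"], clf.predict(X_valid)))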

    Acknowledgements

    This dataset was published on AIcrowd as part of a KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with the world of AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.

  2. Simulated Radar Waveform and RF Dataset Generator for Incumbent Signals in the 3.5 GHz CBRS Band

    • data.nist.gov
    • datasets.ai
    • +2more
    Updated May 7, 2020
    + more versions
    Cite
    National Institute of Standards and Technology (2020). Simulated Radar Waveform and RF Dataset Generator for Incumbent Signals in the 3.5 GHz CBRS Band [Dataset]. http://doi.org/10.18434/M32229
    Dataset updated
    May 7, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    This software tool generates simulated radar signals and creates RF datasets. The datasets can be used to develop and test detection algorithms that apply machine learning/deep learning techniques to the 3.5 GHz Citizens Broadband Radio Service (CBRS) or similar bands. In these bands, the primary users are federal incumbent radar systems. The tool generates radar waveforms and randomizes the waveform parameters. The pulse modulation types for the radar signals and their parameters are selected based on NTIA testing procedures for ESC certification, available at http://www.its.bldrdoc.gov/publications/3184.aspx. Furthermore, the tool mixes the waveforms with interference and packages them into one RF dataset file. A graphical user interface (GUI) simplifies the selection of parameters and the mixing process. A reference RF dataset generated with this software is published at https://doi.org/10.18434/M32116.

  3. Data from: Delta Neighborhood Physical Activity Study

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Jun 5, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Delta Neighborhood Physical Activity Study [Dataset]. https://catalog.data.gov/dataset/delta-neighborhood-physical-activity-study-f82d7
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    The Delta Neighborhood Physical Activity Study was an observational study designed to assess characteristics of neighborhood built environments associated with physical activity. It was an ancillary study to the Delta Healthy Sprouts Project and therefore included towns and neighborhoods in which Delta Healthy Sprouts participants resided. The 12 towns were located in the Lower Mississippi Delta region of Mississippi. Data were collected via electronic surveys between August 2016 and September 2017 using the Rural Active Living Assessment (RALA) tools and the Community Park Audit Tool (CPAT). Scale scores for the RALA Programs and Policies Assessment and the Town-Wide Assessment were computed using the scoring algorithms provided for these tools via SAS software programming. The Street Segment Assessment and CPAT do not have associated scoring algorithms, so no scores are provided for them. Because the towns were not randomly selected and the sample size is small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi. Dataset one contains data collected with the RALA Programs and Policies Assessment (PPA) tool. Dataset two contains data collected with the RALA Town-Wide Assessment (TWA) tool. Dataset three contains data collected with the RALA Street Segment Assessment (SSA) tool. Dataset four contains data collected with the Community Park Audit Tool (CPAT). [Note: title changed 9/4/2020 to reflect study name]

    Resources in this dataset:

    • Dataset One RALA PPA Data Dictionary (RALA PPA Data Dictionary.csv): data dictionary for dataset one, collected using the RALA PPA tool.
    • Dataset Two RALA TWA Data Dictionary (RALA TWA Data Dictionary.csv): data dictionary for dataset two, collected using the RALA TWA tool.
    • Dataset Three RALA SSA Data Dictionary (RALA SSA Data Dictionary.csv): data dictionary for dataset three, collected using the RALA SSA tool.
    • Dataset Four CPAT Data Dictionary (CPAT Data Dictionary.csv): data dictionary for dataset four, collected using the CPAT.
    • Dataset One RALA PPA (RALA PPA Data.csv): data collected using the RALA PPA tool.
    • Dataset Two RALA TWA (RALA TWA Data.csv): data collected using the RALA TWA tool.
    • Dataset Three RALA SSA (RALA SSA Data.csv): data collected using the RALA SSA tool.
    • Dataset Four CPAT (CPAT Data.csv): data collected using the CPAT.
    • Data Dictionary (DataDictionary_RALA_PPA_SSA_TWA_CPAT.csv): combined data dictionary covering all four dataset files in this set.

    Recommended software for all resources: Microsoft Excel (https://products.office.com/en-us/excel).

  4. NaKnowBase Interoperability Tools

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Sep 17, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). NaKnowBase Interoperability Tools [Dataset]. https://catalog.data.gov/dataset/naknowbase-interoperability-tools
    Dataset updated
    Sep 17, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset is associated with the manuscript "Translating nanoEHS data using EPA NaKnowBase and the Resource Description Framework" by Mortensen H, Williams A, Beach B, Slaughter W, Senn J and Boyes W, submitted 8/3/2023 to F1000:Nanotoxicology. The dataset includes an RDF mapping of EPA NaKnowBase (NKB), the OntoSearcher code used to produce the NKB RDF file, as well as training materials and example files for the user. Portions of this dataset are inaccessible because they include partner data and old code that has been modified since 2021. The accessible portion can be obtained as OntoSearcher_Training_Materials.zip. Format: the file "OntoSearcher_Training_Materials.zip" includes updated materials as of 07/11/23: the OntoSearcher tool materials, a sample NKB dataset, corresponding training documentation on how to run the tool with the sample dataset and apply it to the user's own data, and the current RDF mapping of the NKB (NKB_RDF_V3.ttl).
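
    For a first look at the Turtle file named above, rdflib can load it; this is a minimal sketch, assuming only that NKB_RDF_V3.ttl is valid Turtle (the graph's actual vocabulary is not documented here).

      # Load the NKB RDF mapping and list the predicates it uses.
      from rdflib import Graph

      g = Graph()
      g.parse("NKB_RDF_V3.ttl", format="turtle")
      print(len(g), "triples loaded")

      # The distinct predicates give a first view of the vocabulary.
      for pred in sorted({str(p) for _, p, _ in g}):
          print(pred)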

  5. Dataset of development of business during the COVID-19 crisis

    • data.mendeley.com
    • narcis.nl
    Updated Nov 9, 2020
    Cite
    Tatiana N. Litvinova (2020). Dataset of development of business during the COVID-19 crisis [Dataset]. http://doi.org/10.17632/9vvrd34f8t.1
    Dataset updated
    Nov 9, 2020
    Authors
    Tatiana N. Litvinova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To create the dataset, the top 10 countries leading in COVID-19 incidence worldwide were selected as of October 22, 2020 (on the eve of the second wave of the pandemic), as represented in the Global 500 ranking for 2020: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, up to 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. Arithmetic averages were calculated, along with the change (growth) in indicators such as enterprise profitability, ranking position (competitiveness), asset value and number of employees. The arithmetic mean values of these indicators across all countries in the sample were found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data are collected in a single Microsoft Excel table. The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics, and it is flexible: it can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the dataset's cells contain formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data but also charts that provide data visualization. It contains both actual and forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented as a normal distribution of predicted values and the probability of their occurrence in practice. This enables broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship, by substituting various predicted morbidity and mortality rates into the risk assessment tables and obtaining automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and after the second wave of the pandemic to check the reliability of the earlier forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted values of the studied indicators, but also their qualitative interpretation, reflecting the presence and level of risks of the pandemic and COVID-19 crisis for international entrepreneurship.

  6. Synthetic (fake) youth mental health datasets and data dictionaries

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Mar 6, 2024
    Cite
    Matthew P Hamilton (2024). Synthetic (fake) youth mental health datasets and data dictionaries [Dataset]. http://doi.org/10.7910/DVN/HJXYKQ
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Matthew P Hamilton
    Description

    The datasets in this collection are entirely fake. They were developed principally to demonstrate the workings of a number of utility scoring and mapping algorithms. However, they may be of more general use to others. In some limited cases, some of the included files could be used in exploratory simulation-based analyses. However, you should read the metadata descriptors for each file to inform yourself of the validity and limitations of each fake dataset. To open the RDS format files included in this dataset, the R package ready4use needs to be installed (see https://ready4-dev.github.io/ready4use/). It is also recommended that you install the youthvars package (https://ready4-dev.github.io/youthvars/), as it provides useful tools for inspecting and validating each dataset.

  7. Data Science Interview Q&A Treasury

    • kaggle.com
    zip
    Updated Feb 26, 2024
    Cite
    Orcun (2024). Data Science Interview Q&A Treasury [Dataset]. https://www.kaggle.com/datasets/memocan/data-science-interview-q-and-a-treasury
    Available download formats: zip (24538 bytes)
    Dataset updated
    Feb 26, 2024
    Authors
    Orcun
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The "Ultimate Data Science Interview Q&A Treasury" dataset is a meticulously curated collection designed to empower aspiring data scientists with the knowledge and insights needed to excel in the competitive field of data science. Whether you're a beginner seeking to build your foundations or an experienced professional aiming to brush up on the latest trends, this treasury serves as an indispensable guide. You might also want to work on the following exercises using this dataset:

    1) Keyword analysis for trending topics: frequency analysis to identify the most common keywords or terms that appear in the questions and spot trending topics or skills.
    2) Topic modeling: use algorithms like Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to group questions into topics automatically. This can reveal the underlying themes or areas of focus in data science interviews (see the sketch below).
    3) Text difficulty level analysis: implement Natural Language Processing (NLP) techniques to evaluate the complexity of questions and answers. This could help in categorizing them into beginner, intermediate, and advanced levels.
    4) Clustering for unsupervised learning: apply clustering techniques to group similar questions or answers together. This could help identify unique question patterns or common answer structures.
    5) Automated question generation: train a model to generate new interview questions based on the patterns and topics discovered in the dataset. This could be a valuable tool for creating mock interviews or study guides.
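
    A minimal sketch of exercise 2 with scikit-learn follows; the CSV file name and the "Question" column are assumptions about the dataset's layout.

      # Hedged LDA topic-modeling sketch (exercise 2 above).
      import pandas as pd
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      df = pd.read_csv("interview_questions.csv")  # hypothetical file name
      vec = CountVectorizer(stop_words="english", max_features=5000)
      X = vec.fit_transform(df["Question"])        # column name assumed

      lda = LatentDirichletAllocation(n_components=8, random_state=0)
      lda.fit(X)

      # Print the top terms per topic.
      terms = vec.get_feature_names_out()
      for k, topic in enumerate(lda.components_):
          top = [terms[i] for i in topic.argsort()[-8:][::-1]]
          print(f"topic {k}: {', '.join(top)}")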

  8. CNC Milling Process Dataset

    • kaggle.com
    zip
    Updated Nov 14, 2024
    Cite
    Grzegorz Piecuch (2024). CNC Milling Process Dataset [Dataset]. https://www.kaggle.com/datasets/grzegorzpiecuch/cnc-milling
    Available download formats: zip (26196989750 bytes)
    Dataset updated
    Nov 14, 2024
    Authors
    Grzegorz Piecuch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data set contains raw data (Pxxx_Fyy_Czz.csv files) and processed data (a file with designated features - FeatureAndMetadata_Milling.csv) from the full life cycle of 14 cutting tools used in the milling process. The tools performed 968 milling cycles. The data contain vibration signals (8 measuring channels from the spindle and work table) and current signals (12 measuring channels from the spindle and work table).

    A metadata file is also available, in which each cycle is assigned process data (e.g. tool number, sample number, sample hardness).

    The data set is useful for work on the classification of tool condition or estimation of their service life.

    It is possible to use only FeatureAndMetadata_Milling.csv and work with the calculated features, or to download all files and work with the raw data. A quick first look at the processed file is sketched below.
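
    In this sketch only the file name comes from the description; the column names are not documented here, so the first step is to discover them.

      # Hedged sketch: inspect the engineered features and metadata.
      import pandas as pd

      df = pd.read_csv("FeatureAndMetadata_Milling.csv")
      print(df.shape)             # rows = milling cycles
      print(df.columns.tolist())  # discover the actual column names
      print(df.describe())        # summary statistics for numeric columns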

    A full description is available in this Open Access article: https://www.nature.com/articles/s41597-025-04923-y

    If you reuse this dataset in your research, please cite this article.

  9. Dawnn benchmarking dataset: Simulated linear trajectories processing and label simulation

    • rdr.ucl.ac.uk
    application/gzip
    Updated May 4, 2023
    + more versions
    Cite
    George Hall; Sergi Castellano Hereza (2023). Dawnn benchmarking dataset: Simulated linear trajectories processing and label simulation [Dataset]. http://doi.org/10.5522/04/22616611.v1
    Available download formats: application/gzip
    Dataset updated
    May 4, 2023
    Dataset provided by
    University College London
    Authors
    George Hall; Sergi Castellano Hereza
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This project is a collection of files to allow users to reproduce the model development and benchmarking in "Dawnn: single-cell differential abundance with neural networks" (Hall and Castellano, under review). Dawnn is a tool for detecting differential abundance in single-cell RNAseq datasets. It is available as an R package here. Please contact us if you are unable to reproduce any of the analysis in our paper. The files in this collection correspond to the benchmarking dataset based on simulated linear trajectories.

    FILES: Data processing code

    • adapted_traj_sim_milo_paper.R: lightly adapted code from Dann et al. to simulate single-cell RNAseq datasets that form linear trajectories.
    • generate_test_data_linear_traj_sim_milo_paper.R: R code to assign simulated labels to datasets generated from adapted_traj_sim_milo_paper.R. Seurat objects are saved as cells_sim_linear_traj_gex_seed_*.rds; simulated labels are saved as benchmark_dataset_sim_linear_traj.csv.

    Resulting datasets

    • cells_sim_linear_traj_gex_seed_*.rds: Seurat objects generated by generate_test_data_linear_traj_sim_milo_paper.R.
    • benchmark_dataset_sim_linear_traj.csv: cell labels generated by generate_test_data_linear_traj_sim_milo_paper.R.

  10. Cynthia Data - synthetic EHR records

    • kaggle.com
    zip
    Updated Jan 24, 2025
    Cite
    Craig Calderone (2025). Cynthia Data - synthetic EHR records [Dataset]. https://www.kaggle.com/datasets/craigcynthiaai/cynthia-data-synthetic-ehr-records
    Available download formats: zip (2654924 bytes)
    Dataset updated
    Jan 24, 2025
    Authors
    Craig Calderone
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.

    Key Features:

    1. Synthetic Data: entirely artificial data created for testing and demonstration purposes.
    2. PDF Format: records are presented in PDF format, commonly used in healthcare systems.
    3. Diverse Use Cases: useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
    4. Rich Admission Details: includes admission-related data that highlights the capabilities of synthetic EHR generation.

    Potential Use Cases:

    • Demonstrating EHR-related tools or services.
    • Benchmarking data parsing models for PDF health records (see the sketch below).
    • Showcasing synthetic healthcare data in sales or marketing efforts.
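
    For the parsing use case, a minimal text-extraction sketch with pypdf follows; the PDF file name is hypothetical.

      # Hedged sketch: extract raw text from one synthetic EHR PDF.
      from pypdf import PdfReader

      reader = PdfReader("sample_ehr_1.pdf")  # hypothetical file name
      text = "\n".join(page.extract_text() or "" for page in reader.pages)
      print(text[:500])  # preview the first 500 characters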

    Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!

  11. c

    Sephora Makeup Dataset – Free Beauty Product CSV

    • crawlfeeds.com
    csv, zip
    Updated Dec 2, 2025
    Cite
    Crawl Feeds (2025). Sephora Makeup Dataset – Free Beauty Product CSV [Dataset]. https://crawlfeeds.com/datasets/sephora-sample-dataset
    Available download formats: zip, csv
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Looking for a free dataset of cosmetic products? The Sephora Makeup Products Sample Dataset provides a ready-to-use CSV of beauty product data containing 340 verified Sephora makeup product records. It includes details like product name, brand, price, ingredients, availability, user review count, and images, making it well suited for e-commerce research, market analysis, price tracking, or building machine-learning and recommendation systems for the beauty industry.

    Key Features

    • Complete Product Metadata: each record includes URL, product name, brand, price, SKU, ingredients, product description, usage instructions, review count, image links, availability status, and more.
    • Ready-to-Use CSV Format: download instantly, with no scraping or data cleaning required.
    • Ideal for Beauty-Tech & ML Projects: useful for price comparison tools, recommendation engines, product cataloging, trend analysis, and sentiment analysis based on reviews/ratings.
    • Free Sample Access: this sample comes at zero cost (USD $0.00), an excellent starting point for analysts, developers, or researchers.

    This dataset is perfect for market research, price tracking, sentiment analysis, and AI-based recommendation systems. Whether you're an e-commerce retailer, a data analyst, or a machine learning professional, this dataset provides valuable insights into the beauty industry.
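
    As one example of the price-tracking use case, a minimal pandas sketch follows; the file name and the exact CSV headers ("brand", "price") are assumptions based on the field list above.

      # Hedged sketch: average price per brand in the sample CSV.
      import pandas as pd

      df = pd.read_csv("sephora_sample.csv")  # hypothetical file name
      avg_price = (
          df.groupby("brand")["price"]
            .mean()
            .sort_values(ascending=False)
      )
      print(avg_price.head(10))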

    Explore the Beauty and Cosmetics Data Collection and elevate your data-driven strategies today!

    Who Can Use This Dataset?

    • E-commerce analysts/retailers analyzing cosmetic product catalogs and pricing.
    • Data scientists / ML engineers building recommendation engines or product-based machine-learning models.
    • Market researchers & beauty industry analysts tracking brand/product trends, availability, and consumer preferences.
    • Students/hobby developers exploring beauty-tech projects, demo analyses, or building portfolios with real-world data.

    Why This Sephora Dataset?

    • Skip the hassle: no need for manual scraping or dealing with anti-scraping restrictions.
    • Clean, structured data - ready for immediate integration with tools or pipelines.
    • Free and accessible: great for testing, proof-of-concept or small-scale analysis.
    • Beauty industry focus: concentrated on makeup and cosmetics products - ideal for niche analyses or applications.
  12. AREI Sample Data Set

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). AREI Sample Data Set [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15181963?locale=da
    Available download formats: unknown (3504513 bytes)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AR-Enhanced Inspection is a tool consisting of a service that manages the inspection data created by other processes, and the AREI mobile application, which is used to superimpose said data onto the inspected object. This setup enables a human operator to verify and further inspect found defects on an object, even if they are impossible to find by eye (e.g. microscopic defects, or defects on large objects). The provided dataset is a sample of the results of an inspection process, which can be uploaded to the defect service API. A Python script to upload the data to said API is also included.
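
    The bundled Python script is the authoritative reference for the upload. Purely to illustrate the idea, a hedged sketch might look like this; the endpoint URL, file name, and payload shape are all hypothetical.

      # Hypothetical sketch of posting inspection results to a defect API.
      import json
      import requests

      with open("inspection_results.json") as f:  # hypothetical file name
          payload = json.load(f)

      resp = requests.post(
          "https://example.org/defect-service/api/inspections",  # hypothetical URL
          json=payload,
          timeout=30,
      )
      resp.raise_for_status()
      print("upload OK, status:", resp.status_code)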

  13. Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators"

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 28, 2021
    Cite
    Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier (2021). Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5034291
    Dataset updated
    Jun 28, 2021
    Dataset provided by
    University of Lille / Inria
    Authors
    Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.

    Dataset Collection

    We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. The dataset D1 contains apps available in the Google Play Store in January 2014. We then created a new snapshot (D2) of the Google Play Store in March 2014.

    The apps belong to the 27 different categories defined by Google (at the time of writing the paper), and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median of 1,978 apps per category.

    For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.

    In addition, for each app, we collected up to the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, so several reviews lack some of these details. From all the reviews attached to an app, we only considered those associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews. This resulted in a corpus of 1,402,717 reviews (Jan. 2014).

    Dataset Stats

    Some stats about the datasets:

    • D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.

    • D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.

    Additional stats about the datasets are available here.

    Dataset Description

    To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because graph visualization helps to identify connections among the data (e.g., clusters of apps sharing similar sets of permission requests).

    In particular, our dataset graph contains six types of nodes:

    • APP nodes containing metadata of each app
    • PERMISSION nodes describing permission types
    • CATEGORY nodes describing app categories
    • SUBCATEGORY nodes describing app subcategories
    • USER_REVIEW nodes storing user reviews
    • TOPIC nodes describing topics mined from user reviews (using LDA)

    Furthermore, there are five types of relationships between APP nodes and each of the remaining nodes:

    • USES_PERMISSION relationships between APP and PERMISSION nodes
    • HAS_REVIEW between APP and USER_REVIEW nodes
    • HAS_TOPIC between USER_REVIEW and TOPIC nodes
    • BELONGS_TO_CATEGORY between APP and CATEGORY nodes
    • BELONGS_TO_SUBCATEGORY between APP and SUBCATEGORY nodes
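
    Given this schema, queries can be run with the official Neo4j Python driver, for example ranking permissions by how many apps request them. The node labels and relationship type below come from the description above; the property name "name" and the connection details are assumptions.

      # Hedged sketch: most-requested permissions via the Neo4j driver.
      from neo4j import GraphDatabase

      driver = GraphDatabase.driver("bolt://localhost:7687",
                                    auth=("neo4j", "neo4j"))
      query = """
      MATCH (a:APP)-[:USES_PERMISSION]->(p:PERMISSION)
      RETURN p.name AS permission, count(a) AS apps
      ORDER BY apps DESC LIMIT 10
      """
      with driver.session() as session:
          for record in session.run(query):
              print(record["permission"], record["apps"])
      driver.close()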

    Dataset Files Info

    Neo4j 2.0 Databases

    • googlePlayDB1-Jan2014_neo4j_2_0.rar
    • googlePlayDB2-Mar2014_neo4j_2_0.rar

    We provide two Neo4j databases containing the 2 snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0, in particular with the tool version 'Neo4j 2.0.0-M06 Community Edition' (the latest version available at the time of implementing the paper in 2014).

    Neo4j 3.5 Databases

    • googlePlayDB1-Jan2014_neo4j_3_5_28.rar
    • googlePlayDB2-Mar2014_neo4j_3_5_28.rar

    Neo4j 2.0 is now deprecated and no longer available for download in the official Neo4j Download Center. We have migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. The databases can be opened with the tool version 'Neo4j Community Edition 3.5.28', which can be downloaded from the official Neo4j Download page.

      In order to open the databases with more recent versions of Neo4j, the databases must first be migrated to the corresponding version. Instructions about the migration process can be found in the Neo4j Migration Guide.

      The first time the Neo4j database is connected, it may request credentials. The username and password are: neo4j/neo4j
    
  14. MockConf: Student Interpretation Dataset

    • live.european-language-grid.eu
    binary format
    Updated Dec 31, 2024
    Cite
    (2024). MockConf: Student Interpretation Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23892
    Available download formats: binary format
    Dataset updated
    Dec 31, 2024
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a dataset centered on Czech, comprising simultaneous interpreting data with human-annotated transcriptions at both the span and word levels. The dataset consists of interpretings collected from mock conferences run as part of the student interpreters' curriculum. These data were then manually aligned and annotated at the word and span levels using InterAlign, a dedicated tool designed to facilitate such annotation. The dataset is described and used in the paper "MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines".

  15. Sample data having columns with the information like ID, Tweet, and Label.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 19, 2024
    Cite
    Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh (2024). Sample data having columns with the information like ID, Tweet, and Label. [Dataset]. http://doi.org/10.1371/journal.pone.0315407.t002
    Available download formats: xls
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  16. The preprocessed Tara data set

    • plos.figshare.com
    zip
    Updated Aug 22, 2025
    + more versions
    Cite
    Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao (2025). The preprocessed Tara data set. [Dataset]. http://doi.org/10.1371/journal.pone.0300490.s003
    Available download formats: zip
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Omics-wide association analysis is a very important tool for medicine and human health studies. However, modern omics data sets often exhibit high dimensionality, responses of unknown distribution, features of unknown distribution, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis results depend on accurately modeling such data sets. Most existing association analysis methods rely on specific model assumptions and lack effective false discovery rate (FDR) control. To address these limitations, the paper first applies a single index model to omics data. The model is robust in that it allows the relationship between the response variable and a linear combination of covariates to be connected by any unknown monotonic link function, and both the random error and the covariates can follow any unknown distribution. Based on this model, the paper then combines a rank-based approach with a symmetrized data aggregation approach to develop a novel and robust feature selection method that achieves fine-mapping of risk features while controlling the false positive rate of selection. The theoretical results support the proposed method, and the analysis of simulated data shows that the new method performs effectively and robustly in all scenarios. The new method is also used to analyze two real datasets and identifies some risk features not reported by existing findings.
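
    In symbols (an illustrative formalization chosen here, not notation quoted from the paper), the single index model takes the form:

      % Single index model: the response depends on the covariates only
      % through one linear combination, via an unknown monotonic link g.
      \[
        Y = g\left(\mathbf{X}^{\top}\boldsymbol{\beta}\right) + \varepsilon
      \]
      % with g an unknown monotonic link function, X the covariate vector,
      % beta the index coefficients, and epsilon an error term of unknown
      % distribution.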

  17. Synthetic E-Commerce Relational Datasets

    • kaggle.com
    Updated Aug 31, 2025
    Cite
    Nael Aqel (2025). Synthetic E-Commerce Relational Datasets [Dataset]. https://www.kaggle.com/datasets/naelaqel/synthetic-e-commerce-relational-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nael Aqel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic E-Commerce Relational Dataset

    This dataset is synthetically generated fake data designed to simulate a realistic e-commerce environment.

    Purpose

    To provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. Ideal for benchmarking, educational projects, and data engineering experiments.

    Entity Relationship Diagram (ERD) - Tables Overview

    1. Customers

    • customer_id (int): Unique identifier for each customer
    • name (string): Customer full name
    • email (string): Customer email address
    • gender (string): Customer gender ('Male', 'Female', 'Other')
    • signup_date (date): Date customer signed up
    • country (string): Customer country of residence

    2. Products

    • product_id (int): Unique identifier for each product
    • product_name (string): Name of the product
    • category (string): Product category (e.g., Electronics, Books)
    • price (float): Price per unit
    • stock_quantity (int): Available stock count
    • brand (string): Product brand name

    3. Orders

    • order_id (int): Unique identifier for each order
    • customer_id (int): ID of the customer who placed the order (foreign key to Customers)
    • order_date (date): Date when order was placed
    • total_amount (float): Total amount for the order
    • payment_method (string): Payment method used (Credit Card, PayPal, etc.)
    • shipping_country (string): Country where the order is shipped

    4. Order Items

    • order_item_id (int): Unique identifier for each order item
    • order_id (int): ID of the order this item belongs to (foreign key to Orders)
    • product_id (int): ID of the product ordered (foreign key to Products)
    • quantity (int): Number of units ordered
    • unit_price (float): Price per unit at order time

    5. Product Reviews

    • review_id (int): Unique identifier for each review
    • product_id (int): ID of the reviewed product (foreign key to Products)
    • customer_id (int): ID of the customer who wrote the review (foreign key to Customers)
    • rating (int): Rating score (1 to 5)
    • review_text (string): Text content of the review
    • review_date (date): Date the review was written
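
    Given this schema, relational queries are straightforward; the sketch below uses DuckDB over the Parquet folder, with the Parquet file names assumed from the table names above.

      # Hedged sketch: top products by revenue with DuckDB over Parquet.
      import duckdb

      con = duckdb.connect()
      top_products = con.execute("""
          SELECT p.product_name,
                 SUM(oi.quantity * oi.unit_price) AS revenue
          FROM 'parquet/order_items.parquet' AS oi
          JOIN 'parquet/products.parquet' AS p USING (product_id)
          GROUP BY p.product_name
          ORDER BY revenue DESC
          LIMIT 10
      """).fetchdf()
      print(top_products)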

    Visual ERD

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9179978%2F7681afe8fc52a116ff56a2a4e179ad19%2FEDR.png?generation=1754741998037680&alt=media

    Notes

    • All data is randomly generated using Python’s Faker library, so it does not reflect any real individuals or companies.
    • The data is provided in both CSV and Parquet formats.
    • The generator script is available in the accompanying GitHub repository for reproducibility and customization.

    Output

    The script saves two folders inside the specified output path:

    csv/    # CSV files
    parquet/  # Parquet files
    

    License

    MIT License


  18. Data from: Delta Food Outlets Study

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated May 8, 2025
    Cite
    Agricultural Research Service (2025). Delta Food Outlets Study [Dataset]. https://catalog.data.gov/dataset/delta-food-outlets-study-2786d
    Dataset updated
    May 8, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The Delta Food Outlets Study was an observational study designed to assess the nutritional environments of 5 towns located in the Lower Mississippi Delta region of Mississippi. It was an ancillary study to the Delta Healthy Sprouts Project and therefore included towns in which Delta Healthy Sprouts participants resided and that contained at least one convenience (corner) store, grocery store, or gas station. Data were collected via electronic surveys between March 2016 and September 2018 using the Nutrition Environment Measures Survey (NEMS) tools. Survey scores for the NEMS Corner Store, NEMS Grocery Store, and NEMS Restaurant were computed using modified scoring algorithms provided for these tools via SAS software programming. Because the towns were not randomly selected and the sample sizes are relatively small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi. Dataset one (NEMS-C) contains data collected with the NEMS Corner (convenience) Store tool. Dataset two (NEMS-G) contains data collected with the NEMS Grocery Store tool. Dataset three (NEMS-R) contains data collected with the NEMS Restaurant tool.

    Resources in this dataset:

    • Delta Food Outlets Data Dictionary (DFO_DataDictionary_Public.csv): data dictionary for all 3 datasets that are part of the Delta Food Outlets Study.
    • Dataset One NEMS-C (NEMS-C Data.csv): data collected with the NEMS tool for convenience stores.
    • Dataset Two NEMS-G (NEMS-G Data.csv): data collected with the NEMS tool for grocery stores.
    • Dataset Three NEMS-R (NEMS-R Data.csv): data collected with the NEMS tool for restaurants.

    Recommended software for all resources: Microsoft Excel (https://products.office.com/en-us/excel).

  19. CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

    • zenodo.org
    • data.niaid.nih.gov
    • +3more
    bin, zip
    Updated Jan 24, 2020
    Cite
    Farah Zaib Khan; Farah Zaib Khan; Stian Soiland-Reyes; Stian Soiland-Reyes (2020). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42.1
    Available download formats: zip, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Farah Zaib Khan; Farah Zaib Khan; Stian Soiland-Reyes; Stian Soiland-Reyes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. In total there are five steps in the workflow:

    1. Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
    2. The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
    3. SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
    4. The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
    5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.

    For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access sample. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software, and the detailed documentation were important factors in choosing this specific CWL workflow for the CWLProv evaluation.

    This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

    Steps to reproduce

    To build the research object again, use Python 3 on macOS. Built with:

    • Processor 2.8GHz Intel Core i7
    • Memory: 16GB
    • OS: macOS High Sierra, Version 10.13.3
    • Storage: 250GB
    1. Install cwltool

      pip3 install cwltool==1.0.20180912090223
    2. Install git lfs
      Downloading the data alongside the git repository requires installing Git LFS:
      https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

    3. Get the data and make the analysis environment ready:

      git clone https://github.com/FarahZKhan/cwl_workflows.git
      cd cwl_workflows/
      git checkout CWLProvTesting
      ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
    4. Run the following commands to create the CWLProv Research Object:

      cwltool --provenance rnaseqwf_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
      
      zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
      sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256

    The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120

  20. Simulated-Orchards Dataset

    • datasetninja.com
    • kaggle.com
    Updated Nov 30, 2023
    Cite
    Dylan Hasperhoven; Maya Aghaei; Klaas Dijkstra (2023). Simulated-Orchards Dataset [Dataset]. https://datasetninja.com/simulated-orchards
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Dataset Ninja
    Authors
    Dylan Hasperhoven; Maya Aghaei; Klaas Dijkstra
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Simulated-Orchards presents a dataset designed explicitly for object detection tasks, featuring 1,499 images containing a total of 44,885 labeled objects, all falling within a single class: apple. Notably, this dataset is generated through a tool developed in the Unity 3D game engine, allowing for the systematic creation of simulated datasets. The focus on a single class, in this case apples, caters to applications in object detection, offering a rich resource for training models to identify and locate apples within simulated orchard environments, and providing a valuable asset for agricultural and computer vision research.
