CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Today we produce more information than ever before, but not all of it is true. Some of it is actively malicious and harmful, which makes it harder to trust any piece of information we come across. On top of that, bad actors can now use language modelling tools such as OpenAI's GPT-2 to generate fake news. Ever since its initial release, there have been concerns about how it could be misused to generate misleading news articles, automate the production of abusive or fake content for social media, and automate the creation of spam and phishing content.
How do we figure out what is true and what is fake? Can we do something about it?
The dataset consists of around 387,000 pieces of text sourced from news articles on the web as well as texts generated by OpenAI's GPT-2 language model.
The dataset is split into train, validation, and test sets, each with an equal split of the two classes.
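As a quick way to get started with the two-class task, here is a minimal baseline sketch using TF-IDF features and logistic regression; the tiny in-memory examples are placeholders, since the challenge's actual file layout is not described here.

# Minimal real-vs-generated text baseline sketch (toy examples stand in
# for the challenge's train split; replace with the actual data files).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The council approved the new budget on Tuesday.",   # human-written (label 0)
    "Scientists confirm the moon is made of cheese.",     # generated/fake (label 1)
]
train_labels = [0, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)
print(model.predict(["New study shows coffee cures all diseases."]))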
This dataset was published on AIcrowd as part of the KIIT AI (mini)Blitz⚡ Challenge. AI Blitz⚡ is a series of educational challenges by AIcrowd that aims to make it easy for anyone to get started with the world of AI. This particular AI Blitz⚡ challenge was exclusive to the students and faculty of the Kalinga Institute of Industrial Technology.
License: https://www.nist.gov/open/license
This software tool generates simulated radar signals and creates RF datasets. The datasets can be used to develop and test detection algorithms that utilize machine learning/deep learning techniques for the 3.5 GHz Citizens Broadband Radio Service (CBRS) or similar bands, in which the primary users are federal incumbent radar systems. The software tool generates radar waveforms and randomizes the radar waveform parameters. The pulse modulation types for the radar signals and their parameters are selected based on NTIA testing procedures for ESC certification, available at http://www.its.bldrdoc.gov/publications/3184.aspx. Furthermore, the tool mixes the waveforms with interference and packages them into one RF dataset file. The tool utilizes a graphical user interface (GUI) to simplify the selection of parameters and the mixing process. A reference RF dataset was generated using this software. The RF dataset is published at https://doi.org/10.18434/M32116.
The Delta Neighborhood Physical Activity Study was an observational study designed to assess characteristics of neighborhood built environments associated with physical activity. It was an ancillary study to the Delta Healthy Sprouts Project and therefore included towns and neighborhoods in which Delta Healthy Sprouts participants resided. The 12 towns were located in the Lower Mississippi Delta region of Mississippi. Data were collected via electronic surveys between August 2016 and September 2017 using the Rural Active Living Assessment (RALA) tools and the Community Park Audit Tool (CPAT). Scale scores for the RALA Programs and Policies Assessment and the Town-Wide Assessment were computed using the scoring algorithms provided for these tools via SAS software programming. The Street Segment Assessment and CPAT do not have associated scoring algorithms and therefore no scores are provided for them. Because the towns were not randomly selected and the sample size is small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi. Dataset one contains data collected with the RALA Programs and Policies Assessment (PPA) tool. Dataset two contains data collected with the RALA Town-Wide Assessment (TWA) tool. Dataset three contains data collected with the RALA Street Segment Assessment (SSA) tool. Dataset four contains data collected with the Community Park Audit Tool (CPAT). [Note: title changed 9/4/2020 to reflect study name]
Resources in this dataset:
Resource Title: Dataset One RALA PPA Data Dictionary. File Name: RALA PPA Data Dictionary.csv. Resource Description: Data dictionary for dataset one collected using the RALA PPA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Two RALA TWA Data Dictionary. File Name: RALA TWA Data Dictionary.csv. Resource Description: Data dictionary for dataset two collected using the RALA TWA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Three RALA SSA Data Dictionary. File Name: RALA SSA Data Dictionary.csv. Resource Description: Data dictionary for dataset three collected using the RALA SSA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Four CPAT Data Dictionary. File Name: CPAT Data Dictionary.csv. Resource Description: Data dictionary for dataset four collected using the CPAT. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset One RALA PPA. File Name: RALA PPA Data.csv. Resource Description: Data collected using the RALA PPA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Two RALA TWA. File Name: RALA TWA Data.csv. Resource Description: Data collected using the RALA TWA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Three RALA SSA. File Name: RALA SSA Data.csv. Resource Description: Data collected using the RALA SSA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Four CPAT. File Name: CPAT Data.csv. Resource Description: Data collected using the CPAT. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Data Dictionary. File Name: DataDictionary_RALA_PPA_SSA_TWA_CPAT.csv. Resource Description: This is a combined data dictionary from each of the 4 dataset files in this set.
This dataset is associated with the manuscript "Translating nanoEHS data using EPA NaKnowBase and the Resource Description Framework" (Mortensen H, Williams A, Beach B, Slaughter W, Senn J and Boyes W), submitted 8/3/2023 to F1000:Nanotoxicology. The dataset includes an RDF mapping of EPA NaKnowBase (NKB), the OntoSearcher code used to produce the NKB RDF file, as well as training materials and example files for the user. Portions of this dataset are inaccessible because they include partner data and old code that has been modified since 2021. They can be accessed through the following means: OntoSearcher_Training_Materials.zip. Format: The file entitled "OntoSearcher_Training_Materials.zip" includes updated materials as of 07/11/23. These files include the OntoSearcher tool materials, a sample NKB dataset with corresponding training documentation on how to run the tool with the sample dataset and apply it to the user's own data. This directory also includes the current RDF mapping of the NKB (NKB_RDF_V3.ttl).
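For users working outside the provided tooling, the Turtle file can be inspected with a generic RDF library; the sketch below uses rdflib (not part of this dataset's tooling) and assumes NKB_RDF_V3.ttl has been extracted into the working directory.

# Sketch: inspect the NKB RDF mapping with rdflib (assumes the .ttl file
# from OntoSearcher_Training_Materials.zip is in the current directory).
from rdflib import Graph

g = Graph()
g.parse("NKB_RDF_V3.ttl", format="turtle")
print(f"Loaded {len(g)} triples")

# Print a handful of triples to get a feel for the vocabulary used.
for i, (s, p, o) in enumerate(g):
    print(s, p, o)
    if i >= 4:
        break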
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create the dataset, the top 10 countries by COVID-19 incidence worldwide as of October 22, 2020 (on the eve of the second wave of the pandemic) that are represented in the Global 500 ranking for 2020 were selected: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, up to 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. Arithmetic averages were calculated for the change (increase) in indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees. The arithmetic mean values of these indicators across all countries in the sample were then found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data are collected in a single Microsoft Excel workbook.

The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. It is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the dataset contains formulas rather than ready-made numbers, adding and/or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data but also charts that provide data visualization.

The dataset contains both actual and forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented as a normal distribution of predicted values and the probability of their occurrence in practice. This allows for broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship, by substituting various predicted morbidity and mortality rates in the risk assessment tables and obtaining automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and after the second wave of the pandemic to check the reliability of the forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical initial and predicted values of the studied indicators, but also their qualitative interpretation, reflecting the presence and level of risks of the pandemic and the COVID-19 crisis for international entrepreneurship.
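As an illustration of the kind of scenario analysis described above, the sketch below evaluates a hypothetical normally distributed forecast of daily new cases; the mean, standard deviation and threshold are invented placeholders, not values taken from the dataset.

# Sketch of a normal-distribution scenario check (all numbers are
# hypothetical placeholders, not values from the dataset itself).
from scipy.stats import norm

mean_cases = 60_000   # assumed forecast mean of daily new cases
std_cases = 8_000     # assumed forecast standard deviation
threshold = 75_000    # assumed "high-risk" scenario threshold

# Probability that the forecasted value exceeds the threshold.
p_exceed = norm.sf(threshold, loc=mean_cases, scale=std_cases)
print(f"P(daily cases > {threshold}) = {p_exceed:.3f}")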
The datasets in this collection are entirely fake. They were developed principally to demonstrate the workings of a number of utility scoring and mapping algorithms. However, they may be of more general use to others. In some limited cases, some of the included files could be used in exploratory, simulation-based analyses. However, you should read the metadata descriptors for each file to inform yourself of the validity and limitations of each fake dataset. To open the RDS format files included in this dataset, the R package ready4use needs to be installed (see https://ready4-dev.github.io/ready4use/). It is also recommended that you install the youthvars package (https://ready4-dev.github.io/youthvars/), as it provides useful tools for inspecting and validating each dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The "Ultimate Data Science Interview Q&A Treasury" dataset is a meticulously curated collection designed to empower aspiring data scientists with the knowledge and insights needed to excel in the competitive field of data science. Whether you're a beginner seeking to ground your foundations or an experienced professional aiming to brush up on the latest trends, this treasury serves as an indispensable guide. Furthermore, you might want to work on the following exercises using this dataset :
1) Keyword Analysis for Trending Topics: Frequency Analysis: Identify the most common keywords or terms that appear in the questions to spot trending topics or skills.
2) Topic Modeling: Use algorithms like Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to group questions into topics automatically. This can reveal the underlying themes or areas of focus in data science interviews (see the sketch after this list).
3) Text Difficulty Level Analysis: Implement Natural Language Processing (NLP) techniques to evaluate the complexity of questions and answers. This could help in categorizing them into beginner, intermediate, and advanced levels.
4) Clustering for Unsupervised Learning: Apply clustering techniques to group similar questions or answers together. This could help identify unique question patterns or common answer structures.
5) Automated Question Generation: Train a model to generate new interview questions based on the patterns and topics discovered in the dataset. This could be a valuable tool for creating mock interviews or study guides.
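As a starting point for the topic-modeling exercise, here is a minimal LDA sketch with scikit-learn; the tiny in-memory question list stands in for the dataset's question column, whose actual name is not specified here.

# Minimal LDA topic-modeling sketch (toy questions stand in for the
# dataset's actual question column, whose name is an assumption here).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

questions = [
    "Explain the bias-variance tradeoff in supervised learning.",
    "How does gradient descent optimize a loss function?",
    "What is the difference between bagging and boosting?",
    "How would you handle missing values in a dataset?",
    "Describe how a confusion matrix is used to evaluate a classifier.",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(questions)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Show the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")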
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set contains raw data (Pxxx_Fyy_Czz.csv files) and processed data (a file with designated features - FeatureAndMetadata_Milling.csv) from the full life cycle of 14 cutting tools used in the milling process. The tools performed 968 milling cycles. The data contain vibration signals (8 measuring channels from the spindle and work table) and current signals (12 measuring channels from the spindle and work table).
A metadata file is also available, in which each cycle is assigned process data (e.g. tool number, sample number, sample hardness).
The data set is useful for work on tool condition classification or estimation of tool service life.
It is possible to use only FeatureAndMetadata_Milling.csv and work with calculated features or download all files and work with raw data.
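For the feature-based route, the processed file can be loaded directly with pandas; the sketch below only inspects the file's structure, since the exact feature and metadata column names are not listed in this description.

# Sketch: quick look at the processed feature file (no column names are
# assumed; we only inspect what the file actually contains).
import pandas as pd

features = pd.read_csv("FeatureAndMetadata_Milling.csv")
print(features.shape)               # rows (milling cycles) x columns
print(features.columns.tolist())    # feature and metadata column names
print(features.describe(include="all").head())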
A full description is available in an Open Access article: https://www.nature.com/articles/s41597-025-04923-y
If you reuse this dataset in your research, please cite this article.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This project is a collection of files to allow users to reproduce the model development and benchmarking in "Dawnn: single-cell differential abundance with neural networks" (Hall and Castellano, under review). Dawnn is a tool for detecting differential abundance in single-cell RNAseq datasets. It is available as an R package here. Please contact us if you are unable to reproduce any of the analysis in our paper. The files in this collection correspond to the benchmarking dataset based on simulated linear trajectories.
FILES: Data processing code
adapted_traj_sim_milo_paper.R: Lightly adapted code from Dann et al. to simulate single-cell RNAseq datasets that form linear trajectories.
generate_test_data_linear_traj_sim_milo_paper.R: R code to assign simulated labels to datasets generated from adapted_traj_sim_milo_paper.R. Seurat objects are saved as cells_sim_linear_traj_gex_seed_*.rds. Simulated labels are saved as benchmark_dataset_sim_linear_traj.csv.
Resulting datasets
cells_sim_linear_traj_gex_seed_*.rds: Seurat objects generated by generate_test_data_linear_traj_sim_milo_paper.R.
benchmark_dataset_sim_linear_traj.csv: Cell labels generated by generate_test_data_linear_traj_sim_milo_paper.R.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description: This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.
Key Features:
1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.
Potential Use Cases:
Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!
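To illustrate the data-parsing use case, the sketch below extracts raw text from one of the sample PDFs with pypdf; the file name is a placeholder, since the actual record file names are not given in this description.

# Sketch: pull raw text out of one sample EHR PDF (the file name below is
# a placeholder; substitute one of the five provided records).
from pypdf import PdfReader

reader = PdfReader("sample_ehr_record_1.pdf")
for page_number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""
    print(f"--- page {page_number} ---")
    print(text[:500])   # preview the first few hundred characters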
https://crawlfeeds.com/privacy_policy
Looking for a free dataset of cosmetic products? The Sephora Makeup Products Sample Dataset provides a ready-to-use CSV of beauty product data containing 340 verified Sephora makeup product records. It includes details like product name, brand, price, ingredients, availability, user reviews count, and images - perfect for e-commerce research, market analysis, price tracking, or building machine-learning and recommendation systems for the beauty industry.
This dataset is perfect for market research, price tracking, sentiment analysis, and AI-based recommendation systems. Whether you're an e-commerce retailer, a data analyst, or a machine learning professional, this dataset provides valuable insights into the beauty industry.
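As a quick example of the price-tracking use case, the sketch below computes average price per brand with pandas; the CSV file name and the column names ("brand", "price") are assumptions, so adjust them to the actual headers in the download.

# Sketch: average price per brand (file and column names are assumptions;
# check the CSV's actual headers before running).
import pandas as pd

products = pd.read_csv("sephora_makeup_products.csv")
avg_price = (
    products.groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price.head(10))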
Explore the Beauty and Cosmetics Data Collection and elevate your data-driven strategies today!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AR-Enhanced Inspection is a tool consisting of a service that manages the inspection data created by other processes and the AREI mobile application, which is used to superimpose said data onto the inspected object. This setup enables a human operator to verify and further inspect found defects on an object, even if they are impossible to find by eye (e.g., microscopic defects or defects on large objects). The provided dataset is a sample of the results of an inspection process, which can be uploaded to the defect service API. A Python script to upload the data to said API is also included.
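The bundled script is the reference for uploads; the sketch below only illustrates the general upload pattern with requests, and the endpoint URL, file name and payload shape are hypothetical rather than taken from the actual defect service API.

# Illustrative upload pattern only; the endpoint, file name and payload
# structure are hypothetical -- use the Python script shipped with the
# dataset for the real defect service API.
import json
import requests

with open("inspection_results_sample.json") as f:   # hypothetical file name
    inspection_results = json.load(f)

response = requests.post(
    "https://example.org/defect-service/api/inspections",  # hypothetical URL
    json=inspection_results,
    timeout=30,
)
response.raise_for_status()
print("Upload accepted:", response.status_code)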
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.
Dataset Collection
We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. The dataset D1 contains the apps available in the Google Play Store in January 2014. Then, we created a new snapshot (D2) of the Google Play Store in March 2014.
The apps belong to the 27 different categories defined by Google (at the time of writing the paper) and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median number of 1,978 apps per category.
For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.
In addition, for each app, we collected up to the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, so several reviews lack some of these details. From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews. This resulted in a corpus of 1,402,717 reviews (Jan. 2014).
Dataset Stats
Some stats about the datasets:
D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.
D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.
Additional stats about the datasets are available here.
Dataset Description
To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).
In particular, our dataset graph contains six types of nodes:
- APP nodes containing metadata of each app,
- PERMISSION nodes describing permission types,
- CATEGORY nodes describing app categories,
- SUBCATEGORY nodes describing app subcategories,
- USER_REVIEW nodes storing user reviews,
- TOPIC nodes describing topics mined from user reviews (using LDA).
Furthermore, there are five types of relationships between APP nodes and each of the remaining nodes:
Dataset Files Info
Neo4j 2.0 Databases
googlePlayDB1-Jan2014_neo4j_2_0.rar
googlePlayDB2-Mar2014_neo4j_2_0.rar
We provide two Neo4j databases containing the 2 snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0, in particular with the tool version 'Neo4j 2.0.0-M06 Community Edition' (the latest version available at the time of implementing the paper in 2014).
Neo4j 3.5 Databases
googlePlayDB1-Jan2014_neo4j_3_5_28.rar
googlePlayDB2-Mar2014_neo4j_3_5_28.rar
Currently, Neo4j 2.0 is deprecated and is no longer available for download in the official Neo4j Download Center. We have migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. These databases can be opened with the tool version 'Neo4j Community Edition 3.5.28', which can be downloaded from the official Neo4j Download page.
In order to open the databases with more recent versions of Neo4j, they must first be migrated to the corresponding version. Instructions about the migration process can be found in the Neo4j Migration Guide.
The first time the Neo4j database is connected to, it may request credentials. The username and password are: neo4j/neo4j
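Once a database is running, the graph can also be queried programmatically; the sketch below uses the official Neo4j Python driver against a local instance, where the Bolt URL and the undirected relationship pattern are assumptions, since only the node labels are listed above.

# Sketch: count permissions per app via the Neo4j Python driver (the Bolt
# URL, credentials and the undirected relationship pattern are assumptions;
# only the node labels APP and PERMISSION come from the dataset description).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

query = """
MATCH (a:APP)-[r]-(p:PERMISSION)
RETURN a, count(p) AS permission_count
ORDER BY permission_count DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["a"], record["permission_count"])

driver.close()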
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains a dataset centered on Czech, comprising simultaneous interpreting data with human-annotated transcriptions at both the span and word levels. The dataset consists of interpretations that were collected from Mock Conferences run as part of the student interpreters' curriculum. These data were then manually aligned and annotated at the word and span level using InterAlign, a dedicated tool designed to facilitate annotation at these levels. The dataset is described and used in the paper "MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adapted from: https://www.kaggle.com/datasets/csmalarkodi/covid-fake-news-dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Omics-wide association analysis is an important tool for medicine and human health research. However, modern omics data sets often exhibit high dimensionality, responses with unknown distributions, features with unknown distributions, and unknown, complex association relationships between the response and its explanatory features. Reliable association analysis results depend on accurate modeling of such data sets. Most existing association analysis methods rely on specific model assumptions and lack effective false discovery rate (FDR) control. To address these limitations, this paper first applies a single index model to omics data. The model is robust in that it allows the relationship between the response variable and a linear combination of covariates to be connected by any unknown monotonic link function, and both the random error and the covariates can follow any unknown distribution. Based on this model, the paper then combines a rank-based approach with a symmetrized data aggregation approach to develop a novel and robust feature selection method for fine-mapping of risk features while controlling the false positive rate of selection. The theoretical results support the proposed method, and the analysis of simulated data shows that the new method performs effectively and robustly across all scenarios. The new method is also used to analyze two real datasets and identifies some risk features unreported by existing findings.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is synthetically generated fake data designed to simulate a realistic e-commerce environment.
Its purpose is to provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. It is ideal for benchmarking, educational projects, and data engineering experiments.
The dataset comprises five relational tables:

Customers:
- (int): Unique identifier for each customer
- (string): Customer full name
- (string): Customer email address
- (string): Customer gender ('Male', 'Female', 'Other')
- (date): Date customer signed up
- (string): Customer country of residence

Products:
- (int): Unique identifier for each product
- (string): Name of the product
- (string): Product category (e.g., Electronics, Books)
- (float): Price per unit
- (int): Available stock count
- (string): Product brand name

Orders:
- (int): Unique identifier for each order
- (int): ID of the customer who placed the order (foreign key to Customers)
- (date): Date when order was placed
- (float): Total amount for the order
- (string): Payment method used (Credit Card, PayPal, etc.)
- (string): Country where the order is shipped

Order Items:
- (int): Unique identifier for each order item
- (int): ID of the order this item belongs to (foreign key to Orders)
- (int): ID of the product ordered (foreign key to Products)
- (int): Number of units ordered
- (float): Price per unit at order time

Reviews:
- (int): Unique identifier for each review
- (int): ID of the reviewed product (foreign key to Products)
- (int): ID of the customer who wrote the review (foreign key to Customers)
- (int): Rating score (1 to 5)
- (string): Text content of the review
- (date): Date the review was written

An entity-relationship diagram (EDR.png) illustrating these tables accompanies the dataset description.
The script saves two folders inside the specified output path:
csv/ # CSV files
parquet/ # Parquet files
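Either folder can be queried directly; for example, the DuckDB sketch below runs SQL over one of the generated CSV files (the customers.csv file name is an assumption about what the csv/ folder contains).

# Sketch: query the generated CSVs in place with DuckDB (the file name
# csv/customers.csv is an assumption about the generator's output).
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT *
    FROM read_csv_auto('csv/customers.csv')
    LIMIT 5
    """
).fetchdf()
print(result)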
MIT License
The Delta Food Outlets Study was an observational study designed to assess the nutritional environments of 5 towns located in the Lower Mississippi Delta region of Mississippi. It was an ancillary study to the Delta Healthy Sprouts Project and therefore included towns in which Delta Healthy Sprouts participants resided and that contained at least one convenience (corner) store, grocery store, or gas station. Data were collected via electronic surveys between March 2016 and September 2018 using the Nutrition Environment Measures Survey (NEMS) tools. Survey scores for the NEMS Corner Store, NEMS Grocery Store, and NEMS Restaurant were computed using modified scoring algorithms provided for these tools via SAS software programming. Because the towns were not randomly selected and the sample sizes are relatively small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi. Dataset one (NEMS-C) contains data collected with the NEMS Corner (convenience) Store tool. Dataset two (NEMS-G) contains data collected with the NEMS Grocery Store tool. Dataset three (NEMS-R) contains data collected with the NEMS Restaurant tool.
Resources in this dataset:
Resource Title: Delta Food Outlets Data Dictionary. File Name: DFO_DataDictionary_Public.csv. Resource Description: This file contains the data dictionary for all 3 datasets that are part of the Delta Food Outlets Study. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset One NEMS-C. File Name: NEMS-C Data.csv. Resource Description: This file contains data collected with the Nutrition Environment Measures Survey (NEMS) tool for convenience stores. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Two NEMS-G. File Name: NEMS-G Data.csv. Resource Description: This file contains data collected with the Nutrition Environment Measures Survey (NEMS) tool for grocery stores. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Three NEMS-R. File Name: NEMS-R Data.csv. Resource Description: This file contains data collected with the Nutrition Environment Measures Survey (NEMS) tool for restaurants. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This workflow adapts the approach and parameter settings of the Trans-Omics for Precision Medicine (TOPMed) RNA-seq pipeline, which originated from the Broad Institute. There are in total five steps in the workflow, starting from:
For testing and analysis, the workflow authors provided example data created by down-sampling the read files of a TOPMed public-access dataset. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions describing the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software, and the detailed documentation were important factors in choosing this specific CWL workflow for the CWLProv evaluation.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
Steps to reproduce
To build the research object again, use Python 3 on macOS. Built with:
Install cwltool
pip3 install cwltool==1.0.20180912090223
Install git lfs
Downloading the data with the git repository requires installing Git LFS:
https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/cwl_workflows.git
cd cwl_workflows/
git checkout CWLProvTesting
./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
Run the following commands to create the CWLProv Research Object:
cwltool --provenance rnaseqwf_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256
The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Simulated-Orchards is a dataset designed explicitly for object detection tasks, featuring 1,499 images containing a total of 44,885 labeled objects, all belonging to a single class: apple. Notably, the dataset is generated through a tool developed in the Unity 3D game engine, allowing for the systematic creation of simulated datasets. The focus on a single class, in this case apples, caters to applications in object detection, offering a rich resource for training models to identify and locate apples within simulated orchard environments and providing a valuable asset for agricultural and computer vision research.