100+ datasets found

Fake News Detection
kaggle.com
zip
Updated Nov 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news
Explore at:
zip(38846301 bytes)Available download formats
Dataset updated
Nov 4, 2025
Authors
KranNaik777
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." These articles have been collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. On average, articles vary in length, providing a rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification. It often contains thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Additionally, some versions of the dataset may also include image URLs for multimodal analysis, expanding the detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

Here is a description for each column header of the fake news dataset:

id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

headline: The title or headline of the news article, summarizing the key news story in brief.

written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

news: The full text content of the news article, which is the main body used for analysis and classification.

label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.
Z
DustNet - structured data and Python code to reproduce the model,...
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Jul 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
Explore at:
Dataset updated
Jul 7, 2024
Dataset provided by
Kingston University
University of Exeter
Authors
Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and Python code used for AOD prediction with DustNet model - a Machine Learning/AI based forecasting.

Model input data and code

Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised and split data into training/validation/testing sets is also provided with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

Model output data and code

This dataset was constructed by running the ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecased AOD data (2020-2022) by CAMS, DustNet model, naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data form MODIS is provided for comparison and statystical analysis of predictions. It is intended for a quick reproduction of figures and statystical analysis presented in DustNet introducing paper.

*datasets are NumPy arrays (v1.23) created in Python v3.8.18.

**all ML models were created with Keras in Python v3.10.10.
deepfakeguard-ml-dataset
kaggle.com
zip
Updated Nov 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
karthikeyan1817 (2025). deepfakeguard-ml-dataset [Dataset]. https://www.kaggle.com/datasets/karthikeyan1817/deepfakeguard-ml-dataset
Explore at:
zip(23279339385 bytes)Available download formats
Dataset updated
Nov 15, 2025
Authors
karthikeyan1817
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains DeepFakeGuard image data split into multiple 1GB parts for stable uploading and training. The original dataset includes real and fake image frames extracted from video sources for deepfake detection. These split files can be joined to reconstruct the full dataset and used for training machine learning and computer vision models. This dataset is created only for educational and research use in the DeepFakeGuard ML project.
Ad-hoc statistical analysis: 2020/21 Quarter 2
gov.uk
s3.amazonaws.com
Updated Sep 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department for Digital, Culture, Media & Sport (2020). Ad-hoc statistical analysis: 2020/21 Quarter 2 [Dataset]. https://www.gov.uk/government/statistical-data-sets/ad-hoc-statistical-analysis-202021-quarter-2
Explore at:
Dataset updated
Sep 11, 2020
Dataset provided by
GOV.UKhttp://gov.uk/
Authors
Department for Digital, Culture, Media & Sport
Description
This page lists ad-hoc statistics released during the period July - September 2020. These are additional analyses not included in any of the Department for Digital, Culture, Media and Sport’s standard publications.

If you would like any further information please contact evidence@dcms.gov.uk.

July 2020 - DCMS Economic Estimates: Number of businesses and Gross Value Added (GVA) by turnover band (2018)

This analysis considers businesses in the DCMS Sectors split by whether they had reported annual turnover above or below £500 million, at one time the threshold for the Coronavirus Business Interruption Loan Scheme (CBILS). Please note the DCMS Sectors totals here exclude the Tourism and Civil Society sectors, for which data is not available or has been excluded for ease of comparability.

The analysis looked at number of businesses; and total GVA generated for both turnover bands. In 2018, an estimated 112 DCMS Sector businesses had an annual turnover of £500m or more (0.03% of the total DCMS Sector businesses). These businesses generated 35.3% (£73.9bn) of all GVA by the DCMS Sectors.

These are trends are broadly similar for the wider non-financial UK business economy, where an estimated 823 businesses had an annual turnover of £500m or more (0.03% of the total) and generated 24.3% (£409.9bn) of all GVA.

The Digital Sector had an estimated 89 businesses (0.04% of all Digital Sector businesses) – the largest number – with turnover of £500m or more; and these businesses generated 41.5% (£61.9bn) of all GVA for the Digital Sector. By comparison, the Creative Industries had an estimated 44 businesses with turnover of £500m or more (0.01% of all Creative Industries businesses), and these businesses generated 23.9% (£26.7bn) of GVA for the Creative Industries sector.

https://assets.publishing.service.gov.uk/media/5f05e78ce90e0712cc90b6f7/dcms-businesses-turnover-split-by-number-and-gva-2018.xlsx">

https://assets.publishing.service.gov.uk/media/5f05e78ce90e0712cc90b6f7/dcms-businesses-turnover-split-by-number-and-gva-2018.xlsx">Number and Gross Value Added by businesses in DCMS sectors, split by annual turnover, 2018

MS Excel Spreadsheet, 42.5 KB

July 2020 - ONS Opinions and Lifestyle Omnibus Survey, February 2020 Data Module

This analysis shows estimates from the ONS Opinion and Lifestyle Omnibus Survey Data Module, commissioned by DCMS in February 2020. The Opinions and Lifestyles Survey (OPN) is run by the Office for National Statistics. For more information on the survey, please see the https://www.ons.gov.uk/aboutus/whatwedo/paidservices/opinions" class="govuk-link">ONS website.

DCMS commissioned 19 questions to be included in the February 2020 survey relating to the public’s views on a range of data related issues, such as trust in different types of organisations when handling personal data, confidence using data skills at work, understanding of how data is managed by companies and the use of data skills at work.

The high level results are included in the accompanying tables. The survey samples adults (16+) across the whole of Great Britain (excluding the Isles of Scilly).

<a class="govuk-link" target="_s
Riga Data Science Club
kaggle.com
zip
Updated Mar 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dmitry Yemelyanov (2021). Riga Data Science Club [Dataset]. https://www.kaggle.com/datasets/dmitryyemelyanov/rigadsclub
Explore at:
zip(494849 bytes)Available download formats
Dataset updated
Mar 29, 2021
Authors
Dmitry Yemelyanov
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Riga
Description
Context

Riga Data Science Club is a non-profit organisation to share ideas, experience and build machine learning projects together. Data Science community should known own data, so this is a dataset about ourselves: our website analytics, social media activity, slack statistics and even meetup transcriptions!

Content

Dataset is split up in several folders by the context: * linkedin - company page visitor, follower and post stats * slack - messaging and member activity * typeform - new member responses * website - website visitors by country, language, device, operating system, screen resolution * youtube - meetup transcriptions

Inspiration

Let's make Riga Data Science Club better! We expect this data to bring lots of insights on how to improve.

"Know your c̶u̶s̶t̶o̶m̶e̶r̶ member" - Explore member interests by analysing sign-up survey (typeform) responses - Explore messaging patterns in Slack to understand how members are retained and when they are lost

Social media intelligence * Define LinkedIn posting strategy based on historical engagement data * Define target user profile based on LinkedIn page attendance data

Website * Define website localisation strategy based on data about visitor countries and languages * Define website responsive design strategy based on data about visitor devices, operating systems and screen resolutions

Have some fun * NLP analysis of meetup transcriptions: word frequencies, question answering, something else?
encode-split-data-50
kaggle.com
zip
Updated Apr 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
quan242 (2023). encode-split-data-50 [Dataset]. https://www.kaggle.com/datasets/quan242/encode-split-data-50
Explore at:
zip(10931547 bytes)Available download formats
Dataset updated
Apr 23, 2023
Authors
quan242
Description
Dataset

This dataset was created by quan242

Contents
d
SplitSmart: An Open Dataset for Enabling Research in Energy-Efficient...
catalog.data.gov
data.openei.org
Updated Oct 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BITS Pilani - Goa (2024). SplitSmart: An Open Dataset for Enabling Research in Energy-Efficient Ductless-Split Air Conditioner [Dataset]. https://catalog.data.gov/dataset/splitsmart-an-open-dataset-for-enabling-research-in-energy-efficient-ductless-split-air-co
Explore at:
Dataset updated
Oct 2, 2024
Dataset provided by
BITS Pilani - Goa
Description
SplitSmart provides a context-rich open dataset to facilitate research in energy efficient ductless-split cooling systems. The objective is to enable research advancements to make the ductless-split cooling systems (aka air conditioners) energy efficient and reduce the associated carbon emissions. The data presented here was collected over a period of four years from 2019 to 2023 in a living lab setting within Birla Institute of Technology and Science (BITS) Pilani, Goa campus using IoT sensors.
m
EV/ESS Split Data Market Industry Size, Share & Insights for 2033
marketresearchintellect.com
Updated Nov 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Intellect (2025). EV/ESS Split Data Market Industry Size, Share & Insights for 2033 [Dataset]. https://www.marketresearchintellect.com/product/global-ev-ess-split-data-market/
Explore at:
Dataset updated
Nov 25, 2025
Dataset authored and provided by
Market Research Intellect
License
https://www.marketresearchintellect.com/privacy-policyhttps://www.marketresearchintellect.com/privacy-policy
Area covered
Global
Description
In 2024, Market Research Intellect valued the EVESS Split Data Market Report at USD 500 million, with expectations to reach USD 1.5 billion by 2033 at a CAGR of 15%.Understand drivers of market demand, strategic innovations, and the role of top competitors.
EA Stock Price
kaggle.com
zip
Updated Sep 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prathamjyot Singh (2024). EA Stock Price [Dataset]. https://www.kaggle.com/datasets/prathamjyotsingh/ea-stocks-latest
Explore at:
zip(98826 bytes)Available download formats
Dataset updated
Sep 17, 2024
Authors
Prathamjyot Singh
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Description

This project involves collecting and analyzing financial data for Electronic Arts (EA) using the Alpha Vantage API. The data includes historical stock prices, dividend payments, and stock splits. The project aims to provide a detailed view of EA’s financial performance and corporate actions over time.

Detail

The project consists of three main datasets:

1) Stock Price Data: Daily records of EA’s stock prices, including opening, high, low, and closing prices, as well as trading volume.

2) Dividend Data: Historical records of dividend payments by EA, detailing declaration dates, record dates, payment dates, and dividend amounts.

3) Stock Split Data: Records of stock split events, showing the date of each split and the split ratio.

The data is sourced from the Alpha Vantage API, which provides comprehensive financial market data. The datasets are cleaned and formatted to ensure consistency and accuracy. They are then saved in CSV files for easy access and analysis.

Usage

The collected data can be used for various financial analyses and insights:

Stock Price Analysis: Evaluate EA’s stock price trends, volatility, and trading volumes over time.

Dividend Analysis: Analyze dividend payment trends, yield, and changes in dividend policy.

Stock Split Analysis: Understand the impact of stock splits on EA’s stock price and overall market behavior.

This data can be used by investors, financial analysts, and researchers to make informed decisions or conduct further financial research. It can also be integrated into financial models or visualizations to provide a clearer picture of EA’s financial health and corporate actions.

Summary

The project provides a detailed dataset of Electronic Arts’ financial data, including stock prices, dividends, and stock splits. By sourcing data from the Alpha Vantage API and carefully formatting it, the project offers valuable insights into EA’s historical financial performance. The data is organized into CSV files, making it accessible for analysis, research, and decision-making purposes.
Z
Downsized camera trap images for automated classification
data.niaid.nih.gov
Updated Dec 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
Explore at:
Dataset updated
Dec 1, 2022
Dataset provided by
Imperial College London
Authors
Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707. Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions Funding: These data were collected as part of research funded by:

NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A) This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

XML metadata: GEMINI compliant metadata for this dataset is available here Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip CT_image_data_info2.xlsx This file contains dataset metadata and 1 data tables:

Dataset Images (described in worksheet Dataset_images) Description: This worksheet details the composition of each dataset used in the analyses Number of fields: 69 Number of data rows: 270287 Fields:

filename: Root ID (Field type: id) camera_trap_site: Site ID for the camera trap location (Field type: location) taxon: Taxon recorded by camera trap (Field type: taxa) dist_level: Level of disturbance at site (Field type: ordered categorical) baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical) increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical) dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_quad_1_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical) dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical) dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance
S
Split Testing Tools Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Split Testing Tools Report [Dataset]. https://www.datainsightsmarket.com/reports/split-testing-tools-1971939
Explore at:
ppt, pdf, docAvailable download formats
Dataset updated
Jun 10, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Discover the booming split testing tools market! Learn about key trends, leading companies like Optimizely and VWO, and projected growth to $6 billion by 2033. Improve your website conversion rates with this insightful market analysis.
Gamelytics: Mobile Analytics Challenge
kaggle.com
zip
Updated Feb 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
letocen (2025). Gamelytics: Mobile Analytics Challenge [Dataset]. https://www.kaggle.com/datasets/debs2x/gamelytics-mobile-analytics-challenge
Explore at:
zip(66154620 bytes)Available download formats
Dataset updated
Feb 16, 2025
Authors
letocen
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Gamelytics: Mobile Analytics Challenge 🎮📊

Subtitle

Unlock key insights into player behavior, optimize game metrics, and make data-driven decisions!

Description

Welcome to the Gamelytics: Mobile Analytics Challenge, a real-world-inspired dataset designed for data enthusiasts eager to dive deep into mobile game analytics. This dataset challenges you to analyze player behavior, evaluate A/B test results, and develop metrics for assessing game event performance.

Project Context & Tasks

Task 1: Retention Analysis

🔍 Objective: Calculate the daily retention rate of players, starting from their registration date.
📄 Data Sources:
- reg_data.csv: Contains user registration timestamps (reg_ts) and unique user IDs (uid).
- auth_data.csv: Contains user login timestamps (auth_ts) and unique user IDs (uid).
💡 Challenge: Develop a Python function to calculate retention, allowing you to test its performance on both the complete dataset and smaller samples.

Task 2: A/B Testing for Promotional Offers

🔍 Objective: Identify the best-performing promotional offer set by comparing key revenue metrics.
💰 Context:
- The test group has a 5% higher ARPU than the control group.
- In the control group, 1928 users out of 202,103 are paying customers.
- In the test group, 1805 users out of 202,667 are paying customers.
📊 Data Sources:
- ab_test.csv: Includes user_id, revenue, and testgroup columns.
💡 Challenge: Decide which offer set performs best, and determine the appropriate metrics for a robust evaluation.

Task 3: Event Performance Evaluation in "Plants & Gardens"

🔍 Objective: Develop metrics to assess the success of a time-limited in-game event where players can earn unique rewards.
🍃 Context: Players complete levels to win exclusive items, bonuses, or coins. In a variation, players may be penalized (sent back levels) after failed attempts.
💡 Challenge: Define how metrics should change under the penalty variation and identify KPIs for evaluating event success.

Dataset Information

The provided data is split into three files, each detailing a specific aspect of the application. Here's a breakdown:

1. User Registration Data (reg_data.csv)

Records: 1,000,000

Columns:

reg_ts: Registration time (Unix time, int64)

uid: Unique user ID (int64)

Memory Usage: 15.3 MB

Description: This dataset contains user registration timestamps and IDs. It is clean and contains no missing data.

2. User Activity Data (auth_data.csv)

Records: 9,601,013

Columns:

auth_ts: Login time (Unix time, int64)

uid: Unique user ID (int64)

Memory Usage: 146.5 MB

Description: This dataset captures user login timestamps and IDs. It is clean and contains no missing data.

3. A/B Testing Data (ab_test.csv)

Records: 404,770

Columns:

user_id: Unique user ID (int64)

revenue: Revenue (int64)

testgroup: Test group (object)

Memory Usage: ~9.3 MB

Description: This dataset provides insights into A/B test results, including revenue and group allocation for each user. It is clean and ready for analysis.

Inspiration & Benefits

Real-World Relevance: Inspired by actual challenges in mobile gaming analytics, this dataset lets you solve meaningful problems.

Diverse Data Types: Work with registration logs, activity timestamps, and experimental results to gain a holistic understanding of mobile game data.

Skill Building: Perfect for those honing skills in retention analysis, A/B testing, and event-based performance evaluation.

Community Driven: Built to inspire collaboration and innovation in the data analytics community. 🚀

Whether you’re a beginner or an expert, this dataset offers an engaging challenge to sharpen your analytical skills and drive actionable insights. Happy analyzing! 🎉📈
m
Dividend 15 Split Corp. Alternative Data Analytics
meyka.com
Updated Oct 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meyka (2025). Dividend 15 Split Corp. Alternative Data Analytics [Dataset]. https://meyka.com/stock/DVSPF/alt-data/
Explore at:
Dataset updated
Oct 6, 2025
Dataset provided by
Meyka
Description
Non-traditional data signals from social media and employment platforms for DVSPF stock analysis
h
tc_ldc_split
huggingface.co
Updated Aug 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lawrence Stewart (2023). tc_ldc_split [Dataset]. https://huggingface.co/datasets/winddude/tc_ldc_split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Lawrence Stewart
Description
This is a split of the two fine tuning datasets from https://huggingface.co/datasets/togethercomputer/Long-Data-Collections split to make analysis, and customization easier.

Licensing Information

Please refer to the original sources of the datasets for information on their respective licenses.
h
Data from: The University of California’s Split with Elsevier
hsscommons.ca
hsscommons.rs-dev.uvic.ca
Updated Apr 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Caroline Winter (2024). The University of California’s Split with Elsevier [Dataset]. http://doi.org/10.25547/WZW8-4X35
Explore at:
Unique identifier
https://doi.org/10.25547/WZW8-4X35
Dataset updated
Apr 11, 2024
Dataset provided by
Canadian HSS Commons
Authors
Caroline Winter
Description
On February 28, 2019, the University of California (UC) announced that it would not renew its subscriptions to Elsevier journals. UC is a public research university in California, USA, with 10 campuses across the state.
Emodata_v2
kaggle.com
zip
Updated Aug 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Las HTN (2024). Emodata_v2 [Dataset]. https://www.kaggle.com/datasets/lashtn/emodata-v2
Explore at:
zip(138625038 bytes)Available download formats
Dataset updated
Aug 2, 2024
Authors
Las HTN
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Description: emodata

The emodata dataset is designed to analyze and predict emotions based on numerical labels and pixel data. It is structured to include information about emotion labels, pixel values, and their usage in training and testing. Below is a detailed description of the dataset:

1. General Information - Purpose: Emotion analysis and prediction based on numerical scales and pixel data. - Total Samples: 49,400 - Emotion Labels: Represented as numerical intervals, each corresponding to a specific emotional intensity or category. - Pixel Data: Images are represented as pixel intensity values. - Data Split: - Training set: 82% of the data - Testing set: 18% of the data

2. Emotion Labels

The labels are grouped into numerical intervals to categorize emotional intensity or types. Each interval corresponds to the count of samples:

0.00 - 0.30: 6,221 samples

0.90 - 1.20: 6,319 samples

1.80 - 2.10: 6,420 samples

3.00 - 3.30: 8,789 samples

3.90 - 4.20: 7,498 samples

4.80 - 5.10: 7,377 samples

5.70 - 6.00: 6,763 samples

Statistical Summary:

Mean: 3.1

Standard Deviation: 1.94

Quantiles:

Minimum: 0

25%: 1

Median: 3

75%: 5

Maximum: 6

3. Pixel Data

Unique Values:

Total Unique Values: 34,000

Most Common Pixel Intensities: Common pixel intensity values for various samples are listed, indicating grayscale or color representation.

Pixel Usage:

Training: 82%

Testing: 18%

4. Data Quality

Valid Samples: 100% (49.4k samples)

Mismatched Samples: 0%

Missing Samples: 0%

5. Usage

This dataset is particularly suited for: - Emotion Classification Tasks: Training machine learning models to classify emotions based on numerical and image data. - Deep Learning Tasks: Utilizing pixel intensity data for convolutional neural networks (CNNs) to predict emotional states. - Statistical Analysis: Exploring the distribution of emotional intensities and their relationship with image features.

Potential Applications

Sentiment Analysis

Emotion Detection in Images

Human-Computer Interaction Systems

AI-based Feedback Systems

This dataset provides a comprehensive structure for emotion analysis through a combination of numerical and image data, making it versatile for both machine learning and deep learning applications.
t
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
html, pdf, zip
Updated Mar 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Explore at:
html, zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Mar 19, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How To Cite?

Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: Just to know how the features are calculated, you can run this code after the segmentation with the sliding window and run these files to calculate the features from the segmented data.

5. training_prediction.py

This file contains the main machine learning analysis of the paper. This file contains all the code for the training of the model, its evaluation, and its use for the inference of the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

iii. One for the XGboost code with correct hyperparameter tuning:
Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically test the confidence threshold of

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
N
Split Rock Township, Minnesota Population Pyramid Dataset: Age Groups, Male...
neilsberg.com
csv, json
Updated Sep 16, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2023). Split Rock Township, Minnesota Population Pyramid Dataset: Age Groups, Male and Female Population, and Total Population for Demographics Analysis [Dataset]. https://www.neilsberg.com/research/datasets/63655a02-3d85-11ee-9abe-0aa64bf2eeb2/
Explore at:
json, csvAvailable download formats
Dataset updated
Sep 16, 2023
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Minnesota, Split Rock Township
Variables measured
Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Total Population for Age Groups, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, and 9 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the three variables, namely (a) male population, (b) female population and (b) total population, we initially analyzed and categorized the data for each of the age groups. For age groups we divided it into roughly a 5 year bucket for ages between 0 and 85. For over 85, we aggregated data into a single group for all ages. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the data for the Split Rock Township, Minnesota population pyramid, which represents the Split Rock township population distribution across age and gender, using estimates from the U.S. Census Bureau American Community Survey 5-Year estimates. It lists the male and female population for each age group, along with the total population for those age groups. Higher numbers at the bottom of the table suggest population growth, whereas higher numbers at the top indicate declining birth rates. Furthermore, the dataset can be utilized to understand the youth dependency ratio, old-age dependency ratio, total dependency ratio, and potential support ratio.

Key observations

Youth dependency ratio, which is the number of children aged 0-14 per 100 persons aged 15-64, for Split Rock Township, Minnesota, is 15.5.

Old-age dependency ratio, which is the number of persons aged 65 or over per 100 persons aged 15-64, for Split Rock Township, Minnesota, is 26.7.

Total dependency ratio for Split Rock Township, Minnesota is 42.2.

Potential support ratio, which is the number of youth (working age population) per elderly, for Split Rock Township, Minnesota is 3.7.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Variables / Data Columns

Age Group: This column displays the age group for the Split Rock township population analysis. Total expected values are 18 and are define above in the age groups section.

Population (Male): The male population in the Split Rock township for the selected age group is shown in the following column.

Population (Female): The female population in the Split Rock township for the selected age group is shown in the following column.

Total Population: The total population of the Split Rock township for the selected age group is shown in the following column.

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Split Rock township Population by Age. You can refer the same here

Netflix Data: Cleaning, Analysis and Visualization

kaggle.com

zip

Updated Aug 26, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization

Explore at:

zip(276607 bytes)Available download formats

Dataset updated

Aug 26, 2022

Authors

Abdulrasaq Ariyo

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .

Data Cleaning

We are going to: 1. Treat the Nulls 2. Treat the duplicates 3. Populate missing rows 4. Drop unneeded columns 5. Split columns Extra steps and more explanation on the process will be explained through the code comments

--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                  
SELECT show_id, COUNT(*)                                                                                      
FROM netflix 
GROUP BY show_id                                                                                              
ORDER BY show_id DESC;

--No duplicates

--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
    COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
    COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
    COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
    COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
    COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
    COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
    COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
    COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
    COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;

We can see that there are NULLS. 
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3

The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

With this, we can now populate NULL rows in directors 
using their record with movie_cast

UPDATE netflix 
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;

--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--When I was doing this, I found a less complex and faster way to populate a column which I will use next

Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column

--Populate the country using the director column

SELECT COALESCE(nt.country,nt2.country) 
FROM netflix AS nt
JOIN netflix AS nt2 
ON nt.director = nt2.director 
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
AND netflix.country IS NULL;


--To confirm if there are still directors linked to country that refuse to update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;

The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization

--Show date_added nulls

SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;

--DELETE nulls

DELETE F...

l
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v1
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)August 2019 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk) Supervised by Prof Alexander Gorban and Dr Evgeny MirkesThe data is extracted from the Web of Science® [1] You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.Getting StartedThis text provides background information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the sense of research texts. One of the goal of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English.The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018.Each document in the corpus contains the following parts:1. Authors: The list of authors of the paper2. Title: The title of the paper3. Abstract: The abstract of the paper4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’.5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’.6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4]7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824.All documents in LSC have nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with empty authors list, we did not exclude these documents.Data ProcessingThis section describes all steps in order for the LSC to be collected, clean and available to researchers. Processing the data consists of six main steps:Step 1: Downloading of the Data OnlineThis is the step of collecting the dataset online. This is done manually by exporting documents as Tab-delimitated files. All downloaded documents are available online.Step 2: Importing the Dataset to RThis is the process of converting the collection to RData format for processing the data. The LSC was collected as TXT files. All documents are extracted to R.Step 3: Cleaning the Data from Documents with Empty Abstract or without CategoryNot all papers have abstract and categories in the collection. As our research is based on the analysis of abstracts and categories, preliminary detecting and removing inaccurate documents were performed. All documents with empty abstracts and documents without categories are removed.Step 4: Identification and Correction of Concatenate Words in AbstractsTraditionally, abstracts are written in a format of executive summary with one paragraph of continuous writing, which is known as ‘unstructured abstract’. However, especially medicine-related publications use ‘structured abstracts’. Such type of abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc.Used tool for extracting abstracts leads concatenate words of section headings with the first word of the section. As a result, some of structured abstracts in the LSC require additional process of correction to split such concatenate words. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. in the corpus. The detection and identification of concatenate words cannot be totally automated. Human intervention is needed in the identification of possible headings of sections. We note that we only consider concatenate words in headings of sections as it is not possible to detect all concatenate words without deep knowledge of research areas. Identification of such words is done by sampling of medicine-related publications. The section headings in such abstracts are listed in the List 1.List 1 Headings of sections identified in structured abstractsBackground Method(s) DesignTheoretical Measurement(s) LocationAim(s) Methodology ProcessAbstract Population ApproachObjective(s) Purpose(s) Subject(s)Introduction Implication(s) Patient(s)Procedure(s) Hypothesis Measure(s)Setting(s) Limitation(s) DiscussionConclusion(s) Result(s) Finding(s)Material (s) Rationale(s)Implications for health and nursing policyAll words including headings in the List 1 are detected in entire corpus, and then words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.Step 5: Extracting (Sub-setting) the Data Based on Lengths of AbstractsAfter correction of concatenate words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the totalnumber of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].According to APA style manual [6], an abstract should contain between 150 to 250 words. However, word limits vary from journal to journal. For instance, Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’[7].In LSC, the length of abstracts varies from 1 to 3805. We decided to limit length of abstracts from 30 to 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of the length to the analysis. Documents containing less than 30 and more than 500 words in abstracts are removed.Step 6: Saving the Dataset into CSV FormatCorrected and extracted documents are saved into 36 CSV files. The structure of files are described in the following section.The Structure of Fields in CSV FilesIn CSV files, the information is organised with one record on each line and parts of abstract, title, list of authors, list of categories, list of research areas, and times cited is recorded in separated fields.To access the LSC for research purposes, please email to ns433@le.ac.uk.References[1]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[2]WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html[3]Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html[4]Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US[5]Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3[6]A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.[7]P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.

Facebook

Twitter

Click to copy link

Link copied

Cite

KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news

Fake News Detection

Training the dataset using the Train-Test Split

Explore at:

zip(38846301 bytes)Available download formats

Dataset updated

Nov 4, 2025

Authors

KranNaik777

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." These articles have been collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. On average, articles vary in length, providing a rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification. It often contains thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Additionally, some versions of the dataset may also include image URLs for multimodal analysis, expanding the detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

Here is a description for each column header of the fake news dataset:

id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

headline: The title or headline of the news article, summarizing the key news story in brief.

written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

news: The full text content of the news article, which is the main body used for analysis and classification.

label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.

Clear search

Close search

Google apps

Main menu

Fake News Detection

DustNet - structured data and Python code to reproduce the model,...

deepfakeguard-ml-dataset

Ad-hoc statistical analysis: 2020/21 Quarter 2

July 2020 - DCMS Economic Estimates: Number of businesses and Gross Value Added (GVA) by turnover band (2018)

https://assets.publishing.service.gov.uk/media/5f05e78ce90e0712cc90b6f7/dcms-businesses-turnover-split-by-number-and-gva-2018.xlsx">Number and Gross Value Added by businesses in DCMS sectors, split by annual turnover, 2018

July 2020 - ONS Opinions and Lifestyle Omnibus Survey, February 2020 Data Module

Riga Data Science Club

Context

Content

Inspiration

encode-split-data-50

Dataset

Contents

SplitSmart: An Open Dataset for Enabling Research in Energy-Efficient...

EV/ESS Split Data Market Industry Size, Share & Insights for 2033

EA Stock Price

Description

Detail

The project consists of three main datasets:

Usage

The collected data can be used for various financial analyses and insights:

Summary

Downsized camera trap images for automated classification

Split Testing Tools Report

Gamelytics: Mobile Analytics Challenge

Gamelytics: Mobile Analytics Challenge 🎮📊

Subtitle

Description

Project Context & Tasks

Task 1: Retention Analysis

Task 2: A/B Testing for Promotional Offers

Task 3: Event Performance Evaluation in "Plants & Gardens"

Dataset Information

1. User Registration Data (reg_data.csv)

2. User Activity Data (auth_data.csv)

3. A/B Testing Data (ab_test.csv)

Inspiration & Benefits

Dividend 15 Split Corp. Alternative Data Analytics

tc_ldc_split

Data from: The University of California’s Split with Elsevier

Emodata_v2

2. Emotion Labels

3. Pixel Data

4. Data Quality

5. Usage

Potential Applications

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

How To Cite?

Folder Structure

Setting Up the Environment

Subfolders

1. Data_4_IJGIS

2. results_[DateTime] (e.g., results_20240906_15_00_13)

Python Files

1. helper_functions.py

2. create_sanity_plots.py

3. overlapping_sliding_window_loop.py

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

5. training_prediction.py

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

b. Training/Validation/Test Split

c. Machine and Deep Learning Experiments

d. Inference (Monitoring Part)

6. sequence_analysis.py

Licenses

Split Rock Township, Minnesota Population Pyramid Dataset: Age Groups, Male...

About this dataset

Content

Inspiration

Recommended for further research

Netflix Data: Cleaning, Analysis and Visualization

Data Cleaning

LSC (Leicester Scientific Corpus)

Fake News Detection

Training the dataset using the Train-Test Split

1. User Registration Data (`reg_data.csv`)

2. User Activity Data (`auth_data.csv`)

3. A/B Testing Data (`ab_test.csv`)