23 datasets found

Clickstream data for online shopping
kaggle.com
Updated Apr 22, 2021
Cite
Long Luu (2021). Clickstream data for online shopping [Dataset]. https://www.kaggle.com/aeryss/clickstream-data-for-online-shopping/code
Dataset updated
Apr 22, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Long Luu
Description
Dataset

This dataset was created by Long Luu

Contents
Clickstream for Online Shopping Dataset
cubig.ai
Cite
CUBIG, Clickstream for Online Shopping Dataset [Dataset]. https://cubig.ai/store/products/376/clickstream-for-online-shopping-dataset
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
Description
1) Data Introduction
• The Clickstream Data for Online Shopping is an e-commerce analysis dataset that summarizes user clickstream, product information, country, price, and other session-specific behavior data from April to August 2008 at an online shopping mall specializing in maternity clothing.

2) Data Utilization
(1) Characteristics of Clickstream Data for Online Shopping:
• Each row contains 14 key variables: year, month, day, click order, country (by access IP), session ID, main category, product code, color, photo location, model photo type, price, category average price, and page number.
• The data is structured to enable analysis of various consumer behaviors, such as click flows for each session, product attributes, and country-specific access patterns.
(2) Uses of Clickstream Data for Online Shopping:
• Online shopping mall user behavior analysis: using clickstream, session, and product information, you can analyze purchase conversion routes, popular products, and behavioral patterns by country and category.
• Marketing strategy and UI/UX improvement: analyze the relationship between product photo location, color, price, etc. and click behavior, and apply the findings to marketing strategies and to the shopping mall UI/UX.
AI-Driven Consumer Behavior Dataset
kaggle.com
Updated Mar 10, 2025
Cite
Ziya (2025). AI-Driven Consumer Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/ai-driven-consumer-behavior-dataset/discussion?sort=undefined
Dataset updated
Mar 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ziya
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This AI-Driven Consumer Behavior Dataset captures key aspects of online shopping behavior, including purchase decisions, browsing activity, customer reviews, and demographic details. The dataset is designed for research in consumer behavior analysis, AI-driven recommendation systems, and digital marketing optimization.

Key Features:
✔ Consumer Purchase Data – Tracks product purchases, prices, discounts, and payment methods.
✔ Clickstream Data – Includes browsing behavior, pages visited, session duration, and cart abandonment.
✔ Customer Reviews & Sentiments – Provides ratings, textual reviews, and sentiment analysis scores.
✔ Demographic Information – Includes age, gender, location, and income levels.
✔ Target Column (purchase_decision) – Indicates whether a customer completed a purchase (1) or not (0).
grass-clickstream-dataset
huggingface.co
Updated Aug 26, 2025
Cite
Grass (2025). grass-clickstream-dataset [Dataset]. https://huggingface.co/datasets/GrassData/grass-clickstream-dataset
Dataset updated
Aug 26, 2025
Dataset authored and provided by
Grass
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Grass Clickstream Dataset

Wynd Labs

This is the clickstream dataset produced by the team at Wynd Labs. The provided embeddings are an aggregate of clip embeddings produced from selected keyframes of the respective video. We intend these embeddings to be used for task-specific clustering and automatic segmentation. If it clips, it ships.
Swash User Search and Consumer Journey Data - 1.5M Worldwide Users - GDPR...
datarade.ai
.csv, .xls
Updated Jun 27, 2023
+ more versions
Cite
Swash (2023). Swash User Search and Consumer Journey Data - 1.5M Worldwide Users - GDPR Compliant [Dataset]. https://datarade.ai/data-products/users-searching-data-on-top-search-engines
Available download formats: .csv, .xls
Dataset updated
Jun 27, 2023
Dataset authored and provided by
Swash
Area covered
Korea (Republic of), Taiwan, Panama, Honduras, Bangladesh, United States of America, Israel, Macao, Japan, Kuwait
Description
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.

Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.

User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.

Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.

GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.

Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.

High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.

Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.

Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
Clickstream 2008 E-commerce Dataset
kaggle.com
Updated May 9, 2024
Cite
Dev Patel (2024). Clickstream 2008 E-commerce Dataset [Dataset]. https://www.kaggle.com/datasets/ddevvedd/e-commerce-2008/suggestions
Dataset updated
May 9, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dev Patel
Description
Dataset

This dataset was created by Dev Patel

Contents
Modeling online browsing and path analysis using clickstream data - Dataset...
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Modeling online browsing and path analysis using clickstream data - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/modeling-online-browsing-and-path-analysis-using-clickstream-data
Dataset updated
Dec 16, 2024
Description
Modeling online browsing and path analysis using clickstream data.
Data from: Click stream dataset
kaggle.com
Updated Jul 1, 2025
Cite
Raghu Mariswamegowda (2025). Click stream dataset [Dataset]. https://www.kaggle.com/datasets/raghumariswamegowda/click-stream-dataset/suggestions
Dataset updated
Jul 1, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Raghu Mariswamegowda
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset

This dataset was created by Raghu Mariswamegowda

Released under Apache 2.0

Contents
Data ClickStream Banco Galicia 2019
kaggle.com
zip
Updated Aug 29, 2019
Cite
Federico Garcia Blanco (2019). Data ClickStream Banco Galicia 2019 [Dataset]. https://www.kaggle.com/fgarciablanco/data-clickstream-banco-galicia-2019
Available download formats: zip (212409866 bytes)
Dataset updated
Aug 29, 2019
Authors
Federico Garcia Blanco
Description
Dataset

This dataset was created by Federico Garcia Blanco

Contents
E-Shop Clothing Dataset
kaggle.com
Updated Aug 1, 2021
Cite
The citation is currently not available for this dataset.
Dataset updated
Aug 1, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aditya Wisnugraha S
Description
Data description “e-shop clothing 2008”

Variables:

YEAR (2008)

========================================================

MONTH -> from April (4) to August (8)

========================================================

DAY -> day number of the month

========================================================

ORDER -> sequence of clicks during one session

========================================================

COUNTRY -> variable indicating the country of origin of the IP address with the following categories:

1-Australia 2-Austria 3-Belgium 4-British Virgin Islands 5-Cayman Islands 6-Christmas Island 7-Croatia 8-Cyprus 9-Czech Republic 10-Denmark 11-Estonia 12-unidentified 13-Faroe Islands 14-Finland 15-France 16-Germany 17-Greece 18-Hungary 19-Iceland 20-India 21-Ireland 22-Italy 23-Latvia 24-Lithuania 25-Luxembourg 26-Mexico 27-Netherlands 28-Norway 29-Poland 30-Portugal 31-Romania 32-Russia 33-San Marino 34-Slovakia 35-Slovenia 36-Spain 37-Sweden 38-Switzerland 39-Ukraine 40-United Arab Emirates 41-United Kingdom 42-USA 43-biz (.biz) 44-com (.com) 45-int (.int) 46-net (.net) 47-org (.org)

========================================================

SESSION ID -> variable indicating session id (short record)

========================================================

PAGE 1 (MAIN CATEGORY) -> concerns the main product category: 1-trousers 2-skirts 3-blouses 4-sale

========================================================

PAGE 2 (CLOTHING MODEL) -> contains information about the code for each product (217 products)

========================================================

COLOUR -> colour of product

1-beige 2-black 3-blue 4-brown 5-burgundy 6-gray 7-green 8-navy blue 9-of many colors 10-olive 11-pink 12-red 13-violet 14-white

========================================================

LOCATION -> photo location on the page, the screen has been divided into six parts:

1-top left 2-top in the middle 3-top right 4-bottom left 5-bottom in the middle 6-bottom right

========================================================

MODEL PHOTOGRAPHY -> variable with two categories:

1-en face 2-profile

========================================================

PRICE -> price in US dollars

========================================================

PRICE 2 -> variable informing whether the price of a particular product is higher than the average price for the entire product category

1-yes 2-no

========================================================

PAGE -> page number within the e-store website (from 1 to 5)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I would like to see how this data can be used for any kind of problem (clustering, regression, classification, EDA).

Source: https://archive.ics.uci.edu/ml/datasets/clickstream+data+for+online+shopping
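
The coded variables described above can be turned back into readable labels before any analysis. The following is a minimal sketch, not the dataset author's method: the dictionary keys used for a click record (`main_category`, `location`, `model_photography`, `price_2`) and the sample record are hypothetical, while the code-to-label mappings are copied from the variable description.

```python
# Code-to-label mappings taken from the variable description above.
MAIN_CATEGORY = {1: "trousers", 2: "skirts", 3: "blouses", 4: "sale"}
LOCATION = {
    1: "top left", 2: "top in the middle", 3: "top right",
    4: "bottom left", 5: "bottom in the middle", 6: "bottom right",
}
MODEL_PHOTOGRAPHY = {1: "en face", 2: "profile"}


def decode_click(click):
    """Return a copy of one click record with coded fields replaced by labels.

    The field names are assumptions for illustration; adjust them to the
    column names of the file you actually load.
    """
    decoded = dict(click)
    decoded["main_category"] = MAIN_CATEGORY[click["main_category"]]
    decoded["location"] = LOCATION[click["location"]]
    decoded["model_photography"] = MODEL_PHOTOGRAPHY[click["model_photography"]]
    # PRICE 2 uses 1 for "yes" (above the category average) and 2 for "no".
    decoded["above_category_average"] = click["price_2"] == 1
    return decoded


# Made-up sample record, for illustration only.
sample = {"main_category": 2, "location": 5, "model_photography": 1,
          "price": 38, "price_2": 2}
decoded_sample = decode_click(sample)
print(decoded_sample["main_category"], "/", decoded_sample["location"])
```

Decoding up front keeps later grouping and plotting code readable, at the cost of larger string columns.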
Simple English Wikipedia Link Graph with Clickstream Transitions 2018-12 -...
rdm.inesctec.pt
Updated Mar 6, 2019
Cite
(2019). Simple English Wikipedia Link Graph with Clickstream Transitions 2018-12 - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2018-004
Dataset updated
Mar 6, 2019
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
The Simple English Wikipedia Link Graph with Clickstream Transitions is a gzipped GML file representing the hyperlink graph of the Simple English Wikipedia. It was prepared using the "pagelinks" and "page" SQL dumps for 2019-01-01 and extended with an edge property called "transitions" based on the Clickstream dump for the English Wikipedia from 2018-12. It was designed to be used as a ground truth to evaluate node ranking metrics, like PageRank, but it can be useful for Network Science in general, or for Machine Learning and Information Retrieval to compute features over a medium-sized, complete Wikipedia link graph.
Datasys | Clickstream Data | Gamer Audiences (10M+ gamers | PC, console &...
data.datasys.com
Updated Sep 30, 2025
Cite
Datasys (2025). Datasys | Clickstream Data | Gamer Audiences (10M+ gamers | PC, console & mobile) [Dataset]. https://data.datasys.com/products/datasys-clickstream-data-gamer-audiences-10m-gamers-p-datasys
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Datasys
Area covered
Lebanon, Falkland Islands (Malvinas), Saudi Arabia, Peru, Thailand, China, Israel, Ecuador, North Korea, Bahamas
Description
Datasys Gamer Audiences dataset tracks 10M+ gaming consumers, including platform usage, time spent, and title engagement.
Wikipedia Clickstream
figshare.com
application/gzip
Updated May 31, 2023
Cite
Ellery Wulczyn; Dario Taraborelli (2023). Wikipedia Clickstream [Dataset]. http://doi.org/10.6084/m9.figshare.1305770.v16
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.1305770.v16
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Ellery Wulczyn; Dario Taraborelli
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This project contains data sets containing counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. For more information and documentation, see the link in the references section below.
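
The weighted network mentioned in the description can be sketched as a nested dictionary keyed by referer and then by resource. This is an illustrative sketch only: the function name and the sample rows are made up, and the real dump's file layout is not assumed here.

```python
from collections import defaultdict


def build_weighted_graph(pairs):
    """Build {referer: {resource: weight}} from (referer, resource, count) rows.

    Repeated pairs are summed, so each edge weight is the total number of
    observed navigations from one page to the other.
    """
    graph = defaultdict(dict)
    for referer, resource, count in pairs:
        graph[referer][resource] = graph[referer].get(resource, 0) + count
    return dict(graph)


# Made-up rows for illustration; they are not taken from the dataset.
pairs = [
    ("Cat", "Dog", 120),
    ("Cat", "Lion", 45),
    ("Dog", "Cat", 30),
    ("Cat", "Dog", 5),
]
graph = build_weighted_graph(pairs)
print(graph["Cat"])
```

The graph is directed: the weight from "Cat" to "Dog" is tracked separately from the weight from "Dog" to "Cat".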
Mobile Web Clickstream | 1st Party | 3B+ events verified, US consumers |...
omnitrafficdata.mfour.com
datarade.ai
Updated Aug 1, 2021
+ more versions
Cite
MFour (2021). Mobile Web Clickstream | 1st Party | 3B+ events verified, US consumers | Safari, Chrome, any iOS or Android [Dataset]. https://omnitrafficdata.mfour.com/products/mobile-web-clickstream-1st-party-3b-events-verified-us-mfour
Dataset updated
Aug 1, 2021
Dataset authored and provided by
MFour
Area covered
United States
Description
This dataset encompasses mobile web clickstream behavior on any browser, collected from over 150,000 triple-opt-in first-party US Daily Active Users (DAU). Use it for measurement, attribution or path to purchase and consumer journey understanding. Full URL deliverable available including searches.
Data from: Data-driven E-commerce UI Personalization: Going Beyond Product...
data.mendeley.com
Updated Dec 29, 2023
Cite
Adam Wasilewski (2023). Data-driven E-commerce UI Personalization: Going Beyond Product Recommendations [Dataset]. http://doi.org/10.17632/sxmgyvxpv9.1
Unique identifier
https://doi.org/10.17632/sxmgyvxpv9.1
Dataset updated
Dec 29, 2023
Authors
Adam Wasilewski
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The dataset includes:
1. online store customer behavior data (clickstream) from 1.04.-30.11.2023, used to cluster customers and evaluate the effectiveness of implemented modifications (catalog: learning-dataset)
2. clustering results to verify the effectiveness of implemented changes (catalog: clustering)
3. detailed data for calculation of macro-conversion indicators (catalog: macro-conversion-indicators)
4. detailed data for calculation of micro-conversion indicators (catalog: micro-conversion-indicators)
Image Dataset for Predicting Early Dropouts in DigitalLearning Platforms
figshare.com
zip
Updated Jan 26, 2025
Cite
Nishant Sharma; Manish Kumar Pandey; M Ali Akber Dewan (2025). Image Dataset for Predicting Early Dropouts in DigitalLearning Platforms [Dataset]. http://doi.org/10.6084/m9.figshare.28282832.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28282832.v2
Dataset updated
Jan 26, 2025
Dataset provided by
figshare
Authors
Nishant Sharma; Manish Kumar Pandey; M Ali Akber Dewan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This article presents a student click-stream database comprising 120542 train images and 80362 test images, where each directory contains two subdirectories, "Dropouts" and "NonDropouts", as two different classes. The original dataset was provided by the KDD Cup Challenge 2015, with data supplied by the Chinese MOOC (Massive Open Online Course) platform XuetangX. These samples were captured from clickstream (user) activity on the platform. We transformed the KDD-Cup 2015 dataset into an image dataset. This transformation will enable the application of novel deep learning and computer vision techniques to develop more sustainable, accurate, and robust predictive models for identifying students at risk of dropping out, and will enable MOOC platforms to design highly robust Early Warning Systems. Furthermore, this dataset will be made publicly available to the research community to advance interdisciplinary research at the intersection of education and computer vision.
DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS
narcis.nl
data.mendeley.com
Updated Mar 13, 2019
+ more versions
Cite
Constante, F (via Mendeley Data) (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.5
Unique identifier
https://doi.org/10.17632/8gx2fvg2k6.5
Dataset updated
Mar 13, 2019
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Constante, F (via Mendeley Data)
Description
A dataset of supply chains used by the company DataCo Global was used for the analysis. The supply chain dataset allows the use of machine learning algorithms and R software. Areas of important registered activities: Provisioning, Production, Sales, Commercial Distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.

Type of data:
Structured data: DataCoSupplyChainDataset.csv
Unstructured data: tokenized_access_logs.csv (Clickstream)

Types of products: Clothing, Sports, and Electronic Supplies

Additionally, a separate file, DescriptionDataCoSupplyChain.csv, describes each of the variables in DataCoSupplyChainDatasetc.csv.
Language Specific Event Recommendation Ground Truth
zenodo.org
csv, txt
Updated Dec 1, 2021
Cite
Sara Abdollahi; Simon Gottschalk; Elena Demidova (2021). Language Specific Event Recommendation Ground Truth [Dataset]. http://doi.org/10.5281/zenodo.5735580
Available download formats: txt, csv
Unique identifier
https://doi.org/10.5281/zenodo.5735580
Dataset updated
Dec 1, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sara Abdollahi; Simon Gottschalk; Elena Demidova
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This is a multilingual ground truth dataset for training, evaluating and testing the LaSER (Language-Specific Event Recommendation) model. It contains language-specific relevance scores for event-centric click-through pairs according to the publicly available Clickstream dataset in German, French and Russian as well as the user study annotations conducted for evaluating the language-specific recommendations by LaSER. For more details, refer to EventKG+Click and LaSER.

This dataset consists of two sets of files as follows:
1. The ground truth dataset that is used for training the learning to rank (LTR) model in LaSER in three languages. The following files contain the language-specific relevance scores between a source and target entity based on EventKG+Click dataset:

german_ground_truth.txt

french_ground_truth.txt

russian_ground_truth.txt

In these files source and target represent the label of entities and events in the respective language.

2. The second set contains the user study participants' annotations regarding different relevance criteria of recommended events by LaSER. The following three files contain the annotations of at least three participants per event:

german_user_study_annotations.csv

french_user_study_annotations.csv

russian_user_study_annotations.csv

In these files, "r1", "r2" and "r3" denote relevance to the topic, language community and general audience respectively. And topic and event represent the wikidata-id of entities and events.
Student oriented subset of the Open University Learning Analytics dataset
data.niaid.nih.gov
Updated Sep 30, 2021
Cite
Gabriella Casalino (2021). Student oriented subset of the Open University Learning Analytics dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4264396
Dataset updated
Sep 30, 2021
Dataset provided by
Giovanna Castellano
Gennaro Vessio
Gabriella Casalino
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The Open University (OU) dataset is an open database containing student demographic and click-stream interaction with the virtual learning platform. The available data are structured in different CSV files. You can find more information about the original dataset at the following link: https://analyse.kmi.open.ac.uk/open_dataset.

We extracted a subset of the original dataset that focuses on student information. 25,819 records were collected referring to a specific student, course and semester. Each record is described by the following 20 attributes: code_module, code_presentation, gender, highest_education, imd_band, age_band, num_of_prev_attempts, studies_credits, disability, resource, homepage, forum, glossary, outcontent, subpage, url, outcollaborate, quiz, AvgScore, count.

Two target classes were considered, namely Fail and Pass, combining the original four classes (Fail and Withdrawn and Pass and Distinction, respectively). The final_result attribute contains the target values.

All features have been converted to numbers for automatic processing.

Below is the mapping used to convert categorical values to numeric:

code_module: 'AAA'=0, 'BBB'=1, 'CCC'=2, 'DDD'=3, 'EEE'=4, 'FFF'=5, 'GGG'=6

code_presentation: '2013B'=0, '2013J'=1, '2014B'=2, '2014J'=3

gender: 'F'=0, 'M'=1

highest_education: 'No_Formal_quals'=0, 'Post_Graduate_Qualification'=1, 'HE_Qualification'=2, 'Lower_Than_A_Level'=3, 'A_level_or_Equivalent'=4

imd_band: 'unknown'=0, 'between_0_and_10_percent'=1, 'between_10_and_20_percent'=2, 'between_20_and_30_percent'=3, 'between_30_and_40_percent'=4, 'between_40_and_50_percent'=5, 'between_50_and_60_percent'=6, 'between_60_and_70_percent'=7, 'between_70_and_80_percent'=8, 'between_80_and_90_percent'=9, 'between_90_and_100_percent'=10

age_band: 'between_0_and_35'=0, 'between_35_and_55'=1, 'higher_than_55'=2

disability: 'N'=0, 'Y'=1

student's outcome: 'Fail'=0, 'Pass'=1
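
The categorical-to-numeric mapping listed above can be sketched in code as plain dictionaries. This is a hedged illustration rather than the authors' preprocessing script: the function name and the sample record are hypothetical, only a few of the listed attributes are shown, and the four-to-two class merge follows the description (Fail and Withdrawn become Fail, Pass and Distinction become Pass).

```python
# Mappings copied from the list above (subset shown for brevity).
GENDER = {"F": 0, "M": 1}
AGE_BAND = {"between_0_and_35": 0, "between_35_and_55": 1, "higher_than_55": 2}
DISABILITY = {"N": 0, "Y": 1}

# Original four outcomes merged into the two target classes:
# Fail/Withdrawn -> 0 ('Fail'), Pass/Distinction -> 1 ('Pass').
FINAL_RESULT = {"Fail": 0, "Withdrawn": 0, "Pass": 1, "Distinction": 1}

ENCODERS = {
    "gender": GENDER,
    "age_band": AGE_BAND,
    "disability": DISABILITY,
    "final_result": FINAL_RESULT,
}


def encode_student(record):
    """Encode the categorical fields of one student record as integers.

    Fields without an encoder are passed through unchanged.
    """
    return {
        field: ENCODERS[field][value] if field in ENCODERS else value
        for field, value in record.items()
    }


# Made-up record for illustration only.
student = {"gender": "F", "age_band": "higher_than_55",
           "disability": "N", "quiz": 12, "final_result": "Withdrawn"}
encoded_student = encode_student(student)
print(encoded_student)
```

Keeping the mappings in one place makes it straightforward to invert them later when reporting results in the original labels.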

For more detailed information, please refer to:

Casalino G., Castellano G., Vessio G. (2021) Exploiting Time in Adaptive Learning from Educational Data. In: Agrati L.S. et al. (eds) Bridges and Mediation in Higher Distance Education. HELMeTO 2020. Communications in Computer and Information Science, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-67435-9_1
COVID-19 Pandemic Wikipedia Readership
figshare.com
txt
Updated May 31, 2023
Cite
Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin (2023). COVID-19 Pandemic Wikipedia Readership [Dataset]. http://doi.org/10.6084/m9.figshare.14548032.v3
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.14548032.v3
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This data release includes two Wikipedia datasets related to the readership of the project during the early COVID-19 pandemic period. The first dataset is COVID-19 article page views by country; the second dataset is one-hop navigation where one of the two pages is COVID-19 related. The data covers roughly the first six months of the pandemic, more specifically from January 1st 2020 to June 30th 2020. For more background on the pandemic in those months, see English Wikipedia's Timeline of the COVID-19 pandemic.

Wikipedia articles are considered COVID-19 related according to the methodology described here; the list of COVID-19 articles used for the released datasets is available in covid_articles.tsv. For simplicity and transparency, the same list of articles from 20 April 2020 was used for the entire dataset, though in practice new COVID-19-relevant articles were constantly being created as the pandemic evolved.

Privacy considerations

While this data is considered valuable for the insight that it can provide about information-seeking behaviors around the pandemic in its early months across diverse geographies, care must be taken to not inadvertently reveal information about the behavior of individual Wikipedia readers. We put in place a number of filters to release as much data as we can while minimizing the risk to readers.

The Wikimedia Foundation started to release most viewed articles by country from Jan 2021. At the beginning of the COVID-19 pandemic an exemption was made to store reader data about the pandemic with additional privacy protections:
- exclude the page views from users engaged in an edit session
- exclude reader data from specific countries (with a few exceptions)
- the aggregated statistics are based on 50% of reader sessions that involve a pageview to a COVID-19-related article (see covid_pages.tsv). As a control, a 1% random sample of reader sessions that have no pageviews to COVID-19-related articles was kept. In aggregate, we make sure this 1% non-COVID-19 sample and 50% COVID-19 sample represent less than 10% of pageviews for a country for that day. The randomization and filters occur on a daily cadence with all timestamps in UTC.
- exclude power users - i.e. userhashes with greater than 500 pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
- exclude readership from users of the iOS and Android Wikipedia apps.

In effect, the view counts in this dataset represent comparable trends rather than the total amount of traffic from a given country. For more background on readership data per country, and the COVID-19 privacy protections in particular, see this phabricator.

To further minimize privacy risks, a k-anonymity threshold of 100 was applied to the aggregated counts. For example, a page needs to be viewed at least 100 times in a given country and week in order to be included in the dataset. In addition, the view counts are floored to a multiple of 100.

Datasets

The datasets published in this release are derived from a reader session dataset generated by the code in this notebook with the filtering described above. The raw reader session data itself will not be publicly available due to privacy considerations. The datasets described below are similar to the pageviews and clickstream data that the Wikimedia Foundation publishes already, with the addition of the country-specific counts.

COVID-19 pageviews

The file covid_pageviews.tsv contains:
- pageview counts for COVID-19 related pages, aggregated by week and country
- k-anonymity threshold of 100
- example: In the 13th week of 2020 (23 March - 29 March 2020), the page 'Pandémie_de_Covid-19_en_Italie' on French Wikipedia was visited 11700 times from readers in Belgium
- as a control bucket, we include pageview counts to all pages aggregated by week and country. Due to privacy considerations during the collection of the data, the control bucket was sampled at ~1% of all view traffic. The view counts for the control title are thus proportional to the total number of pageviews to all pages.

The file is ~8 MB and contains ~134000 data points across the 27 weeks, 108 countries, and 168 projects.

Covid reader session bigrams

The file covid_session_bigrams.tsv contains:
- number of occurrences of visits to pages A -> B, where either A or B is a COVID-19 related article. Note that the bigrams are tuples (from, to) of articles viewed in succession; the underlying mechanism can be clicking on a link in an article, but it may also have been a new search or reading both articles based on links from third source articles. In contrast, the clickstream data is based on referral information only
- aggregated by month and country
- k-anonymity threshold of 100
- example: In March of 2020, there were 1000 occurrences of readers accessing the page es.wikipedia/SARS-CoV-2 followed by es.wikipedia/Orthocoronavirinae from Chile

The file is ~10 MB and contains ~90000 bigrams across the 6 months, 96 countries, and 56 projects.

Contact

Please reach out to research-feedback@wikimedia.org for any questions.
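
The k-anonymity threshold and flooring described in the privacy section can be illustrated with a few lines of arithmetic. This is a minimal sketch under the stated rules (drop counts below 100, floor the rest to a multiple of 100); the function name and the sample counts are made up, and the release's actual pipeline is not reproduced here.

```python
def anonymize_count(count, k=100):
    """Apply a k-anonymity threshold, then floor to a multiple of k.

    Returns None when the count falls below the threshold, meaning the
    row would be omitted from the published data.
    """
    if count < k:
        return None
    return (count // k) * k


# Made-up raw counts for illustration only.
raw_counts = [99, 100, 11763]
published = [anonymize_count(c) for c in raw_counts]
print(published)
```

Because of the flooring, published values understate the raw counts by up to 99 views, which is one reason the counts are better read as comparable trends than as exact traffic.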