92 datasets found

Wikipedia: most viewed articles in 2024
statista.com
Updated Dec 4, 2024
Cite
Statista (2024). Wikipedia: most viewed articles in 2024 [Dataset]. https://www.statista.com/statistics/1358978/wikipedia-most-viewed-articles-by-number-of-views/
Explore at:
Dataset updated
Dec 4, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2024
Area covered
Worldwide
Description
The most viewed English-language article on Wikipedia in 2024 was Deaths in 2024, with a total of 44.4 million views. Political topics also dominated the list, with articles related to the 2024 U.S. presidential election and key political figures like Kamala Harris and Donald Trump ranking among the top ten most viewed pages.

Wikipedia's language diversity

As of December 2024, the English Wikipedia subdomain contained approximately 6.91 million articles, making it the largest in terms of content and registered active users. Interestingly, the Cebuano language ranked second with around 6.11 million entries, although many of these articles are reportedly generated by bots. German and French followed as the next most populous European language subdomains, each with over 18,000 active users. Compared to the rest of the internet, as of January 2024, English was the primary language for over 52 percent of websites worldwide, far outpacing Spanish at 5.5 percent and German at 4.8 percent.

Global traffic to Wikipedia.org

Hosted by the Wikimedia Foundation, Wikipedia.org saw around 4.4 billion unique global visits in March 2024, a slight decrease from 4.6 billion visitors in January. In addition, as of January 2024, Wikipedia ranked among the top ten websites with the most referring subnets worldwide.
Most visited Wikipedia pages in the U.S. 2020, by visits
statista.com
Updated Jul 9, 2025
Cite
Statista (2025). Most visited Wikipedia pages in the U.S. 2020, by visits [Dataset]. https://www.statista.com/statistics/1115251/most-visited-wikipedia-pages-usa/
Explore at:
Dataset updated
Jul 9, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 2020
Area covered
United States
Description
As of March 2020, the most visited Wikipedia page in the United States was "2020 Democratic party presidential primaries" with * million visits during the month. The second-most visited page was "2019-20 coronavirus pandemic" with *** million visits. A significant portion of the top visited Wikipedia pages in March are related to the global coronavirus pandemic.
Wikipedia.org: number of articles 2024, by language
statista.com
Updated Dec 4, 2024
Cite
Statista (2024). Wikipedia.org: number of articles 2024, by language [Dataset]. https://www.statista.com/statistics/1427961/wikipedia-org-articles-language/
Explore at:
Dataset updated
Dec 4, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2024
Area covered
Worldwide
Description
As of December 2024, the English subdomain of Wikipedia had around 6.91 million articles published, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with over 2.96 million and 2.65 million entries, respectively. The only Asian language among the top 10, Cebuano had the second-most articles on the portal, with around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.

Extended Wikipedia Multimodal Dataset

kaggle.com

Updated Apr 4, 2020

Cite

Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023

Explore at:

Unique identifier

https://doi.org/10.34740/kaggle/dsv/1058023

Dataset updated

Apr 4, 2020

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Oleh Onyshchak

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Wikipedia Featured Articles multimodal dataset

Overview

This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
Its superset of good articles is also hosted on Kaggle. It has six times more entries although with a little worse quality.

It contains the text of an article and also all the images from that article along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus it's the best theoretical quality human editors on Wikipedia can offer.

You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

Dataset structure

The high-level structure of the dataset is as follows:

.
+-- page1 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
+-- page2 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
: 
+-- pageN 
|  +-- text.json 
|  +-- img 
|    +-- meta.json

label	description
pageN	is the title of N-th Wikipedia page and contains all information about the page
text.json	text of the page saved as JSON. Please refer to the details of JSON schema below.
meta.json	a collection of all images of the page. Please refer to the details of JSON schema below.
imageN	is the N-th image of an article, saved in `jpg` format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title.

text.JSON Schema

Below you see an example of how data is stored:

{
 "title": "Naval Battle of Guadalcanal",
 "id": 405411,
 "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
 "html": "...",
 "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
}

key	description
title	page title
id	unique page id
url	url of a page on Wikipedia
html	HTML content of the article
wikitext	wikitext content of the article

Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
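
As a rough illustration of the layout described above, the following sketch walks a local copy of the dataset and reads each page's text.json. The function names `iter_pages` and `titles_and_lengths` are hypothetical helpers, not part of the dataset, and the sketch assumes every page directory holds a text.json with the keys listed in the table.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_pages(root: Path) -> Iterator[dict]:
    """Yield the parsed text.json of every page directory under `root`."""
    for page_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        text_file = page_dir / "text.json"
        if not text_file.is_file():
            continue  # skip directories that lack the expected file
        with text_file.open(encoding="utf-8") as fh:
            yield json.load(fh)


def titles_and_lengths(root: Path) -> list[tuple[str, int]]:
    """Return (title, wikitext length) pairs using only the documented keys."""
    return [(page["title"], len(page["wikitext"])) for page in iter_pages(root)]
```

Because html and wikitext carry the same information, a consumer can read just one of them; the sketch uses wikitext only as an example.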

meta.JSON Schema

{
 "img_meta": [
  {
   "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
   "title": "IronbottomSound.jpg",
   "parsed_title": "ironbottom sound",
   "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
   "is_icon": False,
   "on_commons": True,
   "description": "A U.S. destroyer steams up what later became known as ...",
   "caption": "Ironbottom Sound. The majority of the warship surface ...",
   "headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
   "features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
   },
   ...
  ]
}

key	description
filename	unique image id, md5 hashcode of original image title
title	image title retrieved from Commons, if applicable
parsed_title	image title split into words, i.e. "helloWorld.jpg" -> "hello world"
url	url of an image on Wikipedia
is_icon	True if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it
on_commons	True if image is available from Wikimedia Commons dataset
description	description of an image parsed from Wikimedia Commons page, if available
caption	caption of an image parsed from Wikipedia article, if available
headings	list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading
features	output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in `jpeg` format with fixed width of 600px. Practically, it is a list of floats with len = 2048
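
The sketch below shows one way to work with the fields described in this table, given a meta.json that has already been parsed into a Python dict. Both function names are hypothetical. The example above stores features as strings, so the sketch converts them to floats; the pooling helper assumes the max is taken over the two spatial axes, which is what a reduction from (19, 24, 2048) to (2048,) implies.

```python
def feature_vectors(meta: dict) -> dict[str, list[float]]:
    """Map filename -> feature vector (as floats) for every non-icon image."""
    vectors = {}
    for img in meta["img_meta"]:
        if img.get("is_icon"):
            continue  # icons are skipped; they rarely carry article content
        vectors[img["filename"]] = [float(x) for x in img["features"]]
    return vectors


def global_max_pool(feature_map: list) -> list:
    """Reduce an H x W x C nested list to a C-length list by spatial max."""
    channels = len(feature_map[0][0])
    return [
        max(cell[c] for row in feature_map for cell in row)
        for c in range(channels)
    ]
```

The pooling function is pure Python for readability; on real ResNet outputs an array library would be the practical choice.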

Collection method

Data was collected by fetching the text and image content of featured articles with the pywikibot library and then parsing additional metadata from HTML pages on Wikipedia and Commons.

Kaggle Data Collection Notebook...

Wikipedia Article Titles
kaggle.com
Updated Sep 22, 2017
Cite
Aleksey Bilogur (2017). Wikipedia Article Titles [Dataset]. https://www.kaggle.com/residentmario/wikipedia-article-titles/kernels
Explore at:
Dataset updated
Sep 22, 2017
Dataset provided by
Kaggle
Authors
Aleksey Bilogur
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Wikipedia, the world's largest encyclopedia, is a crowdsourced open knowledge project and website with millions of individual web pages. This dataset is a grab of the title of every article on Wikipedia as of September 20, 2017.

Content

This dataset is a simple newline (\n) delimited list of article titles. No distinction is made between redirects (like Schwarzenegger) and actual article pages (like Arnold Schwarzenegger).
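
A newline-delimited list like this can be summarised in a few lines. The sketch below is only illustrative: `title_stats` is a hypothetical helper, and it assumes that underscores separate words within a title, which may not match the actual file.

```python
from collections import Counter
from typing import Iterable


def title_stats(lines: Iterable[str], top_n: int = 3) -> dict:
    """Summarise an iterable of newline-terminated article titles."""
    titles = [line.rstrip("\n") for line in lines if line.strip()]
    # Assumption: underscores act as word separators inside a title.
    tokens = Counter(
        token.lower()
        for title in titles
        for token in title.replace("_", " ").split()
    )
    return {
        "count": len(titles),
        "longest": max(titles, key=len),
        "shortest": min(titles, key=len),
        "top_tokens": tokens.most_common(top_n),
    }
```

Passing an open file handle as `lines` would process the full list without further changes.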

Acknowledgements

This dataset was created by scraping Special:AllPages on Wikipedia. It was originally shared here.

Inspiration

What are common article title tokens? How do they compare against frequent words in the English language?

What is the longest article title? The shortest?

What countries are most popular within article titles?
A Comprehensive Dataset of Classified Citations with Identifiers from...
zenodo.org
zip
Updated Jul 5, 2023
Cite
Natallia Kokash; Giovanni Colavizza (2023). A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023) [Dataset]. http://doi.org/10.5281/zenodo.8107239
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8107239
Dataset updated
Jul 5, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Natallia Kokash; Giovanni Colavizza
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset of 40.664.485 citations extracted from English Wikipedia February 2023 dump (https://dumps.wikimedia.org/enwiki/20230220/).

Version 1: en_citations.zip is a dataset of extracted citations

Version 2: en_final.zip is the same dataset with classified citations augmented with identifiers

The fields are as follows:

type_of_citation - Wikipedia template type used to define the citation, e.g., 'cite journal', 'cite news', etc.

page_title - title of the Wikipedia article from which the citation was extracted.

Title - source title, e.g., title of the book, newspaper article, etc.

URL - link to the source, e.g., webpage where news article was published, description of the book at the publisher's website, online library webpage, etc.

tld - top link domain extracted from the URL, e.g., 'bbc' for https://www.bbc.co.uk/...

Authors - list of article or book authors, if available.

ID_list - list of publication identifiers mentioned in the citation, e.g., DOI, ISBN, etc.

citations - citation text as used in Wikipedia code

actual_label - 'book', 'journal', 'news', or 'other' label assigned based on the analysis of citation identifiers or top link domain.

acquired_ID_list - identifiers located via Google Books and Crossref APIs for citations which are likely to refer to books or journals, i.e., defined using 'cite book', 'cite journal', 'cite encyclopedia', and 'cite proceedings' templates.

The total number of news: 9.926.598

The total number of books: 2.994.601

The total number of journals: 2.052.172

Augmented with IDs via lookup 929.601 (out of 2.445.913 book, journal, encyclopedia, and proceedings template citations not classified as books or journals via given identifiers).
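
To put the totals above in proportion, the following sketch turns them into shares. The constants are copied from the description; treating every remaining citation as 'other' is an assumption based on the four values listed for actual_label, and `label_shares` is a hypothetical helper.

```python
TOTAL_CITATIONS = 40_664_485
LABEL_COUNTS = {"news": 9_926_598, "book": 2_994_601, "journal": 2_052_172}
LOOKUP_CANDIDATES = 2_445_913  # template citations not classified via given identifiers
LOOKUP_AUGMENTED = 929_601     # of those, citations augmented with IDs via lookup


def label_shares(total: int, counts: dict[str, int]) -> dict[str, float]:
    """Return each label's fraction of all citations, plus the remainder."""
    shares = {label: n / total for label, n in counts.items()}
    # Assumption: everything not counted above carries the 'other' label.
    shares["other"] = 1.0 - sum(shares.values())
    return shares


shares = label_shares(TOTAL_CITATIONS, LABEL_COUNTS)
lookup_rate = LOOKUP_AUGMENTED / LOOKUP_CANDIDATES
```

Under these figures, news citations make up roughly a quarter of the dataset, and the lookup step found identifiers for a bit more than a third of its candidates.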

The source code to extract citations can be found here: https://github.com/albatros13/wikicite.

The code is a fork of the earlier project on Wikipedia citation extraction: https://github.com/Harshdeep1996/cite-classifications-wiki.
Total global visitor traffic to Wikipedia.org 2024
statista.com
ai-chatbox.pro
Updated Nov 11, 2024
Cite
Statista (2024). Total global visitor traffic to Wikipedia.org 2024 [Dataset]. https://www.statista.com/statistics/1259907/wikipedia-website-traffic/
Explore at:
Dataset updated
Nov 11, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Oct 2023 - Mar 2024
Area covered
Worldwide
Description
In March 2024, close to 4.4 billion unique global visitors had visited Wikipedia.org, slightly down from 4.6 billion visitors in January of the same year. Wikipedia is a free online encyclopedia with articles generated by volunteers worldwide. The platform is hosted by the Wikimedia Foundation.
Pairwise Multi-Class Document Classification for Semantic Relations between...
live.european-language-grid.eu
csv
Updated Apr 15, 2024
+ more versions
Cite
(2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
Explore at:
Available download formats: csv
Dataset updated
Apr 15, 2024
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
Datasets (articles, class labels, cross-validation splits)
Pretrained models (Transformers, GloVe, Doc2vec)
Model output (prediction) for the best performing models
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as XML dump, and the corresponding articles were extracted as plain-text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.
The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
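
Once train_testdata_4folds.tar.gz has been extracted, the four folds can be read with a small loop. This is a minimal sketch: `load_folds` is a hypothetical helper, and because the column layout of the CSV files is not described here, it returns raw rows without assuming a header.

```python
import csv
from pathlib import Path


def read_rows(path: Path) -> list[list[str]]:
    """Read a CSV file into a list of raw rows (no header assumed)."""
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.reader(fh))


def load_folds(root: Path, k: int = 4) -> list[tuple[list, list]]:
    """Return [(train_rows, test_rows), ...] for fold directories 1..k."""
    folds = []
    for i in range(1, k + 1):
        fold_dir = root / str(i)
        folds.append((read_rows(fold_dir / "train.csv"),
                      read_rows(fold_dir / "test.csv")))
    return folds
```

Each tuple then corresponds to one train/test split of the stratified cross-validation.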
Wikipedia Articles and Associated WikiProject Templates
figshare.com
txt
Updated May 31, 2023
Cite
Isaac Johnson; Aaron Halfaker (2023). Wikipedia Articles and Associated WikiProject Templates [Dataset]. http://doi.org/10.6084/m9.figshare.10248344.v4
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.10248344.v4
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Isaac Johnson; Aaron Halfaker
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
== wikiproject_to_template.halfak_20191202.yaml ==
The mapping of the canonical names of WikiProjects to all the templates that might be used to tag an article with this WikiProject, as used for generating this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates:
* https://en.wikipedia.org/wiki/Template:WikiProject_Trade
* https://en.wikipedia.org/wiki/Template:WikiProject_trade
* https://en.wikipedia.org/wiki/Template:Wptrade

== wikiproject_taxonomy.halfak_20191202.yaml ==
A proposed mapping of WikiProjects to higher-level categories. This mapping has not been applied to the JSON dump contained here. It is based on the WikiProjects' canonical names.

== gather_wikiprojects_per_article.py ==
Old Python script that built the JSON dump described below for English Wikipedia based on wikitext/wikidata dumps (slow and more prone to errors).

== gather_wikiprojects_per_article_pageassessments.py ==
New Python script to build the JSON dump described below. It uses the PageAssessments MediaWiki table in MariaDB, so it is much faster and can handle languages beyond English much more easily.

== labeled_wiki_with_topics_metadata.json.bz2 ==
Each line of this bzipped JSON file corresponds to a Wikipedia article in that language (currently Arabic, English, French, Hungarian, Turkish). The intended usage of this JSON file is to build topic classification models for Wikipedia articles. While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages are much more sparse in their labels because they do not cover any WikiProjects in that language that don't have English equivalents (per Wikidata). The other languages are probably best used to supplement the English labels or as a separate test set that might have a different topic distribution.

The following properties are recorded:
* title: Wikipedia article title in that language
* article_revid: Most recent revision ID associated with the article for which a WikiProject assessment was made (might not be current revision ID)
* talk_pid: Page ID corresponding with the talk page for the Wikipedia article
* talk_revid: Most recent revision ID associated with the talk page for which a WikiProject assessment was made (might not be current revision ID)
* wp_templates: List of WikiProject templates from the page_assessments table.
* qid: Wikidata ID corresponding to the Wikipedia article
* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs.
* topics: topic labels associated with the article based on its WikiProject templates and the WikiProjectLabel mapping (wikiproject_taxonomy)

This version is based on the 24 May 2020 page_assessment tables and 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Of note in comparison to previous versions of this file, the revision IDs are now the revision IDs that were most recently assessed by a WikiProject, not the current versions of the page. The sitelinks are now given as page IDs, which are more stable and less prone to encoding issues etc. The WikiProject templates are now pulled via the MediaWiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.

For example, here is the line for Agatha Christie from the English JSON file:
{'title': 'Agatha_Christie','article_revid': 958377791, 'talk_pid': 1001, 'talk_revid': 958103309, 'wp_templates': ["Women","Women's History","Women writers","Biography","Novels/Crime task force","Novels","Biography/science and academia work group","Biography/arts and entertainment work group","Devon","Archaeology/Women in archaeology task force","Archaeology"], 'qid': 'Q35064', 'sitelinks': { 'afwiki': 19274, 'amwiki': 47582, 'anwiki': 115127, 'arwiki': 12886, ...'enwiki': 984,... 'zhwiki': 10983, 'zh_min_nanwiki': 21828, 'zh_yuewiki': 131652}}
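
A file with one JSON object per line can be streamed without decompressing it to disk first. The sketch below counts how often each WikiProject template appears. `count_templates` is a hypothetical helper, and the sketch assumes each line in the file is standard JSON; the Agatha Christie example above is printed with single quotes, so the on-disk format should be checked before relying on this.

```python
import bz2
import json
from collections import Counter


def count_templates(path: str) -> Counter:
    """Count wp_templates occurrences across a bzipped JSON-lines file."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # assumes strict JSON on every line
            counts.update(record.get("wp_templates", []))
    return counts
```

Reading line by line keeps memory use flat, which matters for a dump that covers a whole language edition.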
Leading websites worldwide 2024, by monthly visits
statista.com
ai-chatbox.pro
+19more
Updated Mar 24, 2025
+ more versions
Cite
Statista (2025). Leading websites worldwide 2024, by monthly visits [Dataset]. https://www.statista.com/statistics/1201880/most-visited-websites-worldwide/
Explore at:
Dataset updated
Mar 24, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Nov 2024
Area covered
Worldwide
Description
In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period.

The internet leaders: search, social, and e-commerce

Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world’s most popular websites for user-generated content, solidifying Alphabet’s and Meta’s leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content?

Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet’s engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
Wikipedia
rrid.site
scicrunch.org
Updated Jul 20, 2025
Cite
(2025). Wikipedia [Dataset]. http://identifiers.org/RRID:SCR_004897
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004897 https://identifiers.org/RRID:SCR_004897/resolver
Dataset updated
Jul 20, 2025
Description
Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 19 million articles (over 3.6 million in English) have been written collaboratively by volunteers around the world, and almost all of its articles can be edited by anyone with access to the site. As of July 2011, there were editions of Wikipedia in 282 languages. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and has become the largest and most popular general reference work on the Internet, ranking around seventh among all websites on Alexa and having 365 million readers. The name Wikipedia was coined by Larry Sanger and is a combination of wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning quick) and encyclopedia. Wikipedia's departure from the expert-driven style of encyclopedia building and the large presence of unacademic content have been noted several times. Some have noted the importance of Wikipedia not only as an encyclopedic reference but also as a frequently updated news resource because of how quickly articles about recent events appear. Although the policies of Wikipedia strongly espouse verifiability and a neutral point of view, critics of Wikipedia accuse it of systemic bias and inconsistencies (including undue weight given to popular culture), and allege that it favors consensus over credentials in its editorial processes. Its reliability and accuracy are also targeted. A 2005 investigation in Nature showed that the science articles they compared came close to the level of accuracy of Encyclopedia Britannica and had a similar rate of serious errors.
WikiCSSH - Computer Science Subject Headings from Wikipedia
databank.illinois.edu
Updated Sep 13, 2020
Cite
Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner (2020). WikiCSSH - Computer Science Subject Headings from Wikipedia [Dataset]. http://doi.org/10.13012/B2IDB-0424970_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-0424970_V1
Dataset updated
Sep 13, 2020
Authors
Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WikiCSSH

If you are using WikiCSSH please cite the following:
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. “WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia.” In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1

Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1

More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH

This folder contains the following files:
WikiCSSH_categories.csv - Categories in WikiCSSH
WikiCSSH_category_links.csv - Links between categories in WikiCSSH
Wikicssh_core_categories.csv - Core categories as mentioned in the paper
WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called
Top-5 Wikipedia pages selected by the LASSO models for the influenza seasons...
plos.figshare.com
xls
Updated Jun 6, 2023
Cite
Giovanni De Toni; Cristian Consonni; Alberto Montresor (2023). Top-5 Wikipedia pages selected by the LASSO models for the influenza seasons from 2015 to 2019 for all the examined countries. [Dataset]. http://doi.org/10.1371/journal.pone.0256858.t007
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0256858.t007
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS ONE
Authors
Giovanni De Toni; Cristian Consonni; Alberto Montresor
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We report only the models which performed better (see Table 3 for the best models). For each model, we report the page name, the shortest-path distance DI between the page and the corresponding “Influenza” page, and the Pearson Correlation Coefficient (PCC) measured against the influenza incidence. We also report the corresponding page in the English Wikipedia in parentheses. We used the value NE to specify when a page has no English equivalent. The value DI > 3 indicates that the page is more than three hops away from the “Influenza” page.
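
For readers unfamiliar with the PCC mentioned above, here is a minimal sketch of how a Pearson correlation between a page's view counts and an incidence series could be computed. The function name `pearson` and any input series are purely illustrative; nothing here reproduces the authors' pipeline.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1 means page views rise and fall together with incidence, while a value near -1 means they move in opposite directions.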
Wikipedia Generation Dataset
paperswithcode.com
opendatalab.com
Updated Feb 7, 2021
Cite
Peter J. Liu; Mohammad Saleh; Etienne Pot; Ben Goodrich; Ryan Sepassi; Lukasz Kaiser; Noam Shazeer (2021). Wikipedia Generation Dataset [Dataset]. https://paperswithcode.com/dataset/wikipedia-generation
Explore at:
Dataset updated
Feb 7, 2021
Authors
Peter J. Liu; Mohammad Saleh; Etienne Pot; Ben Goodrich; Ryan Sepassi; Lukasz Kaiser; Noam Shazeer
Description
Wikipedia Generation is a dataset for article generation from Wikipedia from references at the end of Wikipedia page and the top 10 search results for the Wikipedia topic.
File S1 - Highlighting Entanglement of Cultures via Ranking of Multilingual...
plos.figshare.com
pdf
Updated May 31, 2023
Cite
Young-Ho Eom; Dima L. Shepelyansky (2023). File S1 - Highlighting Entanglement of Cultures via Ranking of Multilingual Wikipedia Articles [Dataset]. http://doi.org/10.1371/journal.pone.0074554.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0074554.s001
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Young-Ho Eom; Dima L. Shepelyansky
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Presents Figures S1, S2, S3 in SI file showing comparison between probability distributions over activity fields and language for top 30 and 100 persons for EN, IT, NK respectively; tables S1, S2, … S27 in SI file showing top 30 persons in PageRank, CheiRank and 2DRank for all 9 Wikipedia editions. All names are given in English. Supplementary methods, tables, ranking lists and figures are available at http://www.quantware.ups-tlse.fr/QWLIB/wikiculturenetwork/; data sets of 9 hyperlink networks are available at [29] by a direct request addressed to S.Vigna. (PDF)
Example of list of top 10 persons by PageRank for English Wikipedia with...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Young-Ho Eom; Dima L. Shepelyansky (2023). Example of list of top 10 persons by PageRank for English Wikipedia with their field of activity and native language. [Dataset]. http://doi.org/10.1371/journal.pone.0074554.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0074554.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Young-Ho Eom; Dima L. Shepelyansky
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example of list of top 10 persons by PageRank for English Wikipedia with their field of activity and native language.
Wikipedia Category Granularity (WikiGrain) data
zenodo.org
csv, txt
Updated Jan 24, 2020
Cite
Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
Explore at:
Available download formats: txt, csv
Unique identifier
https://doi.org/10.5281/zenodo.1005175
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jürgen Lerner
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

The WikiGrain Data is analyzed in the paper

Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

===============================================================
Individual files (tables in comma-separated-values-format):

---------------------------------------------------------------
* article_info.csv contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.

- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
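
As a rough illustration of the "granularity" definition above, the following Python sketch computes category depths by breadth-first search from the root and then averages them over an article's categories. The toy category graph, the category names other than the root, and the function names are hypothetical; this is not the authors' code, only one straightforward reading of the definition.

```python
from collections import deque


def category_depths(children, root):
    """Shortest-path distance (in parent-child links) from root to each reachable category."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in children.get(cat, []):
            if child not in depth:  # first visit in BFS order is the shortest path
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth


def article_granularity(article_categories, depth):
    """Mean depth of the categories an article belongs to (None if none is reachable)."""
    depths = [depth[c] for c in article_categories if c in depth]
    return sum(depths) / len(depths) if depths else None


# Hypothetical toy subcategory network: parent -> list of child categories.
children = {
    "Articles": ["Science", "Arts"],
    "Science": ["Biology"],
    "Biology": ["Genetics"],
    "Arts": ["Music"],
}

depth = category_depths(children, "Articles")
granularity = article_granularity(["Genetics", "Science"], depth)
print(granularity)
```

An article filed only under deep categories gets a high value, which matches the statement that higher granularity indicates narrower topics.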

---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLCs. An article can thus be a member of several TLCs.
The file contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
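
The "closest TLC" rule described for this file can be sketched as a level-by-level upward search through the category hierarchy: the first level at which any TLC appears gives the minimal distance, and every TLC found at that level is kept, so ties yield several memberships. The graph, the category names, and the function name below are hypothetical, and treating an article's own categories as level zero is an assumption.

```python
def closest_tlcs(article_categories, parents, tlcs):
    """Return the set of top-level categories at minimal distance from the article."""
    frontier = set(article_categories)
    seen = set(frontier)
    while frontier:
        hits = frontier & tlcs
        if hits:  # first level containing a TLC is the minimal distance
            return hits
        next_frontier = set()
        for cat in frontier:
            for parent in parents.get(cat, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.add(parent)
        frontier = next_frontier
    return set()  # no TLC is reachable from the article's categories


# Hypothetical toy hierarchy: child category -> list of parent categories.
parents = {
    "Genetics": ["Biology"],
    "Biology": ["Science"],
    "Bioethics": ["Biology", "Ethics"],
    "Ethics": ["Philosophy"],
}
tlcs = {"Science", "Philosophy"}

tied = closest_tlcs(["Bioethics"], parents, tlcs)
single = closest_tlcs(["Genetics", "Ethics"], parents, tlcs)
print(tied, single)
```

The first call reaches both TLCs at the same distance and returns both, which is how an article ends up with several rows in the file.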

---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
The file contains the following variables:

- "id"
Article id.

- "is.FA"
Boolean indicator for whether the article is featured.

- "log1p.length"
Length measured by the number of bytes.

- "age"
Age measured by the time since the first edit.

- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.

- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.

- "log1p.number.of.contributors"
Number of unique contributors to the article.

- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').

- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').

- "number.of.level.1.sections"
Number of first level sections in the article.

- "number.of.level.2.sections"
Number of second level sections in the article.

- "number.of.categories"
Number of categories the article is in.

- "log1p.average.size.of.categories"
Average size of the categories the article is in.

- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.

- "log1p.number.of.external.references"
Number of external references given in the article.

- "log1p.number.of.images"
Number of images in the article.

- "log1p.number.of.templates"
Number of templates that the article uses.

- "log1p.number.of.inter.language.links"
Number of links to articles in different language editions of Wikipedia.

- "granularity"
As in article_info.csv (but normalized to standard deviation one).
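
To make the transformation concrete, here is a minimal Python sketch of the mapping x --> log(1+x) followed by scaling to standard deviation one. The description does not say whether the variables are also mean-centered or which standard deviation estimator was used, so dividing by the population standard deviation without centering is an assumption, and the function name and sample values are hypothetical.

```python
import math
import statistics


def log1p_scale(values):
    """Apply x -> log(1 + x), then divide by the standard deviation of the result."""
    logged = [math.log1p(v) for v in values]
    sd = statistics.pstdev(logged)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("cannot scale a constant variable to unit standard deviation")
    return [x / sd for x in logged]


# Hypothetical right-skewed counts, e.g. numbers of edits per article.
scaled = log1p_scale([0, 1, 10, 100])
print(scaled)
print(statistics.pstdev(scaled))
```

Because log(1+0) is 0, articles with a zero count stay at zero after the transformation, which is the usual reason for preferring log(1+x) over log(x) for count data.
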
Wikipedia paths - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 22, 2023
Cite
The citation is currently not available for this dataset.
Explore at:
Dataset updated
Oct 22, 2023
Description
Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.
Wikipedia Clickstream
figshare.com
application/gzip
Updated Jul 20, 2023
Cite
Ellery Wulczyn; Dario Taraborelli (2023). Wikipedia Clickstream [Dataset]. http://doi.org/10.6084/m9.figshare.1305770.v7
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.1305770.v7
Dataset updated
Jul 20, 2023
Dataset provided by
figshare
Authors
Ellery Wulczyn; Dario Taraborelli
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About

This dataset contains counts of (referer, article) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included in the request in an HTTP header called the "referer". This data captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

Data Preparation

- The dataset only includes requests to articles in the main namespace of the desktop version of English Wikipedia (see https://en.wikipedia.org/wiki/Wikipedia:Namespace)
- Requests to MediaWiki redirects are excluded
- Spider traffic was excluded using the ua-parser library (https://github.com/tobie/ua-parser)
- Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources of English Wikipedia, based on this scheme:
  - an article in the main namespace of English Wikipedia -> the article title
  - any Wikipedia page that is not in the main namespace of English Wikipedia -> 'other-wikipedia'
  - an empty referer -> 'other-empty'
  - a page from any other Wikimedia project -> 'other-internal'
  - Google -> 'other-google'
  - Yahoo -> 'other-yahoo'
  - Bing -> 'other-bing'
  - Facebook -> 'other-facebook'
  - Twitter -> 'other-twitter'
  - anything else -> 'other'
  For the exact mapping see https://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql#L30-L48
- (referer, article) pairs with 10 or fewer observations were removed from the dataset

Note: When a user requests a page through the search bar, the page the user searched from is listed as a referer. Hence, the data contains '(referer, article)' pairs for which the referer does not contain a link to the article. For an example, consider the '(Wikipedia, Chris_Kyle)' pair. Users went to the 'Wikipedia' article to search for Chris Kyle within English Wikipedia.

Applications

This data can be used for various purposes:

- determining the most frequent links people click on for a given article
- determining the most common links people followed to an article
- determining how much of the total traffic to an article clicked on a link in that article
- generating a Markov chain over English Wikipedia

Format:

- prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on
- curr_id: the MediaWiki unique page ID of the article the client requested
- n: the number of occurrences of the '(referer, article)' pair
- prev_title: the result of mapping the referer URL to the fixed set of values described above
- curr_title: the title of the article the client requested
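
As a sketch of the Markov chain application, the Python below turns rows with the five fields listed above into per-article transition probabilities, keeping only rows whose prev_id is non-empty (article-to-article traffic). The tab-separated layout with a header row, the sample rows and counts, and the function name are assumptions made for illustration, not details confirmed by the description.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed tab-separated layout with a header row.
SAMPLE = (
    "prev_id\tcurr_id\tn\tprev_title\tcurr_title\n"
    "\t200\t50\tother-google\tB\n"
    "100\t200\t30\tA\tB\n"
    "100\t300\t10\tA\tC\n"
)


def transition_probabilities(rows):
    """Estimate P(next article | current article) from (referer, article) counts."""
    counts = defaultdict(dict)
    for row in rows:
        if not row["prev_id"]:  # empty prev_id: referer is not a main-namespace article
            continue
        targets = counts[row["prev_title"]]
        targets[row["curr_title"]] = targets.get(row["curr_title"], 0) + int(row["n"])
    probs = {}
    for prev, targets in counts.items():
        total = sum(targets.values())
        probs[prev] = {curr: n / total for curr, n in targets.items()}
    return probs


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
probs = transition_probabilities(reader)
print(probs)
```

Since pairs with 10 or fewer observations were removed, these probabilities are conditional on the retained pairs only and will slightly overstate the share of popular links.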

License

All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

Source code

https://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql (MIT license)
Dataset of Higher Education Activities that Incorporate Wikipedia as a...
zenodo.org
Updated Jun 4, 2025
Cite
Flávia Varella; Danielly Figueredo (2025). Dataset of Higher Education Activities that Incorporate Wikipedia as a Pedagogical Tool [Dataset]. http://doi.org/10.5281/zenodo.15124862
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15124862
Dataset updated
Jun 4, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Flávia Varella; Danielly Figueredo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset compiles educational activities in higher education that incorporate Wikipedia as a pedagogical tool between 2003 and 2024. Each activity includes detailed information such as:

Wikipedia version used

Activity title

Description of the activity

Related discipline and course code (if applicable)

Dates of implementation (start and end)

City and country where the activity took place

Educational level and university

National Council for Scientific and Technological Development (CNPq) knowledge area classification

Names and usernames of supporting staff and responsible professors

Partnerships with other institutions or initiatives

Links to relevant outputs on Wikipedia, Wikimedia Commons, and external sites

Its goal is to organize these experiences to facilitate comparative analysis, identify best practices, and support the development of new educational projects involving open knowledge and active learning strategies.

The mapping table available on this page is the result of an extensive two-year research effort, involving the analysis of over 20,000 educational projects related to the use of Wikipedia in higher education. Despite the systematic effort and methodological rigor applied, the volume, diversity, and limitations in accessing consistent data ultimately compromised the final consolidation of the table, especially after the research funding ended. Therefore, we recommend that the table be consulted with caution and critical thinking, bearing in mind that some information may be incomplete or inaccurate. A broader contextualization of the results, as well as reflections on the challenges faced, can be found in the project's final report: https://w.wiki/DeBS .
