92 datasets found
  1. Wikipedia: most viewed articles in 2024

    • statista.com
    Updated Dec 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Wikipedia: most viewed articles in 2024 [Dataset]. https://www.statista.com/statistics/1358978/wikipedia-most-viewed-articles-by-number-of-views/
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2024
    Area covered
    Worldwide
    Description

    The most viewed English-language article on Wikipedia in 2023 was Deaths in 2024, with a total of 44.4 million views. Political topics also dominated the list, with articles related to the 2024 U.S. presidential election and key political figures like Kamala Harris and Donald Trump ranking among the top ten most viewed pages. Wikipedia's language diversity As of December 2024, the English Wikipedia subdomain contained approximately 6.91 million articles, making it the largest in terms of content and registered active users. Interestingly, the Cebuano language ranked second with around 6.11 million entries, although many of these articles are reportedly generated by bots. German and French followed as the next most populous European language subdomains, each with over 18,000 active users. Compared to the rest of the internet, as of January 2024, English was the primary language for over 52 percent of websites worldwide, far outpacing Spanish at 5.5 percent and German at 4.8 percent. Global traffic to Wikipedia.org Hosted by the Wikimedia Foundation, Wikipedia.org saw around 4.4 billion unique global visits in March 2024, a slight decrease from 4.6 billion visitors in January. In addition, as of January 2024, Wikipedia ranked amongst the top ten websites with the most referring subnets worldwide.

  2. Most visited Wikipedia pages in the U.S. 2020, by visits

    • statista.com
    Updated Jul 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Most visited Wikipedia pages in the U.S. 2020, by visits [Dataset]. https://www.statista.com/statistics/1115251/most-visited-wikipedia-pages-usa/
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Mar 2020
    Area covered
    United States
    Description

    As of March 2020, the most visited Wikipedia page in the United States was "2020 Democratic party presidential primaries" with * million visits during the month. The second-most visited page was "2019-20 coronavirus pandemic" with *** million visits. A significant portion of the top visited Wikipedia pages in March are related to the global coronavirus pandemic.

  3. Wikipedia.org: number of articles 2024, by language

    • statista.com
    Updated Dec 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Wikipedia.org: number of articles 2024, by language [Dataset]. https://www.statista.com/statistics/1427961/wikipedia-org-articles-language/
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2024
    Area covered
    Worldwide
    Description

    As of December 2023, the English subdomain of Wikipedia had around 6.91 million articles published, being the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with over 29.6 million and 26.5 million entries. Being the only Asian language figuring among the top 10, Cebuano was the language with the second-most articles on the portal, amassing around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.

  4. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    Updated Apr 4, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Oleh Onyshchak
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • Its superset of good articles is also hosted on Kaggle. It has six times more entries although with a little worse quality.

    It contains the text of an article and also all the images from that article along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus it's the best theoretical quality human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    labeldescription
    pageNis the title of N-th Wikipedia page and contains all information about the page
    text.jsontext of the page saved as JSON. Please refer to the details of JSON schema below.
    meta.jsona collection of all images of the page. Please refer to the details of JSON schema below.
    imageNis the N-th image of an article, saved in jpg format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "... 
    

    ...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }

    keydescription
    titlepage title
    idunique page id
    urlurl of a page on Wikipedia
    htmlHTML content of the article
    wikitextwikitext content of the article

    Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": False,
       "on_commons": True,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
       "features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
       },
       ...
      ]
    }
    
    keydescription
    filenameunique image id, md5 hashcode of original image title
    titleimage title retrieved from Commons, if applicable
    parsed_titleimage title split into words, i.e. "helloWorld.jpg" -> "hello world"
    urlurl of an image on Wikipedia
    is_iconTrue if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it
    on_commonsTrue if image is available from Wikimedia Commons dataset
    descriptiondescription of an image parsed from Wikimedia Commons page, if available
    captioncaption of an image parsed from Wikipedia article, if available
    headingslist of all nested headings of location where article is placed in Wikipedia article. The first element is top-most heading
    featuresoutput of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in jpeg format with fixed width of 600px. Practically, it is a list of floats with len = 2048

    Collection method

    Data was collected by fetching featured articles text&image content with pywikibot library and then parsing out a lot of additional metadata from HTML pages from Wikipedia and Commons.

  5. Wikipedia Article Titles

    • kaggle.com
    Updated Sep 22, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aleksey Bilogur (2017). Wikipedia Article Titles [Dataset]. https://www.kaggle.com/residentmario/wikipedia-article-titles/kernels
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 22, 2017
    Dataset provided by
    Kaggle
    Authors
    Aleksey Bilogur
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Wikipedia, the world's largest encyclopedia, is a crowdsourced open knowledge project and website with millions of individual web pages. This dataset is a grab of the title of every article on Wikipedia as of September 20, 2017.

    Content

    This dataset is a simple newline () delimited list of article titles. No distinction is made between redirects (like Schwarzenegger) and actual article pages (like Arnold Schwarzenegger).

    Acknowledgements

    This dataset was created by scraping Special:AllPages on Wikipedia. It was originally shared here.

    Inspiration

    • What are common article title tokens? How do they compare against frequent words in the English language?
    • What is the longest article title? The shortest?
    • What countries are most popular within article titles?
  6. A Comprehensive Dataset of Classified Citations with Identifiers from...

    • zenodo.org
    zip
    Updated Jul 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza (2023). A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023) [Dataset]. http://doi.org/10.5281/zenodo.8107239
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of 40.664.485 citations extracted from English Wikipedia February 2023 dump (https://dumps.wikimedia.org/enwiki/20230220/).

    Version 1: en_citations.zip is a dataset of extracted citations

    Version 2: en_final.zip is the same dataset with classified citations augmented with identifiers

    The fields are as follows:

    • type_of_citation - Wikipedia template type used to define the citation, e.g., 'cite journal', 'cite news', etc.
    • page_title - title of the Wikipedia article from which the citation was extracted.
    • Title - source title, e.g., title of the book, newspaper article, etc.
    • URL - link to the source, e.g., webpage where news article was published, description of the book at the publisher's website, online library webpage, etc.
    • tld - top link domain extracted from the URL, e.g., 'bbc' for https://www.bbc.co.uk/...
    • Authors - list of article or book authors, if available.
    • ID_list - list of publication identifiers mentioned in the citation, e.g., DOI, ISBN, etc.
    • citations - citation text as used in Wikipedia code
    • actual_label - 'book', 'journal', 'news', or 'other' label assigned based on the analysis of citation identifiers or top link domain.
    • acquired_ID_list - identifiers located via Google Books and Crossref APIs for citations which are likely to refer to books or journals, i.e., defined using 'cite book', 'cite journal', 'cite encyclopedia', and 'cite proceedings' templates.
    1. The total number of news: 9.926.598
    2. The total number of books: 2.994.601
    3. The total number of journals: 2.052.172
    4. Augmented with IDs via lookup 929.601 (out of 2.445.913 book, journal, encyclopedia, and proceedings template citations not classified as books or journals via given identifiers).

    The source code to extract citations can be found here: https://github.com/albatros13/wikicite.

    The code is a fork of the earlier project on Wikipedia citation extraction: https://github.com/Harshdeep1996/cite-classifications-wiki.

  7. Total global visitor traffic to Wikipedia.org 2024

    • statista.com
    • ai-chatbox.pro
    Updated Nov 11, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Total global visitor traffic to Wikipedia.org 2024 [Dataset]. https://www.statista.com/statistics/1259907/wikipedia-website-traffic/
    Explore at:
    Dataset updated
    Nov 11, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Oct 2023 - Mar 2024
    Area covered
    Worldwide
    Description

    In March 2024, close to 4.4 billion unique global visitors had visited Wikipedia.org, slightly down from 4.4 billion visitors since August of the same year. Wikipedia is a free online encyclopedia with articles generated by volunteers worldwide. The platform is hosted by the Wikimedia Foundation.

  8. E

    Pairwise Multi-Class Document Classification for Semantic Relations between...

    • live.european-language-grid.eu
    csv
    Updated Apr 15, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
    Explore at:
    csvAvailable download formats
    Dataset updated
    Apr 15, 2024
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93,
    which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    • Datasets (articles, class labels, cross-validation splits)
    • Pretrained models (Transformers, GloVe, Doc2vec)
    • Model output (prediction) for the best performing models

    This package consists of the Dataset part.

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as XML dump, and the corresponding articles were extracted as plain-text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.

    The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.

    ├── 1
    │  ├── test.csv
    │  └── train.csv
    ├── 2
    │  ├── test.csv
    │  └── train.csv
    ├── 3
    │  ├── test.csv
    │  └── train.csv
    └── 4
     ├── test.csv
     └── train.csv

    4 directories, 8 files

  9. f

    Wikipedia Articles and Associated WikiProject Templates

    • figshare.com
    txt
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Isaac Johnson; Aaron Halfaker (2023). Wikipedia Articles and Associated WikiProject Templates [Dataset]. http://doi.org/10.6084/m9.figshare.10248344.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Isaac Johnson; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    == wikiproject_to_template.halfak_20191202.yaml The mapping of the canonical names of WikiProjects to all the templates that might be used to tag an article with this WikiProject that was used for generating this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates:* https://en.wikipedia.org/wiki/Template:WikiProject_Trade* https://en.wikipedia.org/wiki/Template:WikiProject_trade* https://en.wikipedia.org/wiki/Template:Wptrade wikiproject_taxonomy.halfak_20191202.yaml A proposed mapping of WikiProjects to higher-level categories. This mapping has not been applied to the JSON dump contained here. It is based on the WikiProjects' canonical names. gather_wikiprojects_per_article.py Old Python script that built the JSON dump described below for English Wikipedia based on wikitext/wikidata dumps (slow and more prone to errors). gather_wikiprojects_per_article_pageassessments.py New Python script to build the JSON dump described below that uses the PageAssessments Mediawiki table in MariaDB and so is much faster and can handle languages beyond Enlgihs much more easily. labeled_wiki_with_topics_metadata.json.bz2 ==Each line of this bzipped JSON file corresponds with a Wikipedia article in that language (currently Arabic, English, French, Hungarian, Turkish). The intended usage of this JSON file is to build topic classification models for Wikipedia articles.While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages are much more sparse in their labels because they do not cover any WikiProjects in that language that don't have English equivalents (per Wikidata). The other languages are probably best used for supplementation of the English labels or a separate test set that might have a different topic distribution.The following properties are recorded:* title: Wikipedia article title in that language* article_revid: Most recent revision ID associated with the article for which a WikiProject asssessment was made (might not be current revision ID)* talk_pid: Page ID corresponding with the talk page for the Wikipedia article* talk_revid: Most recent revision ID associated with the talk page for which a WikiProject asssessment was made (might not be current revision ID)* wp_templates: List of WikiProject templates from the page_assessments table.* qid: Wikidata ID corresponding to the Wikipedia article* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs.* topics: topic labels associated with the article based on its WikiProject templates and the WikiProjectLabel mapping (wikiproject_taxonomy)This version is based on the 24 May 2020 page_assessment tables and 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Of note in comparison to previous versions of this file, the revision IDs are now that revision IDs that were most recently assessed by a WikiProject, not the current versions of the page. The sitelinks are now as page IDs, which are more stable and less prone to encoding issues etc. The WikiProject templates are now pulled via the Mediawiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.For example, here is the line for Agatha Christie from the English JSON file:{'title': 'Agatha_Christie','article_revid': 958377791, 'talk_pid': 1001, 'talk_revid': 958103309, 'wp_templates': ["Women","Women's History","Women writers","Biography","Novels/Crime task force","Novels","Biography/science and academia work group","Biography/arts and entertainment work group","Devon","Archaeology/Women in archaeology task force","Archaeology"], 'qid': 'Q35064', 'sitelinks': { 'afwiki': 19274, 'amwiki': 47582, 'anwiki': 115127, 'arwiki': 12886, ...'enwiki': 984,... 'zhwiki': 10983, 'zh_min_nanwiki': 21828, 'zh_yuewiki': 131652}}

  10. Leading websites worldwide 2024, by monthly visits

    • statista.com
    • ai-chatbox.pro
    • +19more
    Updated Mar 24, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Leading websites worldwide 2024, by monthly visits [Dataset]. https://www.statista.com/statistics/1201880/most-visited-websites-worldwide/
    Explore at:
    Dataset updated
    Mar 24, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Nov 2024
    Area covered
    Worldwide
    Description

    In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period. The internet leaders: search, social, and e-commerce Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world’s most popular websites for user generated content, solidifying Alphabet’s and Meta’s leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene. What is next for online content? Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet’s engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signal that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.

  11. r

    Wikipedia

    • rrid.site
    • scicrunch.org
    Updated Jul 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Wikipedia [Dataset]. http://identifiers.org/RRID:SCR_004897
    Explore at:
    Dataset updated
    Jul 20, 2025
    Description

    Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 19 million articles (over 3.6 million in English) have been written collaboratively by volunteers around the world, and almost all of its articles can be edited by anyone with access to the site. As of July 2011, there were editions of Wikipedia in 282 languages. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and has become the largest and most popular general reference work on the Internet, ranking around seventh among all websites on Alexa and having 365 million readers. The name Wikipedia was coined by Larry Sanger and is a combination of wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning quick) and encyclopedia. Wikipedia''s departure from the expert-driven style of encyclopedia building and the large presence of unacademic content has been noted several times. Some have noted the importance of Wikipedia not only as an encyclopedic reference but also as a frequently updated news resource because of how quickly articles about recent events appear. Although the policies of Wikipedia strongly espouse verifiability and a neutral point of view, critics of Wikipedia accuse it of systemic bias and inconsistencies (including undue weight given to popular culture), and allege that it favors consensus over credentials in its editorial processes. Its reliability and accuracy are also targeted. A 2005 investigation in Nature showed that the science articles they compared came close to the level of accuracy of Encyclopedia Britannica and had a similar rate of serious errors.

  12. I

    WikiCSSH - Computer Science Subject Headings from Wikipedia

    • databank.illinois.edu
    Updated Sep 13, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner (2020). WikiCSSH - Computer Science Subject Headings from Wikipedia [Dataset]. http://doi.org/10.13012/B2IDB-0424970_V1
    Explore at:
    Dataset updated
    Sep 13, 2020
    Authors
    Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WikiCSSH If you are using WikiCSSH please cite the following: > Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. “WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia.” In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf > Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1 Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1 More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH This folder contains the following files: WikiCSSH_categories.csv - Categories in WikiCSSH WikiCSSH_category_links.csv - Links between categories in WikiCSSH Wikicssh_core_categories.csv - Core categories as mentioned in the paper WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called

  13. f

    Top-5 Wikipedia pages selected by the LASSO models for the influenza seasons...

    • plos.figshare.com
    xls
    Updated Jun 6, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giovanni De Toni; Cristian Consonni; Alberto Montresor (2023). Top-5 Wikipedia pages selected by the LASSO models for the influenza seasons from 2015 to 2019 for all the examined countries. [Dataset]. http://doi.org/10.1371/journal.pone.0256858.t007
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Giovanni De Toni; Cristian Consonni; Alberto Montresor
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We report only the models which performed better (see Table 3 for the best models). For each model, we report the page name, the shortest-path distance DI between the page and the corresponding “Influenza” page, and the Pearson Correlation Coefficient (PCC) measured against the influenza incidence. We also report the corresponding page in the English Wikipedia in parentheses. We used the value NE to specify when a page has no English equivalent. The value DI > 3 indicates that the page is more than three hops away from the “Influenza” page.

  14. P

    Wikipedia Generation Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 7, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peter J. Liu; Mohammad Saleh; Etienne Pot; Ben Goodrich; Ryan Sepassi; Lukasz Kaiser; Noam Shazeer (2021). Wikipedia Generation Dataset [Dataset]. https://paperswithcode.com/dataset/wikipedia-generation
    Explore at:
    Dataset updated
    Feb 7, 2021
    Authors
    Peter J. Liu; Mohammad Saleh; Etienne Pot; Ben Goodrich; Ryan Sepassi; Lukasz Kaiser; Noam Shazeer
    Description

    Wikipedia Generation is a dataset for article generation from Wikipedia from references at the end of Wikipedia page and the top 10 search results for the Wikipedia topic.

  15. f

    File S1 - Highlighting Entanglement of Cultures via Ranking of Multilingual...

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Young-Ho Eom; Dima L. Shepelyansky (2023). File S1 - Highlighting Entanglement of Cultures via Ranking of Multilingual Wikipedia Articles [Dataset]. http://doi.org/10.1371/journal.pone.0074554.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Young-Ho Eom; Dima L. Shepelyansky
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Presents Figures S1, S2, S3 in SI file showing comparison between probability distributions over activity fields and language for top 30 and 100 persons for EN, IT, NK respectively; tables S1, S2, … S27 in SI file showing top 30 persons in PageRank, CheiRank and 2DRank for all 9 Wikipedia editions. All names are given in English. Supplementary methods, tables, ranking lists and figures are available at http://www.quantware.ups-tlse.fr/QWLIB/wikiculturenetwork/; data sets of 9 hyperlink networks are available at [29] by a direct request addressed to S.Vigna. (PDF)

  16. f

    Example of list of top 10 persons by PageRank for English Wikipedia with...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Young-Ho Eom; Dima L. Shepelyansky (2023). Example of list of top 10 persons by PageRank for English Wikipedia with their field of activity and native language. [Dataset]. http://doi.org/10.1371/journal.pone.0074554.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Young-Ho Eom; Dima L. Shepelyansky
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example of list of top 10 persons by PageRank for English Wikipedia with their field of activity and native language.

  17. Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Explore at:
    txt, csvAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jürgen Lerner; Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain Data is analyzed in the paper

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated-values-format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language edition of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).

  18. e

    Wikipedia paths - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Oct 22, 2023
    Description

    Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.

  19. f

    Wikipedia Clickstream

    • figshare.com
    application/gzip
    Updated Jul 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ellery Wulczyn; Dario Taraborelli (2023). Wikipedia Clickstream [Dataset]. http://doi.org/10.6084/m9.figshare.1305770.v7
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    figshare
    Authors
    Ellery Wulczyn; Dario Taraborelli
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About This dataset contains counts of (referer, article) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included in the request in an HTTP header called the "referer". This data captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015. Data Preparation- The dataset only includes requests to articles in the main namespace of the desktop version of English Wikipedia (see https://en.wikipedia.org/wiki/Wikipedia:Namespace) - Requests to MediaWiki redirects are excluded - Spider traffic was excluded using the ua-parser library (https://github.com/tobie/ua-parser) - Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources of English Wikipedia, based on this scheme: - an article in the main namespace of English Wikipedia -> the article title - any Wikipedia page that is not in the main namespace of English Wikipedia -> 'other-wikipedia' - an empty referer -> 'other-empty' - a page from any other Wikimedia project -> 'other-internal' - Google -> 'other-google' - Yahoo -> 'other-yahoo' - Bing -> 'other-bing' - Facebook -> 'other-facebook' - Twitter -> 'other-twitter' - anything else -> 'other' For the exact mapping see https://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql#L30-L48 - (referer, article) pairs with 10 or fewer observations were removed from the dataset Note: When a user requests a page through the search bar, the page the user searched from is listed as a referer. Hence, the data contains '(referer, article)' pairs for which the referer does not contain a link to the article. For an example, consider the '(Wikipedia, Chris_Kyle)' pair. Users went to the 'Wikipedia' article to search for Chris Kyle within English Wikipedia. ApplicationsThis data can be used for various purposes: - determining the most frequent links people click on for a given article- determining the most common links people followed to an article- determining how much of the total traffic to an article clicked on a link in that article- generating a Markov chain over English Wikipedia Format:- prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on- curr_id: the MediaWiki unique page ID of the article the client requested- n: the number of occurrences of the '(referer, article)' pair- prev_title: the result of mapping the referer URL to the fixed set of values described above- curr_title: the title of the article the client requested

    LicenseAll files included in this datasets are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/ Source codehttps://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql (MIT license)

  20. Dataset of Higher Education Activities that Incorporate Wikipedia as a...

    • zenodo.org
    Updated Jun 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Flávia Varella; Flávia Varella; Danielly Figueredo; Danielly Figueredo (2025). Dataset of Higher Education Activities that Incorporate Wikipedia as a Pedagogical Tool [Dataset]. http://doi.org/10.5281/zenodo.15124862
    Explore at:
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Flávia Varella; Flávia Varella; Danielly Figueredo; Danielly Figueredo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compiles educational activities in higher education that incorporate Wikipedia as a pedagogical tool between 2003 and 2024. Each activity includes detailed information such as:

    • Wikipedia version used

    • Activity title

    • Description of the activity

    • Related discipline and course code (if applicable)

    • Dates of implementation (start and end)

    • City and country where the activity took place

    • Educational level and university

    • National Council for Scientific and Technological Development (CNPq) knowledge area classification

    • Names and usernames of supporting staff and responsible professors

    • Partnerships with other institutions or initiatives

    • Links to relevant outputs on Wikipedia, Wikimedia Commons, and external sites

    Its goal is to organize these experiences to facilitate comparative analysis, identify best practices, and support the development of new educational projects involving open knowledge and active learning strategies.

    The mapping table available on this page is the result of an extensive two-year research effort, involving the analysis of over 20,000 educational projects related to the use of Wikipedia in higher education. Despite the systematic effort and methodological rigor applied, the volume, diversity, and limitations in accessing consistent data ultimately compromised the final consolidation of the table, especially after the research funding ended. Therefore, we recommend that the table be consulted with caution and critical thinking, bearing in mind that some information may be incomplete or inaccurate. A broader contextualization of the results, as well as reflections on the challenges faced, can be found in the project's final report: https://w.wiki/DeBS .

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista (2024). Wikipedia: most viewed articles in 2024 [Dataset]. https://www.statista.com/statistics/1358978/wikipedia-most-viewed-articles-by-number-of-views/
Organization logo

Wikipedia: most viewed articles in 2024

Explore at:
Dataset updated
Dec 4, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2024
Area covered
Worldwide
Description

The most viewed English-language article on Wikipedia in 2023 was Deaths in 2024, with a total of 44.4 million views. Political topics also dominated the list, with articles related to the 2024 U.S. presidential election and key political figures like Kamala Harris and Donald Trump ranking among the top ten most viewed pages. Wikipedia's language diversity As of December 2024, the English Wikipedia subdomain contained approximately 6.91 million articles, making it the largest in terms of content and registered active users. Interestingly, the Cebuano language ranked second with around 6.11 million entries, although many of these articles are reportedly generated by bots. German and French followed as the next most populous European language subdomains, each with over 18,000 active users. Compared to the rest of the internet, as of January 2024, English was the primary language for over 52 percent of websites worldwide, far outpacing Spanish at 5.5 percent and German at 4.8 percent. Global traffic to Wikipedia.org Hosted by the Wikimedia Foundation, Wikipedia.org saw around 4.4 billion unique global visits in March 2024, a slight decrease from 4.6 billion visitors in January. In addition, as of January 2024, Wikipedia ranked amongst the top ten websites with the most referring subnets worldwide.

Search
Clear search
Close search
Google apps
Main menu