As of December 2024, the English subdomain of Wikipedia had around 6.91 million published articles, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with over 2.96 million and 2.65 million entries, respectively. Cebuano, the only Asian language among the top 10, had the second-most articles on the portal, with around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots as indicated by the filename.

The data is bzip-compressed and each row is tab-delimited, containing the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each topic in the taxonomy (https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy) applies to the article:

* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: the article's Wikidata item ID, if it has one -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links, i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.

For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. The sample includes 201,196 Wikidata IDs, which led to 340,290 articles.
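For orientation, here is a minimal Python sketch for streaming such a file; the filename is hypothetical and the presence of a header row naming the topic columns is an assumption.

import bz2
import csv

# Minimal sketch for streaming the bzip2-compressed, tab-delimited predictions.
# The filename is hypothetical, and the presence of a header row naming the
# topic columns is an assumption.
PATH = "article_topics_outlinks_2020-12.tsv.bz2"

with bz2.open(PATH, mode="rt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)              # wiki_db, qid, pid, num_outlinks, <topics...>
    topic_names = header[4:]
    for row in reader:
        wiki_db, qid, pid, num_outlinks = row[:4]
        probs = dict(zip(topic_names, map(float, row[4:])))
        confident = {t: p for t, p in probs.items() if p >= 0.5}
        print(wiki_db, qid, pid, num_outlinks, confident)
        break                          # remove to stream the whole file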
As of December 2024, the English subdomain of Wikipedia was by far the largest both in terms of participation, with more than 122 thousand active registered users, and in terms of published articles. The languages French and German followed, each with over 18 thousand active users. Overall, the English subdomain counted around 6.91 million published articles, more entries than any other language edition.
The English-language version of Wikipedia had over 6.76 million published articles at the beginning of 2024, making it by far the largest language version of the website in terms of entries and active registered users. By the end of the same year, the number of articles had increased to 6.91 million, a growth of around 2 percent over that period.
The most viewed English-language article on Wikipedia in 2024 was "Deaths in 2024", with a total of 44.4 million views. Political topics also dominated the list, with articles related to the 2024 U.S. presidential election and key political figures like Kamala Harris and Donald Trump ranking among the top ten most viewed pages.

Wikipedia's language diversity
As of December 2024, the English Wikipedia subdomain contained approximately 6.91 million articles, making it the largest in terms of content and registered active users. Interestingly, the Cebuano language ranked second with around 6.11 million entries, although many of these articles are reportedly generated by bots. German and French followed as the next most populous European-language subdomains, each with over 18,000 active users. Compared to the rest of the internet, as of January 2024, English was the primary language for over 52 percent of websites worldwide, far outpacing Spanish at 5.5 percent and German at 4.8 percent.

Global traffic to Wikipedia.org
Hosted by the Wikimedia Foundation, Wikipedia.org saw around 4.4 billion unique global visits in March 2024, a slight decrease from 4.6 billion visitors in January. In addition, as of January 2024, Wikipedia ranked among the top ten websites with the most referring subnets worldwide.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets include lists of over 43 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Additionally, the datasets contain the quality measures (metrics) that directly affect these scores. The quality measures were extracted from the Wikipedia dumps of April 2022.
License
All files included in these datasets are released under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
Format
page_id -- the identifier of the Wikipedia article (int), e.g. 840191
page_name -- the title of the Wikipedia article (utf-8), e.g. Sagittarius A*
wikirank_quality -- quality score for the Wikipedia article on a scale of 0-100 (as of April 1, 2022); this is a synthetic measure calculated from the metrics below (also included in the datasets)
norm_len -- normalized "page length"
norm_refs -- normalized "number of references"
norm_img -- normalized "number of images"
norm_sec -- normalized "number of sections"
norm_reflen -- normalized "references per length ratio"
norm_authors -- normalized "number of authors" (without bots and anonymous users)
flawtemps -- flaw templates
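A minimal loading sketch in Python, assuming the per-language files are tab-separated text with the columns listed above (the filename and delimiter are assumptions):

import pandas as pd

# Minimal sketch; the filename and the tab delimiter are assumptions.
df = pd.read_csv("enwiki.wikirank_2022.tsv", sep="\t")

# Highest-quality articles first.
top = df.sort_values("wikirank_quality", ascending=False)
print(top[["page_id", "page_name", "wikirank_quality", "norm_refs"]].head(10))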
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
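A minimal access sketch, assuming this is the dataset mirrored on the Hugging Face Hub as wikimedia/wikipedia with one configuration per dump date and language (the repository name, configuration name, and field names are assumptions):

from datasets import load_dataset

# Minimal sketch; the Hub repository name and configuration are assumptions.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

article = wiki[0]
print(article["title"])
print(article["text"][:500])  # cleaned plain text of the article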
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a list of over 37 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from July 2018.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• revision_id -- the Wikipedia revision of the article (int), e.g. 24284811
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score 0-100
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93,
which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data was downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.
The actual dataset is provided as used in the stratified k-fold with k=4, in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
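A minimal Python sketch for iterating over the four folds once the archive has been unpacked (column handling is left out since the CSV schema is not restated here):

import pandas as pd

# Minimal sketch for iterating over the four stratified folds after unpacking
# train_testdata_4folds.tar.gz into the current directory.
for fold in range(1, 5):
    train = pd.read_csv(f"{fold}/train.csv")
    test = pd.read_csv(f"{fold}/test.csv")
    print(f"fold {fold}: {len(train)} training rows, {len(test)} test rows")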
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
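A minimal reading sketch, assuming a locally downloaded zip archive of JSONL files (the archive name is hypothetical; see the dataset page for the actual layout):

import json
import zipfile

# Minimal sketch; the archive name is hypothetical.
with zipfile.ZipFile("enwiki_structured.jsonl.zip") as zf:
    with zf.open(zf.namelist()[0]) as f:
        for line in f:
            article = json.loads(line)
            print(sorted(article.keys()))  # inspect the schema of one article
            break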
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Description:
The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).
Dataset Details:
Attribution:
The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.
Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page:
https://
https://simple.wikipedia.org/wiki/
https://
https://klexikon.zum.de/wiki/
https://eu.wikipedia.org/wiki/Txikipedia:
https://wikikids.nl/
Related paper citation:
@inproceedings{trokhymovych-etal-2024-open,
    title = "An Open Multilingual System for Scoring Readability of {W}ikipedia",
    author = "Trokhymovych, Mykola and Sen, Indira and Gerlach, Martin",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.342/",
    doi = "10.18653/v1/2024.acl-long.342",
    pages = "6296--6311"
}
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
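As an illustration of the extraction idea described above (not the project's actual pipeline), the following Python sketch recovers the text added between two hypothetical revisions:

import difflib

# Illustrative only: extract the text added between two revisions of a talk page.
old_revision = "== Topic ==\nFirst comment.\n"
new_revision = "== Topic ==\nFirst comment.\nSecond comment, newly added.\n"

added = [
    line[2:]
    for line in difflib.ndiff(old_revision.splitlines(), new_revision.splitlines())
    if line.startswith("+ ")
]
print(added)  # ['Second comment, newly added.']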
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linear regression for the logarithm of the number of edits.
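A minimal sketch of such a log-linear fit, using entirely hypothetical data since the entry does not specify the covariates:

import numpy as np

# Hypothetical data: x is an unspecified predictor, edits are raw edit counts.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
edits = np.array([12, 30, 70, 160, 420])

# Ordinary least squares on log(edits).
slope, intercept = np.polyfit(x, np.log(edits), 1)
print(f"log(edits) ~ {intercept:.2f} + {slope:.2f} * x")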
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a list of over 39 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Quality scores of articles are based on Wikipedia dumps from May 2019; popularity and authors' interest are based on activity in April 2019.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Format
• page_id -- the identifier of the Wikipedia article (int), e.g. 4519301
• page_name -- the title of the Wikipedia article (utf-8), e.g. General relativity
• wikirank_quality -- quality score for the Wikipedia article on a scale of 0-100 (as of May 1, 2019)
• popularity -- median of the daily number of page views of the Wikipedia article during April 2019
• authors_interest -- number of authors of the Wikipedia article during April 2019
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for the most traffic in a single day. Since the outbreak of the Covid-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read - and in some cases also contribute - knowledge, information and data about the virus to an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how the scientific research in this field was represented on Wikipedia. Using citations as a readout, we try to map how COVID-19-related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the Covid-19 content, is key to understanding the digital knowledge ecosphere during the pandemic.

To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we ran a EuroPMC search on Covid-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search led to 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs. Next, from our corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method allowed us to produce tables of annotated citations and of the information extracted from each Wikipedia article, such as books, websites and newspapers.

Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code, and for other bash/Python utility scripts related to this project.
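As an illustration of the regular-expression-based identifier extraction mentioned above (the authors' actual patterns are in the WikiCitationHistoRy repository), a minimal Python sketch for pulling DOIs out of wikitext might look like this:

import re

# Illustrative pattern, not the authors' exact expression.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s|}\]"<>]+', re.IGNORECASE)

wikitext = '{{cite journal |doi=10.1038/s41586-020-2012-7 |title=A pneumonia outbreak ...}}'
print(DOI_RE.findall(wikitext))  # ['10.1038/s41586-020-2012-7']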
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Access a wealth of information, including article titles, raw text, images, and structured references. Popular use cases include knowledge extraction, trend analysis, and content development.
Use our Wikipedia Articles dataset to access a vast collection of articles across a wide range of topics, from history and science to culture and current events. This dataset offers structured data on articles, categories, and revision histories, enabling deep analysis into trends, knowledge gaps, and content development.
Tailored for researchers, data scientists, and content strategists, this dataset allows for in-depth exploration of article evolution, topic popularity, and interlinking patterns. Whether you are studying public knowledge trends, performing sentiment analysis, or developing content strategies, the Wikipedia Articles dataset provides a rich resource to understand how information is shared and consumed globally.
Dataset Features
- url: Direct URL to the original Wikipedia article.
- title: The title or name of the Wikipedia article.
- table_of_contents: A list or structure outlining the article's sections and hierarchy.
- raw_text: Unprocessed full text content of the article.
- cataloged_text: Cleaned and structured version of the article’s content, optimized for analysis.
- images: Links or data on images embedded in the article.
- see_also: Related articles linked under the “See Also” section.
- references: Sources cited in the article for credibility.
- external_links: Links to external websites or resources mentioned in the article.
- categories: Tags or groupings classifying the article by topic or domain.
- timestamp: Last edit date or revision time of the article snapshot.
Distribution
- Data Volume: 11 Columns and 2.19 M Rows
- Format: CSV
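A minimal loading sketch in Python, assuming a local CSV export with the eleven columns listed above (the filename is hypothetical):

import pandas as pd

# Minimal sketch; the filename is hypothetical.
df = pd.read_csv("wikipedia_articles.csv")

print(df.columns.tolist())       # url, title, table_of_contents, raw_text, ...
print(df[["title", "timestamp"]].head())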
Usage
This dataset supports a wide range of applications:
- Knowledge Extraction: Identify key entities, relationships, or events from Wikipedia content.
- Content Strategy & SEO: Discover trending topics and content gaps.
- Machine Learning: Train NLP models (e.g., summarisation, classification, QA systems).
- Historical Trend Analysis: Study how public interest in topics changes over time.
- Link Graph Modeling: Understand how information is interconnected.
Coverage
- Geographic Coverage: Global (multi-language Wikipedia versions also available)
- Time Range: Continuous updates; snapshots available from early 2000s to present.
License
CUSTOM
Please review the respective licenses below:
Who Can Use It
- Data Scientists: For training or testing NLP and information retrieval systems.
- Researchers: For computational linguistics, social science, or digital humanities.
- Businesses: To enhance AI-powered content tools or customer insight platforms.
- Educators/Students: For building projects, conducting research, or studying knowledge systems.
Suggested Dataset Names
1. Wikipedia Corpus+
2. Wikipedia Stream Dataset
3. Wikipedia Knowledge Bank
4. Open Wikipedia Dataset
~Up to $0.0025 per record. Min order $250
Approximately 283 new records are added each month. Approximately 1.12M records are updated each month. Get the complete dataset each delivery, including all records. Retrieve only the data you need with the flexibility to set Smart Updates.
New snapshot each month, 12 snapshots/year Paid monthly
New snapshot each quarter, 4 snapshots/year Paid quarterly
New snapshot every 6 months, 2 snapshots/year Paid twice-a-year
New snapshot one-time delivery Paid once
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The existence of a shared classification system is essential to knowledge production, transfer, and sharing. Studies of knowledge classification, however, rarely consider the fact that knowledge categories exist within hierarchical information systems designed to facilitate knowledge search and discovery. This neglect is problematic whenever information about categorical membership is itself used to evaluate the quality of the items that the category contains. The main objective of this paper is to show that the effects of category membership depend on the position that a category occupies in the hierarchical knowledge classification system of Wikipedia, an open knowledge production and sharing platform taking the form of a freely accessible on-line encyclopedia. Using data on all English-language Wikipedia articles, we examine how the position that a category occupies in the classification hierarchy affects the attention that articles in that category attract from Wikipedia editors, and the editors' evaluation of the quality of those articles. Specifically, we show that Wikipedia articles assigned to coarse-grained categories (i.e., categories that occupy higher positions in the hierarchical knowledge classification system) garner more attention from Wikipedia editors (i.e., attract a higher volume of text editing activity), but receive lower evaluations (i.e., they are considered to be of lower quality). The negative relation between attention and quality implied by this result is consistent with current theories of social categorization, but it also goes beyond available results by showing that the effects of categorization on evaluation depend on the position that a category occupies in a hierarchical knowledge classification system.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia temporal graph.
The dataset is based on two Wikipedia SQL dumps: (1) English language articles and (2) user visit counts per page per hour (aka pagecounts). The original datasets are publicly available on the Wikimedia website.
The static graph structure is extracted from English-language Wikipedia articles. Redirects are removed. Before building the Wikipedia graph we introduce thresholds on the minimum number of visits per hour and the maximum in-degree: we remove pages that never reach 500 visits per hour during the specified period, and we remove nodes (pages) with an in-degree higher than 8,000, to build a more meaningful initial graph. After cleaning, the graph contains 116,016 nodes (out of 4,856,639 pages in total) and 6,573,475 edges. The graph can be imported in two ways: (1) using edges.csv and vertices.csv, or (2) using the enwiki-20150403-graph.gt file, which can be opened with the open source Python library graph-tool.

The time-series data contain users' visit counts from 02:00, 23 September 2014 until 23:00, 30 April 2015; the total number of hours is 5,278. The data is stored in two formats: CSV and H5. The CSV file contains data in the format [page_id :: count_views :: layer], where layer represents an hour. In the H5 file, each layer likewise corresponds to an hour.
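A minimal import sketch in Python, following the two options described above (whether the CSV files carry headers is an assumption):

import pandas as pd
from graph_tool.all import load_graph  # open source library graph-tool

# Option 2: load the prepared graph file directly.
g = load_graph("enwiki-20150403-graph.gt")
print(g.num_vertices(), g.num_edges())  # expected: 116016 nodes, 6573475 edges

# Option 1: build from the CSV files (header presence is an assumption).
edges = pd.read_csv("edges.csv")
vertices = pd.read_csv("vertices.csv")
print(len(vertices), len(edges))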
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.
This dataset constitutes a Wikipedia link graph where all the article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles with neither incoming nor outgoing links are not part of this graph.
The format is as follows:
Q-id of linking page (outgoing)
This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
Example entries:
bzcat 2022-11-10.allwiki.links.bz2 | head
1 1001051 zhwiki-20221101
1 1001 azbwiki-20221101
1 10022 nds_nlwiki-20221101
1 1005917 ptwiki-20221101
1 10090 guwiki-20221101
1 10090 tawiki-20221101
1 101038 glwiki-20221101
1 101072 idwiki-20221101
1 101072 lvwiki-20221101
1 101072 ndswiki-20221101
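A minimal Python sketch for streaming the compressed link file shown above; which of the two Q-id columns is the link source is not restated here, so the sketch only counts rows per wiki dump:

import bz2
from collections import Counter

# Minimal sketch: each row holds two Wikidata Q-ids (without the "Q" prefix) and
# the wiki dump the link was extracted from; here we only count rows per dump.
links_per_wiki = Counter()
with bz2.open("2022-11-10.allwiki.links.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        qid_a, qid_b, wiki = line.split()
        links_per_wiki[wiki] += 1

print(links_per_wiki.most_common(5))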
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers approximately 23M.
By leveraging these large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result of this process, we provide 3 versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences for each version (the exact number varies between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower compared to the "Fine-Grained NER" versions.
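As an illustration of that coarse-graining step, a minimal Python sketch might map fine-grained types to the four coarse labels via a lookup table (the fine-grained type names below are hypothetical, not the dataset's actual labels):

# Hypothetical fine-grained type names; the dataset's actual labels differ.
COARSE_MAP = {
    "football_team": "organization",
    "politician": "person",
    "river": "location",
    "award": "misc",
}

def to_coarse(fine_type: str) -> str:
    # Fall back to "misc" when no confident mapping exists.
    return COARSE_MAP.get(fine_type, "misc")

print(to_coarse("politician"))  # person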
All processes are explained in our published white paper for Turkish; however, the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.