Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment ratings from the English Wikipedia. Each dataset is self-contained, as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and a 10% test set using a stratified random sampling strategy.

The 2017 dataset is the preferred dataset to use; it contains 32,460 articles and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05.

The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which was not sampled: the datasets contain all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets.

For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used to train the machine learner in the wikiclass Python library, also linked below.
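To illustrate the stratified 90/10 split described above, here is a minimal sketch using scikit-learn; the file name (articles.csv) and the quality_class column are hypothetical placeholders rather than the dataset's actual schema.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names, for illustration only.
articles = pd.read_csv("articles.csv")
train, test = train_test_split(
    articles,
    test_size=0.10,                       # 90% train / 10% test, as in the published split
    stratify=articles["quality_class"],   # preserve the class proportions in both sets
    random_state=42,
)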
In March 2024, close to 4.4 billion unique global visitors accessed Wikipedia.org, slightly down from the roughly 4.4 billion visitors recorded in August of the previous year. Wikipedia is a free online encyclopedia with articles written by volunteers worldwide. The platform is hosted by the Wikimedia Foundation.
This statistic shows the results of a survey on the usage of Wikipedia in Germany from 2007 to 2014. In 2013, 74 percent of German-speaking internet users reported visiting the online encyclopedia at least occasionally.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia
Consists of metadata features and content text datasets, in the following formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
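As a rough sketch of how the per-template files might be loaded with pandas (the template name "pov" is only an example; the actual column names are documented in the schema linked above):

import pandas as pd

template = "pov"  # example template name; substitute any template in the release
features = pd.read_csv(f"{template}_features.csv")
difftxt = pd.read_csv(f"{template}_difftxt.csv.gz")
fulltxt = pd.read_csv(f"{template}_fulltxt.csv.gz")
print(features.shape, difftxt.shape, fulltxt.shape)  # pandas decompresses the .gz files based on the extension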
Dataset Description
This dataset contains the first paragraph of cleaned Wikipedia articles in English. It was obtained by transforming the Wikipedia "20220301.en" dataset as follows:

from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.en")["train"]

def get_first_paragraph(example):
    example["text"] = example["text"].split("\n")[0]
    return example

dataset = dataset.map(get_first_paragraph)
Why use this dataset?
The size of the original… See the full description on the dataset page: https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph.
As of December 2023, the English subdomain of Wikipedia had around 6.91 million articles published, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with over 2.96 million and 2.65 million entries, respectively. The only Asian language to figure among the top 10, Cebuano, had the second-most articles on the portal, amassing around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
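Assuming the dataset can be loaded directly through the Hugging Face datasets library (an assumption; the dataset page linked above is authoritative for configs and splits), a minimal sketch looks like this:

from datasets import load_dataset

# The split name is an assumption; check the dataset page for the available splits.
ds = load_dataset("jordiclive/wikipedia-summary-dataset", split="train")
print(ds[0])  # expected to contain an article title and its summary/introduction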
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Datasets with quality scores for 47 million Wikipedia articles across 55 language versions, produced by WikiRank, as of 1 August 2024.
More information about the quality scores can be found in the associated scientific papers.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
== wikiproject_to_template.halfak_20191202.yaml ==
The mapping of the canonical names of WikiProjects to all the templates that might be used to tag an article with a given WikiProject, as used for generating this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates:
* https://en.wikipedia.org/wiki/Template:WikiProject_Trade
* https://en.wikipedia.org/wiki/Template:WikiProject_trade
* https://en.wikipedia.org/wiki/Template:Wptrade

== wikiproject_taxonomy.halfak_20191202.yaml ==
A proposed mapping of WikiProjects to higher-level categories. This mapping has not been applied to the JSON dump contained here. It is based on the WikiProjects' canonical names.

== gather_wikiprojects_per_article.py ==
Old Python script that built the JSON dump described below for English Wikipedia based on wikitext/Wikidata dumps (slow and more prone to errors).

== gather_wikiprojects_per_article_pageassessments.py ==
New Python script that builds the JSON dump described below using the PageAssessments MediaWiki table in MariaDB; it is much faster and can handle languages beyond English much more easily.

== labeled_wiki_with_topics_metadata.json.bz2 ==
Each line of this bzipped JSON file corresponds to a Wikipedia article in a given language (currently Arabic, English, French, Hungarian, Turkish). The intended usage of this JSON file is to build topic classification models for Wikipedia articles. While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages are much sparser in their labels because they do not cover any WikiProjects in those languages that lack English equivalents (per Wikidata). The other languages are probably best used to supplement the English labels or as a separate test set that might have a different topic distribution.

The following properties are recorded:
* title: Wikipedia article title in that language
* article_revid: Most recent revision ID of the article for which a WikiProject assessment was made (might not be the current revision ID)
* talk_pid: Page ID of the talk page for the Wikipedia article
* talk_revid: Most recent revision ID of the talk page for which a WikiProject assessment was made (might not be the current revision ID)
* wp_templates: List of WikiProject templates from the page_assessments table
* qid: Wikidata ID corresponding to the Wikipedia article
* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs
* topics: Topic labels associated with the article based on its WikiProject templates and the WikiProject label mapping (wikiproject_taxonomy)

This version is based on the 24 May 2020 page_assessments tables and the 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Compared with previous versions of this file, the revision IDs are now the revision IDs that were most recently assessed by a WikiProject, not the current versions of the page, and the sitelinks are now given as page IDs, which are more stable and less prone to encoding issues.
The WikiProject templates are now pulled via the MediaWiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.

For example, here is the line for Agatha Christie from the English JSON file:

{'title': 'Agatha_Christie',
 'article_revid': 958377791,
 'talk_pid': 1001,
 'talk_revid': 958103309,
 'wp_templates': ["Women", "Women's History", "Women writers", "Biography", "Novels/Crime task force", "Novels", "Biography/science and academia work group", "Biography/arts and entertainment work group", "Devon", "Archaeology/Women in archaeology task force", "Archaeology"],
 'qid': 'Q35064',
 'sitelinks': {'afwiki': 19274, 'amwiki': 47582, 'anwiki': 115127, 'arwiki': 12886, ..., 'enwiki': 984, ..., 'zhwiki': 10983, 'zh_min_nanwiki': 21828, 'zh_yuewiki': 131652}}
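A minimal sketch for streaming the bzipped JSON lines and collecting (title, topics) training pairs, assuming each line is valid JSON with the properties listed above:

import bz2
import json

labeled = []
with bz2.open("labeled_wiki_with_topics_metadata.json.bz2", "rt") as f:
    for line in f:
        article = json.loads(line)
        labeled.append((article["title"], article["topics"]))  # topic labels derived from WikiProject templates

print(labeled[:3])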
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was analyzed and produced during the study described in the paper "Relating Wikipedia Article Quality to Edit Behavior and Link Structure" (under review; DOI and link to follow, see references). Its creation process and use cases are described in that paper.
For directions and code to process and evaluate this data, please see the corresponding GitHub repository: https://github.com/ruptho/editlinkquality-wikipedia.
We provide three files covering 4,941 Wikipedia articles. The "article_revisions_labeled.pkl" file provides the final, semantically labeled revisions for each analyzed article per quality category. The "article_revision_features.zip" file contains processed per-article features, divided into folders for the specific quality categories they belong to. In "article_revision_features_raw.zip", we provide the raw features as retrieved via the RevScoring API (https://pythonhosted.org/revscoring/).
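A minimal sketch for a first look at the labeled revisions file; this assumes the pickle deserializes to a pandas object, which should be verified against the GitHub repository linked above.

import pandas as pd

revisions = pd.read_pickle("article_revisions_labeled.pkl")
print(type(revisions))
# If it is a DataFrame, inspect the labeled revisions per quality category:
# print(revisions.head())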
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.
This dataset constitutes a Wikipedia link graph in which all article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles with neither incoming nor outgoing links are not part of this graph.
The format is as follows:
- Q-id of linking page (outgoing)
- Q-id of linked page (incoming)
- language version and dump date, joined by a hyphen (e.g. ckbwiki-20241101)
This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
Example entries:

$ bzcat 2024-11-06.allwiki.links.bz2 | head
1 107 ckbwiki-20241101
1 107 lawiki-20241101
1 107 ltwiki-20241101
1 107 tewiki-20241101
1 107 wuuwiki-20241101
1 111 hywwiki-20241101
1 11379 bat_smgwiki-20241101
1 11471 cdowiki-20241101
1 150 ckbwiki-20241101
1 150 lowiki-20241101
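A minimal sketch for streaming the bzipped link file and turning each record into a (source Q-id, target Q-id, wiki-dump) triple, following the three whitespace-separated fields shown above:

import bz2

edges = []
with bz2.open("2024-11-06.allwiki.links.bz2", "rt") as f:
    for line in f:
        src, dst, wiki_dump = line.split()
        edges.append((int(src), int(dst), wiki_dump))
        if len(edges) == 10:  # stop after a few records, mirroring the head example above
            break

print(edges)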
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
· This work aims to determine whether there is evidence of political bias in English Wikipedia articles.
· Wikipedia is one of the most visited domains on the Web, attracting hundreds of millions of unique users per month. Wikipedia content is also routinely used for training Large Language Models (LLMs), which are the core engines driving cutting edge AI systems.
· To study political bias in Wikipedia content, we analyze the sentiment (positive, neutral or negative) with which a set of target terms (N=1,628) with political connotations (i.e. names of recent U.S. presidents, U.S. congressmembers, U.S. Supreme Court Justices, or Prime Ministers of Western countries) are used in Wikipedia articles.
· We do not cherry-pick the set of terms included in the analysis but instead use publicly available, pre-existing lists of terms from Wikipedia and other sources.
· We find a mild to moderate tendency in Wikipedia articles to associate public figures politically aligned right-of-center with more negative sentiment than left-of-center public figures.
· These favorable associations for left-leaning public figures are apparent for names of recent U.S. Presidents, U.S. Supreme Court Justices, U.S. Senators, U.S. House of Representatives Congressmembers, U.S. State Governors, Western countries’ Prime Ministers, and prominent U.S. based journalists and media organizations.
· Despite being common, these politically asymmetrical sentiment associations are not ubiquitous. We find no evidence of them in the sentiment with which names of U.K. MPs and U.S. based think tanks are used in Wikipedia articles.
· We also find larger associations of negative emotions (i.e. anger and disgust) with right-leaning public figures and positive emotion (i.e. joy) with left-leaning public figures.
· The trends just described constitute suggestive evidence of political bias embedded in Wikipedia articles.
· We also find that some of the aforementioned political sentiment associations embedded in Wikipedia articles surface in OpenAI’s language models, suggesting that biases in Wikipedia content can percolate into widely used AI systems.
· Wikipedia’s neutral point of view policy (NPOV) aims for articles in Wikipedia to be written in an impartial and unbiased tone. Our results suggest that Wikipedia’s neutral point of view policy is not achieving its stated goal of political viewpoint neutrality.
· This report highlights areas where Wikipedia can improve in how it presents political information. Nonetheless, we want to acknowledge Wikipedia’s significant and valuable role as a public resource. We hope this work inspires efforts to uphold and strengthen Wikipedia’s principles of neutrality and impartiality.
The set of 1,653 target terms used in our analysis, the sample of Wikipedia paragraphs where they occur (as of 2022) and their sentiment and emotion annotations are provided in the files:
- WikipediaParagraphsWithTargetNGramsAndSentiment.csv
- WikipediaParagraphsWithTargetNGramsAndEmotion.csv
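As a rough illustration of how the annotation files might be explored with pandas; the grouping columns below are hypothetical and should be replaced with the actual headers in the CSV.

import pandas as pd

df = pd.read_csv("WikipediaParagraphsWithTargetNGramsAndSentiment.csv")
print(df.columns)  # inspect the real schema first

# Hypothetical column names, for illustration only:
# counts = df.groupby(["target_term", "sentiment"]).size().unstack(fill_value=0)
# print(counts.head())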
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.
There are two files:
sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)
table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)
The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.
For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT
Below is a sample code snippet to load the data
import webdataset as wds

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
    wds.Dataset(url)
    .shuffle(1000)      # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)        # group every 20 examples into a batch
)
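Once constructed, the dataset can be iterated like any other IterableDataset; a rough usage sketch, assuming that after .to_tuple("json") and .batched(20) each batch is a tuple whose first element is a list of decoded records:

for batch in dataset:
    examples = batch[0]            # assumed: a list of up to 20 decoded JSON records
    print(len(examples))
    print(examples[0]["s1_text"])  # the query sentence of the first example (see the record layout below)
    break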
Below we show how the data is organized with two examples.
Text-only
{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
 's1_all_links': {  # entities and their mentions in the sentence (start, end location)
     'Sils,_Girona': [[0, 4]],
     'municipality': [[10, 22]],
     'Comarques_of_Catalonia': [[30, 37]],
     'Selva': [[41, 46]],
     'Catalonia': [[51, 60]]
 },
 'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pair
     {'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
      's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
      's2s': [  # other sentences that contain the common entity pair, i.e. the evidence
          {'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [  # mentions of the entity pair in the evidence
               [[19, 27]],  # mentions of entity 1
               [[0, 5], [288, 293]]  # mentions of entity 2
           ],
           'all_links': {
               'Selva': [[0, 5], [288, 293]],
               'Comarques_of_Catalonia': [[19, 27]],
               'Catalonia': [[40, 49]]
           }
          },
          ...]  # there are multiple evidence sentences
     },
     ...]  # there are multiple entity pairs in the query
}
Hybrid
{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
 's1_all_links': {...},  # same as text-only
 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
 'table_pairs': [
     {'tid': 'Major_League_Baseball-1',
      'text': [  # table content, list of rows
          ['World Series Records', 'World Series Records', ...],
          ['Team', 'Number of Series won', ...],
          ['St. Louis Cardinals (NL)', '11', ...],
          ...],
      'index': [  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
          [[0, 0], [0, 1], ...],
          [[1, 0], [1, 1], ...],
          ...],
      'value_ranks': [  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
          [0, 0, ...],
          [0, 0, ...],
          [0, 10, ...],
          ...],
      'value_inv_ranks': [],  # inverse rank
      'all_links': {
          'St._Louis_Cardinals': {
              '2': [  # list of mentions in the second row; the key is row_id
                  [[2, 0], [0, 19]]  # [[row_id, col_id], [start, end]]
              ]
          },
          'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}
      },
      'name': '',  # table name, if it exists
      'pairs': {
          'pair': ['American_League', 'National_League'],
          's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mentions in the query
          'table_pair_locs': {
              '17': [  # mentions of the entity pair in row 17
                  [  # mentions of the first entity
                      [[17, 0], [3, 18]],
                      [[17, 1], [3, 18]],
                      [[17, 2], [3, 18]],
                      [[17, 3], [3, 18]]
                  ],
                  [  # mentions of the second entity
                      [[17, 0], [21, 36]],
                      [[17, 1], [21, 36]]
                  ]
              ]
          }
      }
     }
 ]
}
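Given the record layout above, here is a small sketch (our own illustration, not code from the ReasonBERT repository) that walks a text-only record and collects (query, evidence sentence) pairs:

def query_evidence_pairs(record):
    # Collect (query sentence, evidence sentence) pairs from one decoded example.
    pairs = []
    for entity_pair in record["pairs"]:          # one entry per shared entity pair
        for evidence in entity_pair["s2s"]:      # one entry per evidence passage
            start, end = evidence["s_loc"]       # span of the evidence sentence within its surrounding context
            pairs.append((record["s1_text"], evidence["text"][start:end]))
    return pairs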
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Description:
The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).
Dataset Details:
Attribution:
The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.
Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page:
- https://
- https://simple.wikipedia.org/wiki/
- https://
- https://klexikon.zum.de/wiki/
- https://eu.wikipedia.org/wiki/Txikipedia:
- https://wikikids.nl/
Related paper citation:
@inproceedings{trokhymovych-etal-2024-open,
    title = "An Open Multilingual System for Scoring Readability of {W}ikipedia",
    author = "Trokhymovych, Mykola and Sen, Indira and Gerlach, Martin",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.342/",
    doi = "10.18653/v1/2024.acl-long.342",
    pages = "6296--6311"
}
No license specified: https://academictorrents.com/nolicensespecified
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
Academic Free License 3.0 (AFL-3.0): http://opensource.org/licenses/AFL-3.0
Wikipedia is a free, collaborative, and multilingual encyclopedia and represents the largest and most popular reference work on the internet. Furthermore, it enables engagement with higher education: it can be integrated as a teaching and learning tool for students and facilitates the dissemination of scientific content to the public. Therefore, our aim is to analyze scientific production from 2005 to 2023 regarding Wikipedia as an active teaching methodology in the field of health and scientific dissemination. We investigate the PubMed, LILACS, SciELO, and CAPES Periodicals Portal databases regarding the use of Wikipedia as an active methodology in health and as a tool for scientific dissemination, through searches using the terms 'Wikipedia,' 'university,' and 'health.'
Wiki. Health. Teaching Methodology. University.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phenology plays an important role in many human–nature interactions, but these seasonal patterns are often overlooked in conservation. Here, we provide the first broad exploration of seasonal patterns of interest in nature across many species and cultures. Using data from Wikipedia, a large online encyclopedia, we analyzed 2.33 billion pageviews to articles for 31,751 species across 245 languages. We show that seasonality plays an important role in how and when people interact with plants and animals online. In total, over 25% of species in our data set exhibited a seasonal pattern in at least one of their language-edition pages, and seasonality is significantly more prevalent in pages for plants and animals than it is in a random selection of Wikipedia articles. Pageview seasonality varies across taxonomic clades in ways that reflect observable patterns in phenology, with groups such as insects and flowering plants having higher seasonality than mammals. Differences between Wikipedia language editions are significant; pages in languages spoken at higher latitudes exhibit greater seasonality overall, and species seldom show the same pattern across multiple language editions. These results have relevance to conservation policy formulation and to improving our understanding of what drives human interest in biodiversity.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
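As a rough sketch of the kind of model building the demo covers (the file names, column names, and TSV format here are assumptions for illustration; see the wiki linked above for the real schema), one could majority-vote the per-annotator labels and fit a simple text classifier:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical file and column names, for illustration only.
comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# Majority vote over the multiple annotators for each comment.
is_attack = annotations.groupby("rev_id")["attack"].mean() > 0.5
data = comments.set_index("rev_id").join(is_attack.rename("attack")).dropna(subset=["attack"])

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(data["comment"])
clf = LogisticRegression(max_iter=1000).fit(X, data["attack"].astype(int))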
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set consists of 17k Wikipedia articles which have been cleaned.
It has a training set of 12.4k articles and a validation set of 5.3k articles, which were used to train and benchmark language models for Odia in the NLP for Odia repository.
The scripts used to fetch and clean the articles can be found here.
Feel free to use this data set creatively and for building better language models.