Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment ratings from the English Wikipedia. Each dataset is self-contained, as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and a 10% test set using a stratified random sampling strategy.

The 2017 dataset is the preferred dataset to use; it contains 32,460 articles and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05.

The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which was not sampled: the datasets contain all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets.

For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used to train the machine learner in the wikiclass Python library, also linked below.
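To illustrate the stratified 90/10 split described above, here is a minimal sketch using scikit-learn; the file name (articles.csv) and the quality_class column are hypothetical placeholders rather than the dataset's actual schema.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names, for illustration only.
articles = pd.read_csv("articles.csv")
train, test = train_test_split(
    articles,
    test_size=0.10,                       # 90% train / 10% test, as in the published split
    stratify=articles["quality_class"],   # preserve the class proportions in both sets
    random_state=42,
)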
In March 2024, close to 4.4 billion unique global visitors accessed Wikipedia.org, slightly down from the roughly 4.4 billion visitors recorded in August of the previous year. Wikipedia is a free online encyclopedia with articles written by volunteers worldwide. The platform is hosted by the Wikimedia Foundation.
This statistic shows the results of a survey on the usage of Wikipedia in Germany from 2007 to 2014. In 2013, 74 percent of German-speaking internet users reported visiting the online encyclopedia at least occasionally.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia
Consists of metadata features and content text datasets, in the following formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
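As a rough sketch of how the per-template files might be loaded with pandas (the template name "pov" is only an example; the actual column names are documented in the schema linked above):

import pandas as pd

template = "pov"  # example template name; substitute any template in the release
features = pd.read_csv(f"{template}_features.csv")
difftxt = pd.read_csv(f"{template}_difftxt.csv.gz")
fulltxt = pd.read_csv(f"{template}_fulltxt.csv.gz")
print(features.shape, difftxt.shape, fulltxt.shape)  # pandas decompresses the .gz files based on the extension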
Dataset Description
This dataset contains the first paragraph of cleaned Wikipedia articles in English. It was obtained by transforming the Wikipedia "20220301.en" dataset as follows:

from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.en")["train"]

def get_first_paragraph(example):
    example["text"] = example["text"].split("\n")[0]
    return example

dataset = dataset.map(get_first_paragraph)
Why use this dataset?
The size of the original… See the full description on the dataset page: https://huggingface.co/datasets/abokbot/wikipedia-first-paragraph.
As of December 2023, the English subdomain of Wikipedia had around 6.91 million articles published, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with over 2.96 million and 2.65 million entries, respectively. The only Asian language to figure among the top 10, Cebuano, had the second-most articles on the portal, amassing around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
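Assuming the dataset can be loaded directly through the Hugging Face datasets library (an assumption; the dataset page linked above is authoritative for configs and splits), a minimal sketch looks like this:

from datasets import load_dataset

# The split name is an assumption; check the dataset page for the available splits.
ds = load_dataset("jordiclive/wikipedia-summary-dataset", split="train")
print(ds[0])  # expected to contain an article title and its summary/introduction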
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Datasets with quality scores for 47 million Wikipedia articles across 55 language versions, produced by WikiRank, as of 1 August 2024.
More information about the quality scores can be found in the associated scientific papers.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
== wikiproject_to_template.halfak_20191202.yaml ==
The mapping of the canonical names of WikiProjects to all the templates that might be used to tag an article with a given WikiProject, as used for generating this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates:
* https://en.wikipedia.org/wiki/Template:WikiProject_Trade
* https://en.wikipedia.org/wiki/Template:WikiProject_trade
* https://en.wikipedia.org/wiki/Template:Wptrade

== wikiproject_taxonomy.halfak_20191202.yaml ==
A proposed mapping of WikiProjects to higher-level categories. This mapping has not been applied to the JSON dump contained here. It is based on the WikiProjects' canonical names.

== gather_wikiprojects_per_article.py ==
Old Python script that built the JSON dump described below for English Wikipedia based on wikitext/Wikidata dumps (slow and more prone to errors).

== gather_wikiprojects_per_article_pageassessments.py ==
New Python script that builds the JSON dump described below using the PageAssessments MediaWiki table in MariaDB; it is much faster and can handle languages beyond English much more easily.

== labeled_wiki_with_topics_metadata.json.bz2 ==
Each line of this bzipped JSON file corresponds to a Wikipedia article in a given language (currently Arabic, English, French, Hungarian, Turkish). The intended usage of this JSON file is to build topic classification models for Wikipedia articles. While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages are much sparser in their labels because they do not cover any WikiProjects in those languages that lack English equivalents (per Wikidata). The other languages are probably best used to supplement the English labels or as a separate test set that might have a different topic distribution.

The following properties are recorded:
* title: Wikipedia article title in that language
* article_revid: Most recent revision ID of the article for which a WikiProject assessment was made (might not be the current revision ID)
* talk_pid: Page ID of the talk page for the Wikipedia article
* talk_revid: Most recent revision ID of the talk page for which a WikiProject assessment was made (might not be the current revision ID)
* wp_templates: List of WikiProject templates from the page_assessments table
* qid: Wikidata ID corresponding to the Wikipedia article
* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs
* topics: Topic labels associated with the article based on its WikiProject templates and the WikiProject label mapping (wikiproject_taxonomy)

This version is based on the 24 May 2020 page_assessments tables and the 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Compared with previous versions of this file, the revision IDs are now the revision IDs that were most recently assessed by a WikiProject, not the current versions of the page, and the sitelinks are now given as page IDs, which are more stable and less prone to encoding issues.
The WikiProject templates are now pulled via the MediaWiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.

For example, here is the line for Agatha Christie from the English JSON file:

{'title': 'Agatha_Christie',
 'article_revid': 958377791,
 'talk_pid': 1001,
 'talk_revid': 958103309,
 'wp_templates': ["Women", "Women's History", "Women writers", "Biography", "Novels/Crime task force", "Novels", "Biography/science and academia work group", "Biography/arts and entertainment work group", "Devon", "Archaeology/Women in archaeology task force", "Archaeology"],
 'qid': 'Q35064',
 'sitelinks': {'afwiki': 19274, 'amwiki': 47582, 'anwiki': 115127, 'arwiki': 12886, ..., 'enwiki': 984, ..., 'zhwiki': 10983, 'zh_min_nanwiki': 21828, 'zh_yuewiki': 131652}}
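A minimal sketch for streaming the bzipped JSON lines and collecting (title, topics) training pairs, assuming each line is valid JSON with the properties listed above:

import bz2
import json

labeled = []
with bz2.open("labeled_wiki_with_topics_metadata.json.bz2", "rt") as f:
    for line in f:
        article = json.loads(line)
        labeled.append((article["title"], article["topics"]))  # topic labels derived from WikiProject templates

print(labeled[:3])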
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was analyzed and produced during the study described in the paper "Relating Wikipedia Article Quality to Edit Behavior and Link Structure" (under review; DOI and link to follow, see references). Its creation process and use cases are described in that paper.
For directions and code to process and evaluate this data, please see the corresponding GitHub repository: https://github.com/ruptho/editlinkquality-wikipedia.
We provide three files covering 4,941 Wikipedia articles. The "article_revisions_labeled.pkl" file provides the final, semantically labeled revisions for each analyzed article per quality category. The "article_revision_features.zip" file contains processed per-article features, divided into folders for the specific quality categories they belong to. In "article_revision_features_raw.zip", we provide the raw features as retrieved via the RevScoring API (https://pythonhosted.org/revscoring/).
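A minimal sketch for a first look at the labeled revisions file; this assumes the pickle deserializes to a pandas object, which should be verified against the GitHub repository linked above.

import pandas as pd

revisions = pd.read_pickle("article_revisions_labeled.pkl")
print(type(revisions))
# If it is a DataFrame, inspect the labeled revisions per quality category:
# print(revisions.head())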
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.
This dataset constitutes a Wikipedia link graph in which all article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles with neither incoming nor outgoing links are not part of this graph.
The format is as follows:
- Q-id of linking page (outgoing)
- Q-id of linked page (incoming)
- language version and dump date, joined by a hyphen (e.g. ckbwiki-20241101)
This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
Example entries:

$ bzcat 2024-11-06.allwiki.links.bz2 | head
1 107 ckbwiki-20241101
1 107 lawiki-20241101
1 107 ltwiki-20241101
1 107 tewiki-20241101
1 107 wuuwiki-20241101
1 111 hywwiki-20241101
1 11379 bat_smgwiki-20241101
1 11471 cdowiki-20241101
1 150 ckbwiki-20241101
1 150 lowiki-20241101
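A minimal sketch for streaming the bzipped link file and turning each record into a (source Q-id, target Q-id, wiki-dump) triple, following the three whitespace-separated fields shown above:

import bz2

edges = []
with bz2.open("2024-11-06.allwiki.links.bz2", "rt") as f:
    for line in f:
        src, dst, wiki_dump = line.split()
        edges.append((int(src), int(dst), wiki_dump))
        if len(edges) == 10:  # stop after a few records, mirroring the head example above
            break

print(edges)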
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
· This work aims to determine whether there is evidence of political bias in English Wikipedia articles.
· Wikipedia is one of the most visited domains on the Web, attracting hundreds of millions of unique users per month. Wikipedia content is also routinely used for training Large Language Models (LLMs), which are the core engines driving cutting edge AI systems.
· To study political bias in Wikipedia content, we analyze the sentiment (positive, neutral or negative) with which a set of target terms (N=1,628) with political connotations (i.e. names of recent U.S. presidents, U.S. congressmembers, U.S. Supreme Court Justices, or Prime Ministers of Western countries) are used in Wikipedia articles.
· We do not cherry-pick the set of terms included in the analysis but instead use publicly available, pre-existing lists of terms from Wikipedia and other sources.
· We find a mild to moderate tendency in Wikipedia articles to associate public figures politically aligned right-of-center with more negative sentiment than left-of-center public figures.
· These favorable associations for left-leaning public figures are apparent for names of recent U.S. Presidents, U.S. Supreme Court Justices, U.S. Senators, U.S. House of Representatives Congressmembers, U.S. State Governors, Western countries’ Prime Ministers, and prominent U.S. based journalists and media organizations.
· Despite being common, these politically asymmetrical sentiment associations are not ubiquitous. We find no evidence of them in the sentiment with which names of U.K. MPs and U.S. based think tanks are used in Wikipedia articles.
· We also find larger associations of negative emotions (i.e. anger and disgust) with right-leaning public figures and positive emotion (i.e. joy) with left-leaning public figures.
· The trends just described constitute suggestive evidence of political bias embedded in Wikipedia articles.
· We also find that some of the aforementioned political sentiment associations embedded in Wikipedia articles surface in OpenAI’s language models, suggesting that biases in Wikipedia content can percolate into widely used AI systems.
· Wikipedia’s neutral point of view policy (NPOV) aims for articles in Wikipedia to be written in an impartial and unbiased tone. Our results suggest that Wikipedia’s neutral point of view policy is not achieving its stated goal of political viewpoint neutrality.
· This report highlights areas where Wikipedia can improve in how it presents political information. Nonetheless, we want to acknowledge Wikipedia’s significant and valuable role as a public resource. We hope this work inspires efforts to uphold and strengthen Wikipedia’s principles of neutrality and impartiality.
The set of 1,653 target terms used in our analysis, the sample of Wikipedia paragraphs where they occur (as of 2022) and their sentiment and emotion annotations are provided in the files:
- WikipediaParagraphsWithTargetNGramsAndSentiment.csv
- WikipediaParagraphsWithTargetNGramsAndEmotion.csv
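As a rough illustration of how the annotation files might be explored with pandas; the grouping columns below are hypothetical and should be replaced with the actual headers in the CSV.

import pandas as pd

df = pd.read_csv("WikipediaParagraphsWithTargetNGramsAndSentiment.csv")
print(df.columns)  # inspect the real schema first

# Hypothetical column names, for illustration only:
# counts = df.groupby(["target_term", "sentiment"]).size().unstack(fill_value=0)
# print(counts.head())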
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.
There are two files:
sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)
table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)
The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.
For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT
Below is a sample code snippet to load the data
import webdataset as wds

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
    wds.Dataset(url)
    .shuffle(1000)      # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)        # group every 20 examples into a batch
)
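Once constructed, the dataset can be iterated like any other IterableDataset; a rough usage sketch, assuming that after .to_tuple("json") and .batched(20) each batch is a tuple whose first element is a list of decoded records:

for batch in dataset:
    examples = batch[0]            # assumed: a list of up to 20 decoded JSON records
    print(len(examples))
    print(examples[0]["s1_text"])  # the query sentence of the first example (see the record layout below)
    break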
Below we show how the data is organized with two examples.
Text-only
{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
 's1_all_links': {  # entities and their mentions in the sentence (start, end location)
     'Sils,_Girona': [[0, 4]],
     'municipality': [[10, 22]],
     'Comarques_of_Catalonia': [[30, 37]],
     'Selva': [[41, 46]],
     'Catalonia': [[51, 60]]
 },
 'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pair
     {'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
      's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
      's2s': [  # other sentences that contain the common entity pair, i.e. the evidence
          {'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [  # mentions of the entity pair in the evidence
               [[19, 27]],  # mentions of entity 1
               [[0, 5], [288, 293]]  # mentions of entity 2
           ],
           'all_links': {
               'Selva': [[0, 5], [288, 293]],
               'Comarques_of_Catalonia': [[19, 27]],
               'Catalonia': [[40, 49]]
           }
          },
          ...]  # there are multiple evidence sentences
     },
     ...]  # there are multiple entity pairs in the query
}
Hybrid
{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
 's1_all_links': {...},  # same as text-only
 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
 'table_pairs': [
     {'tid': 'Major_League_Baseball-1',
      'text': [  # table content, list of rows
          ['World Series Records', 'World Series Records', ...],
          ['Team', 'Number of Series won', ...],
          ['St. Louis Cardinals (NL)', '11', ...],
          ...],
      'index': [  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
          [[0, 0], [0, 1], ...],
          [[1, 0], [1, 1], ...],
          ...],
      'value_ranks': [  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
          [0, 0, ...],
          [0, 0, ...],
          [0, 10, ...],
          ...],
      'value_inv_ranks': [],  # inverse rank
      'all_links': {
          'St._Louis_Cardinals': {
              '2': [  # list of mentions in the second row; the key is row_id
                  [[2, 0], [0, 19]]  # [[row_id, col_id], [start, end]]
              ]
          },
          'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}
      },
      'name': '',  # table name, if it exists
      'pairs': {
          'pair': ['American_League', 'National_League'],
          's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mentions in the query
          'table_pair_locs': {
              '17': [  # mentions of the entity pair in row 17
                  [  # mentions of the first entity
                      [[17, 0], [3, 18]],
                      [[17, 1], [3, 18]],
                      [[17, 2], [3, 18]],
                      [[17, 3], [3, 18]]
                  ],
                  [  # mentions of the second entity
                      [[17, 0], [21, 36]],
                      [[17, 1], [21, 36]]
                  ]
              ]
          }
      }
     }
 ]
}
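Given the record layout above, here is a small sketch (our own illustration, not code from the ReasonBERT repository) that walks a text-only record and collects (query, evidence sentence) pairs:

def query_evidence_pairs(record):
    # Collect (query sentence, evidence sentence) pairs from one decoded example.
    pairs = []
    for entity_pair in record["pairs"]:          # one entry per shared entity pair
        for evidence in entity_pair["s2s"]:      # one entry per evidence passage
            start, end = evidence["s_loc"]       # span of the evidence sentence within its surrounding context
            pairs.append((record["s1_text"], evidence["text"][start:end]))
    return pairs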
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Description:
The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).
Dataset Details:
Attribution:
The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.
Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page:
- https://
- https://simple.wikipedia.org/wiki/
- https://
- https://klexikon.zum.de/wiki/
- https://eu.wikipedia.org/wiki/Txikipedia:
- https://wikikids.nl/
Related paper citation:
@inproceedings{trokhymovych-etal-2024-open,
    title = "An Open Multilingual System for Scoring Readability of {W}ikipedia",
    author = "Trokhymovych, Mykola and Sen, Indira and Gerlach, Martin",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.342/",
    doi = "10.18653/v1/2024.acl-long.342",
    pages = "6296--6311"
}
No license specified: https://academictorrents.com/nolicensespecified
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
Academic Free License 3.0 (AFL-3.0): http://opensource.org/licenses/AFL-3.0
Wikipedia is a free, collaborative, and multilingual encyclopedia and represents the largest and most popular reference work on the internet. Furthermore, it enables engagement with higher education: it can be integrated as a teaching and learning tool for students and facilitates the dissemination of scientific content to the public. Therefore, our aim is to analyze scientific production from 2005 to 2023 regarding Wikipedia as an active teaching methodology in the field of health and scientific dissemination. We investigate the PubMed, LILACS, SciELO, and CAPES Periodicals Portal databases regarding the use of Wikipedia as an active methodology in health and as a tool for scientific dissemination, through searches using the terms 'Wikipedia,' 'university,' and 'health.'
Wiki. Health. Teaching Methodology. University.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phenology plays an important role in many human–nature interactions, but these seasonal patterns are often overlooked in conservation. Here, we provide the first broad exploration of seasonal patterns of interest in nature across many species and cultures. Using data from Wikipedia, a large online encyclopedia, we analyzed 2.33 billion pageviews to articles for 31,751 species across 245 languages. We show that seasonality plays an important role in how and when people interact with plants and animals online. In total, over 25% of species in our data set exhibited a seasonal pattern in at least one of their language-edition pages, and seasonality is significantly more prevalent in pages for plants and animals than it is in a random selection of Wikipedia articles. Pageview seasonality varies across taxonomic clades in ways that reflect observable patterns in phenology, with groups such as insects and flowering plants having higher seasonality than mammals. Differences between Wikipedia language editions are significant; pages in languages spoken at higher latitudes exhibit greater seasonality overall, and species seldom show the same pattern across multiple language editions. These results have relevance to conservation policy formulation and to improving our understanding of what drives human interest in biodiversity.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
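As a rough sketch of the kind of model building the demo covers (the file names, column names, and TSV format here are assumptions for illustration; see the wiki linked above for the real schema), one could majority-vote the per-annotator labels and fit a simple text classifier:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical file and column names, for illustration only.
comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# Majority vote over the multiple annotators for each comment.
is_attack = annotations.groupby("rev_id")["attack"].mean() > 0.5
data = comments.set_index("rev_id").join(is_attack.rename("attack")).dropna(subset=["attack"])

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(data["comment"])
clf = LogisticRegression(max_iter=1000).fit(X, data["attack"].astype(int))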
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set consists of 17k Wikipedia articles which have been cleaned.
It has a training set of 12.4k articles and a validation set of 5.3k articles, which were used to train and benchmark language models for Odia in the NLP for Odia repository.
The scripts used to fetch and clean the articles can be found here.
Feel free to use this data set creatively and for building better language models.