Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
wikipedia persons masked: a filtered version of the Wikipedia dataset, containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is
Languages
English only… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
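For orientation, the dataset above can be loaded with the Hugging Face datasets library (a minimal sketch; the dataset ID comes from the URL above, and the split name is an assumption):

```python
from datasets import load_dataset

# Load the filtered, person-only Wikipedia pages from the Hugging Face Hub.
# The dataset ID is taken from the URL above; the "train" split is an assumption.
dataset = load_dataset("rcds/wikipedia-persons-masked", split="train")
print(dataset[0])  # inspect one masked page
```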
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, with the formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
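For illustration, the CSV and gzipped-CSV files can be read directly with pandas (a minimal sketch; the template name "pov" and the file locations are assumptions, and the column schema is documented at the link above):

```python
import pandas as pd

# Hypothetical template name; substitute the reliability template you are interested in.
template = "pov"

# Plain CSV with metadata features.
features = pd.read_csv(f"{template}_features.csv")

# Gzipped CSVs with diff text and full text; pandas handles the .gz compression.
difftxt = pd.read_csv(f"{template}_difftxt.csv.gz", compression="gzip")
fulltxt = pd.read_csv(f"{template}_fulltxt.csv.gz", compression="gzip")

print(features.shape, difftxt.shape, fulltxt.shape)
```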
The data in this dataset is extracted from the BigQuery Wikipedia dataset. It includes monthly pageview data for Wikipedia pages of people (Wikidata items with P31 (instance of) = Q5 (human)) for the years 2015-2023. Date accessed: March 2024.
Below are the queries used to get the dataset:
```sql
-- 1. Get list of people on Wikipedia
SELECT DISTINCT en_wiki -- page title in English Wikipedia
FROM `project.wikipedia_pageviews.wikidata`,
  UNNEST(instance_of) AS instance_of_struct
WHERE instance_of_struct.numeric_id = 5 -- instance_of = Q5 => human

-- 2. Get pageview data for those people
SELECT title, DATETIME_TRUNC(datehour, MONTH) AS month, SUM(views) AS monthly_views
FROM `project.wikipedia_pageviews.pageviews_20xx` a -- replace xx with the desired year
JOIN `project.data_for_project.distinct_people` b
  ON a.title = b.en_wiki
WHERE datehour IS NOT NULL AND wiki = "en"
GROUP BY title, DATETIME_TRUNC(datehour, MONTH)

-- 3. Get wikidata for those people
SELECT *
FROM `project.wikipedia_pageviews.wikidata`,
  UNNEST(instance_of) AS instance_of_struct
WHERE instance_of_struct.numeric_id = 5
```
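If the tables are available in your own BigQuery project, queries like these can be run from Python with the official client (a sketch; `google-cloud-bigquery` must be installed and authenticated, and the `project.*` table names above are placeholders):

```python
from google.cloud import bigquery

# Assumes Application Default Credentials are configured for your Google Cloud project.
client = bigquery.Client()

query = """
SELECT DISTINCT en_wiki
FROM `project.wikipedia_pageviews.wikidata`,
  UNNEST(instance_of) AS instance_of_struct
WHERE instance_of_struct.numeric_id = 5
"""

# Run the query and pull the result into a pandas DataFrame.
people = client.query(query).to_dataframe()
print(len(people), "distinct people found")
```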
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation" (see the GitHub repo and the original paper).

@inproceedings{zheng-etal-2020-diffks,
  title = "{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
  author = "Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
  booktitle = "Findings of EMNLP",
  year = "2020"
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.
More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
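As a sketch, such an edge list can be loaded into a directed graph for analysis (assuming each line holds a source user ID, a target user ID, and a Unix timestamp separated by whitespace; the file name is hypothetical, so check the linked page for the exact format):

```python
import networkx as nx

# Build a directed multigraph: one edge per talk-page message, annotated with its timestamp.
G = nx.MultiDiGraph()
with open("wiki-talk-en.txt") as f:  # hypothetical file name
    for line in f:
        parts = line.split()
        if not parts or line.startswith("#"):  # skip blank lines and possible comment headers
            continue
        src, dst, ts = parts[:3]
        G.add_edge(int(src), int(dst), timestamp=int(ts))

print(G.number_of_nodes(), "users,", G.number_of_edges(), "messages")
```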
Wikipedia Citations is a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further equip an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science.
TopicNavi/Wikipedia-example-data dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Threatthriver
The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.
**Source for details below:** https://zenodo.org/record/3605388#.YWitsdnML0o
Dataset details
Part 1: HTML revision history The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and *p$2p$3* indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML).
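To give a concrete picture of Part 1, the gzipped JSON files can be streamed one revision at a time (a minimal sketch; the file name, the one-object-per-line layout, and field names such as `id` and `html` are assumptions, so consult the Zenodo record above for the real schema):

```python
import gzip
import json

# Stream one gzip-compressed JSON file of HTML revisions without loading it all into memory.
# File name and field names are illustrative only.
with gzip.open("revisions_000001.json.gz", "rt", encoding="utf-8") as f:
    for line in f:                     # assuming one JSON object (revision) per line
        revision = json.loads(line)
        # Each revision carries the usual dump metadata plus parsed HTML content.
        print(revision.get("id"), len(revision.get("html", "")))
        break                          # just inspect the first revision
```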
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine whether a wiki link was blue or red at a specific time in the past.
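As an illustration of the blue/red-link check, page creation times can be compared against a reference timestamp (a sketch under assumptions: the file is taken to map page titles to ISO 8601 creation timestamps, which may not match the real schema):

```python
import gzip
import json
from datetime import datetime, timezone

# Assumption: the file maps page title -> creation timestamp (ISO 8601 string).
with gzip.open("page_creation_times.json.gz", "rt", encoding="utf-8") as f:
    creation_times = json.load(f)

def link_was_blue(title: str, when: datetime) -> bool:
    """Return True if the page already existed at `when` (blue link), else False (red link)."""
    created = creation_times.get(title)
    if created is None:
        return False  # page never created
    created_dt = datetime.fromisoformat(created.replace("Z", "+00:00"))
    if created_dt.tzinfo is None:
        created_dt = created_dt.replace(tzinfo=timezone.utc)
    return created_dt <= when

print(link_was_blue("Alan Turing", datetime(2010, 1, 1, tzinfo=timezone.utc)))
```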
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Article-level image suggestions (ALIS, read as "Alice") is a distributed computing system that recommends images for Wikipedia articles that don't have one [1]. This publication contains roughly 3,800 human ratings made against ALIS output in multiple Wikipedia language editions.
Evaluation task
Data was collected through an evaluation tool [2], with code available at [3]. Given a language, the user is shown a random Wikipedia article and an image suggested by the system; they are then asked to rate the relevance of the image by clicking either the Good, Okay, Bad, or Unsure button. The user is also asked to judge whether the image is unsuitable for any reason via the It's ok, It's unsuitable, or Unsure buttons.
Content
The archive holds two tab-separated values (TSV) text files:
evaluation_dataset.tsv contains the evaluation data; unillustrated_articles.tsv keeps track of unillustrated Wikipedia articles.
Evaluation dataset headers
id (integer) - identifier used for internal storage
unillustratedArticleId (integer) - identifier of the unillustrated Wikipedia article
resultFilePage (string) - Wikimedia Commons image file name. Prepend https://commons.wikimedia.org/wiki/ to form a valid Commons URL
resultImageUrl (string) - Wikimedia Commons thumbnail URL
source (string) - suggestion source. ms = MediaSearch; ima = ALIS prototype algorithm. See [4] and [5] respectively for more details
confidence_class (string) - shallow degree of suggestion confidence. Either low, medium, or high
rating (integer) - human image relevance rating. 1 = good; 0 = okay; -1 = bad
sensitive (integer) - human image suitability rating. 0 = it's okay; 1 = it's unsuitable; -1 = unsure
viewCount (integer) - number of times the suggestion was seen by evaluators
Example:
7357 1827 File:Cuphea_cyanea_strybing.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Cuphea_cyanea_strybing.jpg/800px-Cuphea_cyanea_strybing.jpg ima high 1 0 1
Unillustrated articles headers
id (integer) - identifier used for internal storage. Maps to unillustratedArticleId in the evaluation data
langCode (string) - Wikipedia language code
pageTitle (string) - Wikipedia article title
unsuitableArticleType (integer) - whether the Wikipedia article is suitable for receiving image suggestions. 0 = suitable; 1 = not suitable
Example:
1827 vi Cuphea_cyanea 0
References
[1] https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Image_Suggestions/Data_Pipeline
[2] https://image-recommendation-test.toolforge.org/
[3] https://github.com/cormacparle/media-search-signal-test/tree/master/public_html
[4] https://www.mediawiki.org/wiki/Help:MediaSearch
[5] https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Image_Suggestions/Data_Pipeline#How_it_works
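Putting the two files together, the evaluation rows can be joined to the article metadata with pandas (a sketch; file names follow the listing above, tab separation and a header row are assumed, and the join key comes from the documented fields):

```python
import pandas as pd

# Both files are tab-separated, per the dataset description.
ratings = pd.read_csv("evaluation_dataset.tsv", sep="\t")
articles = pd.read_csv("unillustrated_articles.tsv", sep="\t")

# Join each rating to its unillustrated article via the documented key.
merged = ratings.merge(articles, left_on="unillustratedArticleId", right_on="id",
                       suffixes=("_rating", "_article"))

# Share of suggestions rated "good" (rating == 1) per suggestion source (ms vs. ima).
good_share = merged.groupby("source")["rating"].apply(lambda r: (r == 1).mean())
print(good_share)
```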
Wikipedia
Source: https://huggingface.co/datasets/wikipedia
Num examples: 1,281,412
Language: Vietnamese
from datasets import load_dataset

# Load the Vietnamese Wikipedia dataset from the Hugging Face Hub.
dataset = load_dataset("tdtunlp/wikipedia_vi")
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using Wikipedia data to study AI ethics.
This dataset was created by Oleh Onyshchak
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a row for every (wiki, user, month) combination, with a count of all 'revisions' saved and a count of those revisions that were 'archived' when the page was deleted. For more information, see https://meta.wikimedia.org/wiki/Research:Monthly_wikimedia_editor_activity_dataset
Fields:
· wiki -- The dbname of the wiki in question ("enwiki" == English Wikipedia, "commonswiki" == Commons)
· month -- YYYYMM
· user_id -- The user's identifier in the local wiki
· user_name -- The user name in the local wiki (from the 'user' table)
· user_registration -- The recorded registration date for the user in the 'user' table
· archived -- The count of deleted revisions saved in this month by this user
· revisions -- The count of all revisions saved in this month by this user (archived or not)
· attached_method -- The method by which this user attached this account to their global account
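For example, monthly active-editor counts per wiki could be derived from such rows (a sketch assuming the data is available as a TSV file with the fields listed above; the file name, delimiter, and the 5-revision threshold are assumptions):

```python
import pandas as pd

# Assumed file name and tab delimiter; the columns follow the field list above.
activity = pd.read_csv("monthly_editor_activity.tsv", sep="\t")

# Count editors with at least 5 saved revisions per wiki and month (example threshold).
active = (
    activity[activity["revisions"] >= 5]
    .groupby(["wiki", "month"])["user_id"]
    .nunique()
    .rename("active_editors")
)
print(active.loc["enwiki"].head())
```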
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits (web traffic) for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.
The original dataset contains missing values, which have simply been replaced by zeros.
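The same zero-filling can be reproduced on the original Kaggle export (a sketch; the wide train_1.csv layout, with a Page column followed by one column per date, is an assumption):

```python
import pandas as pd

# Assumed Kaggle layout: one row per page, a "Page" column, then one column per day.
traffic = pd.read_csv("train_1.csv")

# Replace missing daily view counts with zeros, matching the preprocessing described above.
date_cols = traffic.columns.drop("Page")
traffic[date_cols] = traffic[date_cols].fillna(0).astype(int)

print(traffic.iloc[0, :5])
```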
As of December 2024, global Google search interest in the query "Wiki" stood at an index value of 87, the lowest level recorded so far despite remaining relatively stable throughout the analyzed period. Meanwhile, searches for the full "Wikipedia" query were slightly more popular over the same period.