100+ datasets found

h
WikipediaUpdated
huggingface.co
Updated May 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jojo jenkins (2023). WikipediaUpdated [Dataset]. https://huggingface.co/datasets/luciferxf/WikipediaUpdated
Explore at:
Dataset updated
May 4, 2023
Authors
jojo jenkins
Description
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
T
wikipedia
tensorflow.org
huggingface.co
Updated Aug 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2019). wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
Explore at:
Dataset updated
Aug 9, 2019
Description
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikipedia', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
h
wikipedia-summary-dataset
huggingface.co
Updated Feb 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jordan Clive (2023). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
Explore at:
Dataset updated
Feb 15, 2023
Authors
Jordan Clive
Description
Dataset Summary

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
P
French Wikipedia Dataset
paperswithcode.com
opendatalab.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot, French Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/french-wikipedia
Explore at:
Authors
Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot
Area covered
French
Description
French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps

Extended Wikipedia Multimodal Dataset

kaggle.com

zip

Updated Apr 4, 2020

Facebook

Twitter

Click to copy link

Link copied

Cite

Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. https://www.kaggle.com/datasets/jacksoncrow/extended-wikipedia-multimodal-dataset

Explore at:

zip(977856346 bytes)Available download formats

Dataset updated

Apr 4, 2020

Authors

Oleh Onyshchak

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Wikipedia Featured Articles multimodal dataset

Overview

This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
Its superset of good articles is also hosted on Kaggle. It has six times more entries although with a little worse quality.

It contains the text of an article and also all the images from that article along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus it's the best theoretical quality human editors on Wikipedia can offer.

You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

Dataset structure

The high-level structure of the dataset is as follows:

.
+-- page1 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
+-- page2 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
: 
+-- pageN 
|  +-- text.json 
|  +-- img 
|    +-- meta.json

label	description
pageN	is the title of N-th Wikipedia page and contains all information about the page
text.json	text of the page saved as JSON. Please refer to the details of JSON schema below.
meta.json	a collection of all images of the page. Please refer to the details of JSON schema below.
imageN	is the N-th image of an article, saved in `jpg` format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title.

text.JSON Schema

Below you see an example of how data is stored:

{
 "title": "Naval Battle of Guadalcanal",
 "id": 405411,
 "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
 "html": "...

...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }

key	description
title	page title
id	unique page id
url	url of a page on Wikipedia
html	HTML content of the article
wikitext	wikitext content of the article

Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

meta.JSON Schema

{
 "img_meta": [
  {
   "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
   "title": "IronbottomSound.jpg",
   "parsed_title": "ironbottom sound",
   "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
   "is_icon": False,
   "on_commons": True,
   "description": "A U.S. destroyer steams up what later became known as ...",
   "caption": "Ironbottom Sound. The majority of the warship surface ...",
   "headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
   "features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
   },
   ...
  ]
}

key	description
filename	unique image id, md5 hashcode of original image title
title	image title retrieved from Commons, if applicable
parsed_title	image title split into words, i.e. "helloWorld.jpg" -> "hello world"
url	url of an image on Wikipedia
is_icon	True if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it
on_commons	True if image is available from Wikimedia Commons dataset
description	description of an image parsed from Wikimedia Commons page, if available
caption	caption of an image parsed from Wikipedia article, if available
headings	list of all nested headings of location where article is placed in Wikipedia article. The first element is top-most heading
features	output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in `jpeg` format with fixed width of 600px. Practically, it is a list of floats with len = 2048

Collection method

Data was collected by fetching featured articles text&image content with pywikibot library and then parsing out a lot of additional metadata from HTML pages from Wikipedia and Commons.

P
Wiki-en Dataset
paperswithcode.com
opendatalab.com
Updated Jul 25, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yumo Xu; Mirella Lapata (2019). Wiki-en Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-en
Explore at:
Dataset updated
Jul 25, 2019
Authors
Yumo Xu; Mirella Lapata
Description
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Data from: Wikipedia Citations: A comprehensive dataset of citations with...
zenodo.org
zip
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harshdeep Singh; Robert West; Giovanni Colavizza; Harshdeep Singh; Robert West; Giovanni Colavizza (2020). Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.3940692
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3940692
Dataset updated
Nov 12, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Harshdeep Singh; Robert West; Giovanni Colavizza; Harshdeep Singh; Robert West; Giovanni Colavizza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is composed of 3 parts:

1. The dataset of 29.276 million citations from 35 different citation templates, out of which 3.92 million citations already contained identifiers, and approximately 260,752 citations were equipped with identifiers from Crossref. This is under the filename: citations_from_wikipedia.zip

2. A minimal dataset containing a few of the columns from the citations from Wikipedia dataset. These columns are as follows: 'type_of_citation', 'page_title', 'Title', 'ID_list', metadata_file', 'updated_identifier'. This is under the filename: minimal_dataset.zip. The 'metadata_file' column can be used to refer to the metadata collected from CrossRef and page title, the title of the citation can be used to refer to the 'citations_from_wikipedia.zip' dataset and get more information for a particular citation (such as author, periodical, chapter).

3. Citations classified as a journal and their corresponding metadata/identifier extracted from Crossref to make the dataset more complete. This is under the filename: lookup_data.zip. This zip file contains a CSV file: lookup_table.gzip (a parquet file containing all citations classified as a journal) and a folder: metadata_extracted (a folder containing the metadata from CrossRef for all the citations mentioned in the table)

The data was parsed from the Wikipedia XML content dumps published in May 2020.

The source code to extract and getting used to the pipeline can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki

The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset
f
Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...
figshare.com
txt
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14113799.v4
Dataset updated
Mar 14, 2021
Dataset provided by
figshare
Authors
KayYen Wong; Diego Saez-Trumper; Miriam Redi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Wiki-Reliability: Machine Learning datasets for measuring content reliability on WikipediaConsists of metadata features and content text datasets, with the formats:- {template_name}_features.csv - {template_name}_difftxt.csv.gz - {template_name}_fulltxt.csv.gz For more details on the project, dataset schema, and links to data usage and benchmarking:https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
P
Wikipedia Person and Animal Dataset Dataset
paperswithcode.com
Updated Nov 27, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qingyun Wang; Xiaoman Pan; Lifu Huang; Boliang Zhang; Zhiying Jiang; Heng Ji; Kevin Knight (2021). Wikipedia Person and Animal Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/wikipedia-person-and-animal-dataset
Explore at:
Dataset updated
Nov 27, 2021
Authors
Qingyun Wang; Xiaoman Pan; Lifu Huang; Boliang Zhang; Zhiying Jiang; Heng Ji; Kevin Knight
Description
This dataset gathers 428,748 person and 12,236 animal infobox with descriptions based on Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).
c
Plaintext Wikipedia dump 2018
lindat.mff.cuni.cz
live.european-language-grid.eu
Updated Feb 25, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rudolf Rosa (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735
Explore at:
Dataset updated
Feb 25, 2018
Authors
Rudolf Rosa
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast). Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Wikipedia Article Networks
kaggle.com
Updated Nov 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrea Garritano (2019). Wikipedia Article Networks [Dataset]. https://www.kaggle.com/datasets/andreagarritano/wikipedia-article-networks
Explore at:
Dataset updated
Nov 12, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Andrea Garritano
Description
Wikipedia Article Networks

Description

The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes an edges with some other descriptive statistics.

Properties

Directed: No.

Node features: Yes.

Edge features: No.

Node labels: Yes. Continuous target.

Temporal: No.

| Dataset | Chameleon | Crocodile | Squirrel |

| Nodes |2,277 | 11,631 | 5,201 |

| Edges | 31,421 |170,918 | 198,493 |

| Density | 0.012 | 0.003 | 0.015 |

| Transitvity | 0.314| 0.026 | 0.348 |

Possible Tasks

Regression

Link prediction

Community detection

Network visualization

Paper: Multi-scale Attributed Node Embedding. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. arXiv, 2019. https://arxiv.org/abs/1909.13021
T
wiki40b
tensorflow.org
opendatalab.com
+1more
Updated Aug 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
Explore at:
Dataset updated
Aug 30, 2023
Description
Clean-up text for 40+ Wikipedia languages editions of pages correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity, and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models, and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wiki40b', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
Wikipedia Talk Labels: Personal Attacks
figshare.com
txt
Updated Feb 22, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Personal Attacks [Dataset]. http://doi.org/10.6084/m9.figshare.4054689.v6
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4054689.v6
Dataset updated
Feb 22, 2017
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Ellery Wulczyn; Nithum Thain; Lucas Dixon
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
Kensho Derived Wikimedia Dataset
kaggle.com
Updated Jan 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data/activity
Explore at:
Dataset updated
Jan 31, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kensho R&D
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Kensho Derived Wikimedia Dataset

Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F972e4157b97efe8c2c5ea17c983b1504%2Fkdwd_header_logos_2.jpg?generation=1580510520532141&alt=media" alt="">

This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.

Example Notebooks

Dive right in by checking out some of our example notebooks:

Introduction to the KDWD Wikipedia sample

Introduction to the KDWD Wikidata sample

Entity aliases and disambiguation candidates from anchor link statistics

Updates / Changelog

initial release 2020-01-31

File Summary

Wikipedia

page.csv (page metadata and Wikipedia-to-Wikidata mapping)

link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)

Wikidata

item.csv (item labels and descriptions in English)

item_aliases.csv (item aliases in English)

property.csv (property labels and descriptions in English)

property_aliases.csv (property aliases in English)

statements.csv (truthy qpq statements)

Three Layers of Data

The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F19663d43bade0e92f578255f6e0d9dcd%2Fkensho_wiki_triple_layer.svg?generation=1580347573004185&alt=media" alt="">

Wikipedia Sample

The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

are in the (Main/Article) namespace

are not redirect pages

are not disambiguation pages

are not list pages

have an associated Wikidata item (see Wikidata sample below)

are not a Wikimedia internal item (see Wikidata sample below)

From these pages we construct a corpus of link annotated text. We store this data in a single JSON Lines file with one page per line. Each page object has the following format:

page = { "page_id": 12, # wikipedia page id of annotated page "sections": [...] # list of section objects } section = { "name": "Introduction", # section header "text": "Anarchism is an ...", # plaintext of section "link_offsets": [16, 35, 49, ...], # list of anchor text offsets "link_lengths": [18, 9, 17, ...], # list of anchor text lengths "target_page_ids": [867979, 23040, 586276, ...] # list of link target page ids }

The text attribute of each section object contains our parse of the section’s wikitext markup into plaintext. Text spans that represent links are identified via the attributes link_offsets, link_lengths, and target_page_ids.

Wikidata Sample

The second part of the KDWD is derived from Wikidata. Because more people are familiar with Wikipedia than Wikidata, we provide more background here than in the previous section. Wikidata provides centralized storage of structured data for all Wikimedia projects. The core Wikidata concepts are items, properties, and statements.

In Wikidata, items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, the "1988 Summer Olympics", "love", "Elvis Presley", and "gorilla" are all items in Wikidata.
-- https://www.wikidata.org/wiki/Help:Items

A property describes the data value of a statement and can be thought of as a category of data, for example "color" for the data value "blue".
-- https://www.wikidata.org/wiki/Help:Properties

A statement is how the information we know about an item - the data we have about it - gets recorded in Wikidata. This happens by pairing a property with at least one data value
-- https://www.wikidata.org/wiki/Help:Statements

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F1c39f09bcbce766b7cef415a102be567%2Fkdwd_wikidata_image_nologo.jpg?generation=1579814918394405&alt=media" alt="">

The image above shows several statements from the Wikidata item for Grace Hopper. We can think about these statements as triples with the form (item, property, data value).

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F0a9c3aa58860f298db8b0e1f5952030e%2Fkdwd_statement_table.png?generation=1579214465126860&alt=media" alt="">

In the first statement (Grace Hopper, date of birth, 9 December 1906) the data value represents a time. However, data values can have several different types (e.g., time, string, globecoordinate, item, …). If the data value in a statement triple is a Wikidata item, we call it a qpq-statement (note that each item has a unique i.d. beginning with Q and each property has a unique i.d. beginning with P). We can think of qpq-statements as triples of the form (source item, property, target_item). The qpq-statements in the image above are:

(Grace Hopper, instance of, human)

(Grace Hopper, educated at, Vassar College)

(Grace Hopper, educated at, Yale University)

In order to construct a compact Wikidata sample that is relevant to our Wikipedia sample, we start with all statements in Wikidata and filter down to those that:

have a data value that is a Wikidata item (i.e., qpq-statements)

have a source item associated with a Wikipedia page from our Wikipedia sample

are
f
Wikipedia Article Topics for All Languages (based on article outlinks)
figshare.com
bz2
Updated Jul 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12619766.v3
Dataset updated
Jul 20, 2021
Dataset provided by
figshare
Authors
Isaac Johnson
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021) but earlier/future versions may be for other snapshots as indicated by the filename.The data is bzip-compressed and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of these topics applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy* wiki_db: which Wikipedia language edition the article belongs too -- e.g., enwiki == English Wikipedia* qid: if the article has a Wikidata item, what ID is it -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace 0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performanceAdditional, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID so if e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
Arabic Wiki data Dump 2018
kaggle.com
zip
Updated Feb 6, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abed Khooli (2018). Arabic Wiki data Dump 2018 [Dataset]. https://www.kaggle.com/datasets/abedkhooli/arabic-wiki-data-dump-2018
Explore at:
zip(694547311 bytes)Available download formats
Dataset updated
Feb 6, 2018
Authors
Abed Khooli
Description
Context

Arabic is a rich and major world language. Recent advances in computational linguistics and AI can be applied to Arabic but not in the generic way most languages are treated. This dataset (Arabic articles from Wikipedia) will be used to train Word2Vec and compare performance with publicly available pre-trained model from FastText (Facebook) in a generic way. A related model is now available: https://www.kaggle.com/abedkhooli/arabic-ulmfit-model

Content

All Wikipedia Arabic articles from the January 20, 2018 data dump (compressed) in wikimedia format. Cntent is expected to be (mostly) in modern standard Arabic

Acknowledgements

Thanks to Wikipedia for making public data dumps available and for Facebook for releasing pre-trained models.

Inspiration

The challenges (opportunities) here are mostly in the pre-processing of the token and text normalization, plus hyper parameter tuning for different purposes. It is easy to isolate Arabic tokens (many articles have non-Arabic words), but tokenization is a challenge: how to treat accented (7arakaat or tashkeel) and non-accented word forms, same word form with different meanings, suffixes and prefixes (especially w).
Bangla Wikipedia dataset
kaggle.com
data.mendeley.com
+1more
zip
Updated Jan 20, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SADMAN ARAF (2021). Bangla Wikipedia dataset [Dataset]. https://www.kaggle.com/sadmanaraf/bangla-wikipedia-dataset
Explore at:
zip(74663684 bytes)Available download formats
Dataset updated
Jan 20, 2021
Authors
SADMAN ARAF
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

A subset of the Bangla version of the Wikipedia text. To create the Wikipedia dataset, we collected the Bangla wiki-dump of 10th June, 2019. The files are then merged and each article is selected as a sample text. All HTML tags were removed and the title of the page was stripped from the beginning of the text. This dataset contains 70377 samples with a total number of words being 18229481. The entire dataset has 1289249 unique words, which is 7% of the total vocabulary.

Content

Each text is represented by an id. The data is found in the wiki.csv

wiki.csv contains - id - text - title - url

Acknowledgement

The original dataset is found here here. To acknowledge use of the dataset in publications, please cite the data: Khatun, Aisha; Rahman, Anisur; Islam, Md Saiful (2020), “Bangla Wikipedia dataset”, Mendeley Data, V4, doi: 10.17632/3ph3n78fp7.4

http://dx.doi.org/10.17632/3ph3n78fp7.4

Inspiration

Research on bangla natural language can be conducted
l
ner-wikipedia-dataset
hf-mirror.llyke.com
huggingface.co
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
大規模言語モデル入門 (2023). ner-wikipedia-dataset [Dataset]. https://hf-mirror.llyke.com/datasets/llm-book/ner-wikipedia-dataset
Explore at:
Dataset updated
Jul 25, 2023
Dataset authored and provided by
大規模言語モデル入門
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for llm-book/ner-wikipedia-dataset

書籍『大規模言語モデル入門』で使用する、ストックマーク株式会社により作成された「Wikipediaを用いた日本語の固有表現抽出データセット」(Version 2.0)です。 Githubリポジトリstockmarkteam/ner-wikipedia-datasetで公開されているデータセットを利用しています。

Citation

@inproceedings{omi-2021-wikipedia, title = "Wikipediaを用いた日本語の固有表現抽出のデータセットの構築", author = "近江崇宏", booktitle = "言語処理学会第27回年次大会", year = "2021", url = "https://anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P2-7.pdf", }… See the full description on the dataset page: https://hf-mirror.com/datasets/llm-book/ner-wikipedia-dataset.
f
Wikipedia Talk Labels: Aggression
figshare.com
txt
Updated Feb 22, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Aggression [Dataset]. http://doi.org/10.6084/m9.figshare.4267550.v5
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4267550.v5
Dataset updated
Feb 22, 2017
Dataset provided by
figshare
Authors
Ellery Wulczyn; Nithum Thain; Lucas Dixon
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
T
wikipedia_toxicity_subtypes
tensorflow.org
Updated Dec 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
Explore at:
Dataset updated
Dec 6, 2022
Description
The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikipedia_toxicity_subtypes', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

Facebook

Twitter

Click to copy link

Link copied

Cite

jojo jenkins (2023). WikipediaUpdated [Dataset]. https://huggingface.co/datasets/luciferxf/WikipediaUpdated

WikipediaUpdated

luciferxf/WikipediaUpdated

Explore at:

Dataset updated

May 4, 2023

Authors

jojo jenkins

Description

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

Clear search

Close search

Google apps

Main menu

WikipediaUpdated

wikipedia

wikipedia-summary-dataset

French Wikipedia Dataset

Extended Wikipedia Multimodal Dataset

Wikipedia Featured Articles multimodal dataset

Overview

Dataset structure

text.JSON Schema

meta.JSON Schema

Collection method

Wiki-en Dataset

Data from: Wikipedia Citations: A comprehensive dataset of citations with...

Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

Wikipedia Person and Animal Dataset Dataset

Plaintext Wikipedia dump 2018

Wikipedia Article Networks

Wikipedia Article Networks

Description

Properties

Possible Tasks

wiki40b

Wikipedia Talk Labels: Personal Attacks

Kensho Derived Wikimedia Dataset

Kensho Derived Wikimedia Dataset

Example Notebooks

Updates / Changelog

File Summary

Three Layers of Data

Wikipedia Sample

Wikidata Sample

Wikipedia Article Topics for All Languages (based on article outlinks)

Arabic Wiki data Dump 2018

Context

Content

Acknowledgements

Inspiration

Bangla Wikipedia dataset

Context

Content

Acknowledgement

Inspiration

ner-wikipedia-dataset

Wikipedia Talk Labels: Aggression

wikipedia_toxicity_subtypes

WikipediaUpdatedSee More Versions

luciferxf/WikipediaUpdated

WikipediaUpdated