52 datasets found

e
Plaintext Wikipedia dump 2018 - Dataset - B2FIND
b2find.eudat.eu
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Plaintext Wikipedia dump 2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3074cb26-6a0d-5803-8520-d0050a22c66e
Explore at:
Description
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast). Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
E
Plaintext Wikipedia dump 2018
live.european-language-grid.eu
binary format
Updated Feb 24, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
Explore at:
binary formatAvailable download formats
Dataset updated
Feb 24, 2018
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast).
Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
h
simple-wikipedia
huggingface.co
Updated Aug 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 17, 2023
Authors
Rahul Aralikatte
Description
simple-wikipedia

Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
Z
A Wikipedia dataset of 5 categories
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maitre, Julien (2020). A Wikipedia dataset of 5 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3260045
Explore at:
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Maitre, Julien
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A subset of articles extracted from the French Wikipedia XML dump. Data published here include 5 different categories : Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is a xml file extracted from the dump and save as UTF8 plain text. The characteristics of dataset is :

Economy : 44'876 articles

History : 92'041 articles

Informatics : 25'408 articles

Health : 22'143 articles

Law : 9'964 articles
wikipedia-22-12-simple-embeddings
huggingface.co
opendatalab.com
Updated Mar 29, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 29, 2023
Dataset authored and provided by
Coherehttps://cohere.com/
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

Embeddings

We compute for title+" "+text the embeddings using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
h
thai_wikipedia_clean_20230101
huggingface.co
Updated Jan 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PyThaiNLP (2023). thai_wikipedia_clean_20230101 [Dataset]. https://huggingface.co/datasets/pythainlp/thai_wikipedia_clean_20230101
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 1, 2023
Dataset authored and provided by
PyThaiNLP
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for "thai_wikipedia_clean_20230101"

More Information needed Thai Wikipedia Database dumps to plain text for NLP work. This dataset was dump on 1 January 2023 from Thai wikipedia.

GitHub: PyThaiNLP / ThaiWiki-clean Notebook for upload to HF: https://github.com/PyThaiNLP/ThaiWiki-clean/blob/main/thai_wikipedia_clean_20230101_hf.ipynb
Wikipedia Plaintext (2023-07-01) cut
kaggle.com
Updated Sep 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kɔuq Wang (2023). Wikipedia Plaintext (2023-07-01) cut [Dataset]. https://www.kaggle.com/datasets/gmhost/wikipedia-plaintext-2023-07-01-cut
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 3, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kɔuq Wang
Description
Dataset

This dataset was created by kwang

Contents
E
External References of English Wikipedia (ref-wiki-en)
live.european-language-grid.eu
data.niaid.nih.gov
txt
Updated Mar 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). External References of English Wikipedia (ref-wiki-en) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7625
Explore at:
txtAvailable download formats
Dataset updated
Mar 27, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:
32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.These URLs were filtered to remove file extensions for unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs. The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then randomly chooses amongst the URLs for each host.The content of these webpages were indexed in Apache Solr by Nutch. From Solr we extracted a JSON dump of the content.Many URLs offer a redirect; unfortunately Nutch does not index redirect information. This means that connecting the Wikipedia article (with the pre-direct link) to the downloaded webpage (at the post-redirect link) was complicated. However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) from their original Wikipedia article(s).We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.
The ref-en-wiki corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We thus also collect a smaller complete corpus for the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1000 items. We then scraped the external reference URLs for the Wikipedia article corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:
tstamp: When the webpage was accessedhost: The domain (FQDN post-redirect) from which the webpage was retrieved.title: The title (meta) of the documenturl: The URL (post-redirect) of the webpageQ: The Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage.content: A plain-text encoding of the content of the webpage.
Below we provide an abbreviated example of a line from the corpus:{""tstamp"":""2019-09-26T01:22:43.621Z"",""host"":""geology.isu.edu"",""title"":""Digital Geology of Idaho - Basin And Range"",""url"":""http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm"",""Q"":[810178],""content"":""Digital Geology of Idaho - Basin And Range 1 - Idaho Basement Rock 2 - Belt Supergroup 3 - Rifting & Passive Margin 4 - Accreted Terranes 5 - Thrust Belt 6 - Idaho Batholith 7 - North Idaho & Mining 8 - Challis Volcanics 9 - Basin and Range 10 - Columbia River Basalts 11 - SRP & Yellowstone 12 - Pleistocene Glaciation 13 - Palouse & Lake Missoula 14 - Lake Bonneville Flood 15 - Snake River Plain Aquifer Basin and Range Province - Teritiary Extension General geology of the Basin and Range Province Mechanisms of Basin and Range faulting Idaho Basin and Range south of the Snake River Plain Idaho Basin and Range north of the Snake River Plain Local areas of active and recent Basin & Range faulting: Borah Peak PDF Slideshows: North of SRP , South of SRP , Borah Earthquake Flythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River Vocabulary Words thrust fault Basin and Range Snake River Plain half-graben transfer zone Fly-throughs General geology of the Basin and Range Province The Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ...""},A summary of the files we make available:
ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
Further details can be found in the publication:
Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.
Further material relating to this publication (including code for a proof-of-concept interface) is also available.

Extended Wikipedia Multimodal Dataset

kaggle.com

Updated Apr 4, 2020

Facebook

Twitter

Click to copy link

Link copied

Cite

Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/dsv/1058023

Dataset updated

Apr 4, 2020

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Oleh Onyshchak

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Wikipedia Featured Articles multimodal dataset

Overview

This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
Its superset of good articles is also hosted on Kaggle. It has six times more entries although with a little worse quality.

It contains the text of an article and also all the images from that article along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus it's the best theoretical quality human editors on Wikipedia can offer.

You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

Dataset structure

The high-level structure of the dataset is as follows:

.
+-- page1 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
+-- page2 
|  +-- text.json 
|  +-- img 
|    +-- meta.json
: 
+-- pageN 
|  +-- text.json 
|  +-- img 
|    +-- meta.json

label	description
pageN	is the title of N-th Wikipedia page and contains all information about the page
text.json	text of the page saved as JSON. Please refer to the details of JSON schema below.
meta.json	a collection of all images of the page. Please refer to the details of JSON schema below.
imageN	is the N-th image of an article, saved in `jpg` format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title.

text.JSON Schema

Below you see an example of how data is stored:

{
 "title": "Naval Battle of Guadalcanal",
 "id": 405411,
 "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
 "html": "...

...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }

key	description
title	page title
id	unique page id
url	url of a page on Wikipedia
html	HTML content of the article
wikitext	wikitext content of the article

Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

meta.JSON Schema

{
 "img_meta": [
  {
   "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
   "title": "IronbottomSound.jpg",
   "parsed_title": "ironbottom sound",
   "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
   "is_icon": False,
   "on_commons": True,
   "description": "A U.S. destroyer steams up what later became known as ...",
   "caption": "Ironbottom Sound. The majority of the warship surface ...",
   "headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
   "features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
   },
   ...
  ]
}

key	description
filename	unique image id, md5 hashcode of original image title
title	image title retrieved from Commons, if applicable
parsed_title	image title split into words, i.e. "helloWorld.jpg" -> "hello world"
url	url of an image on Wikipedia
is_icon	True if image is an icon, e.g. category icon. We assume that image is an icon if you cannot load a preview on Wikipedia after clicking on it
on_commons	True if image is available from Wikimedia Commons dataset
description	description of an image parsed from Wikimedia Commons page, if available
caption	caption of an image parsed from Wikipedia article, if available
headings	list of all nested headings of location where article is placed in Wikipedia article. The first element is top-most heading
features	output of 5-th convolutional layer of ResNet152 trained on ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape (2048,). Features taken from original images downloaded in `jpeg` format with fixed width of 600px. Practically, it is a list of floats with len = 2048

Collection method

Data was collected by fetching featured articles text&image content with pywikibot library and then parsing out a lot of additional metadata from HTML pages from Wikipedia and Commons.

Kaggle Data Collection Notebook...

Z
Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yu Su (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
You Wu
Alyssa Lees
Yu Su
Cong Yu
Huan Sun
Xiang Deng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }
A word2vec model file built from the French Wikipedia XML Dump using gensim....
zenodo.org
data.niaid.nih.gov
bin, text/x-python
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christof Schöch; Christof Schöch (2020). A word2vec model file built from the French Wikipedia XML Dump using gensim. [Dataset]. http://doi.org/10.5281/zenodo.162792
Explore at:
bin, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.162792
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Christof Schöch; Christof Schöch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
A word2vec model file built from the French Wikipedia XML dump using gensim. The data published here includes three model files (you need all three of them in the same folder) as well as the Python script used to build the model (for documentation). The Wikipedia dump was downloaded on October 7, 2016 from https://dumps.wikimedia.org/. Before building the model, plain text was extracted from the dump. The size of that dataset is about 500 million words or 3.6 GB of plain text. The principal parameters for building the model were the following: no lemmatization was performed, tokenization was done using the "\W" regular expression (any non-word character splits tokens), and the model was built with 500 dimensions.
E
Pairwise Multi-Class Document Classification for Semantic Relations between...
live.european-language-grid.eu
csv
Updated Apr 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
Explore at:
csvAvailable download formats
Dataset updated
Apr 15, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93,
which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
Datasets (articles, class labels, cross-validation splits)
Pretrained models (Transformers, GloVe, Doc2vec)
Model output (prediction) for the best performing models
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as XML dump, and the corresponding articles were extracted as plain-text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.
The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.
├── 1 │ ├── test.csv │ └── train.csv ├── 2 │ ├── test.csv │ └── train.csv ├── 3 │ ├── test.csv │ └── train.csv └── 4 ├── test.csv └── train.csv
4 directories, 8 files
e
Wikipedia Text Segmentation - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 4, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2012). Wikipedia Text Segmentation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3f590262-3ea6-5cc8-8e85-94e8a2c402f3
Explore at:
Dataset updated
Apr 4, 2012
Description
For corpus generation, we extracted top-level sections of featured articles and concatenated their textual contents to a pure-text corpus file. The content of a section is constituted by the concatenation of the text of its paragraph elements and the content of contained sections. Particularly, other elements such as tables and image captions are ignored during generating the text for a section because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections with one of the titles See also'',References'', and ``External links'' are skipped as they do not contain information where segmentation makes sense.
WikiReaD (Wikipedia Readability Dataset)
zenodo.org
bz2
Updated May 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mykola Trokhymovych; Indira Sen; Martin Gerlach; Mykola Trokhymovych; Indira Sen; Martin Gerlach (2025). WikiReaD (Wikipedia Readability Dataset) [Dataset]. http://doi.org/10.5281/zenodo.11371932
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.11371932
Dataset updated
May 22, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mykola Trokhymovych; Indira Sen; Martin Gerlach; Mykola Trokhymovych; Indira Sen; Martin Gerlach
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Description:

The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).

Dataset Details:

Number of Languages: 14

Number of files: 19

Use Case: Training and evaluating readability scoring models for articles within and outside Wikipedia.

Processing details: Text pairs are created by matching articles from Wikipedia with the corresponding article in the simplified/children encyclopedia either via the Wikidata item ID or their page titles. The text of each article is extracted directly from their parsed HTML version.

Files: The dataset consists of independent files for each type of children/simplified encyclopedia and each language (e.g., `

Attribution:

The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.

Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page (

Wikipedia

Source: https://

License: CC BY-SA 4.0, GFDL

Simple English Wikipedia

Source: https://simple.wikipedia.org/wiki/

License: CC BY-SA 4.0, GFDL

Vikidia

Source: https://

License: CC BY-SA 3.0, GFDL

Klexikon

Source: https://klexikon.zum.de/wiki/

License: CC BY-SA 4.0

Txikipedia

Source: https://eu.wikipedia.org/wiki/Txikipedia:

License: CC BY-SA 4.0, GFDL

Wikikids

Source: https://wikikids.nl/

License: CC BY-SA 3.0

Related paper citation:

@inproceedings{trokhymovych-etal-2024-open, title = "An Open Multilingual System for Scoring Readability of {W}ikipedia", author = "Trokhymovych, Mykola and Sen, Indira and Gerlach, Martin", editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek", booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = aug, year = "2024", address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.acl-long.342/", doi = "10.18653/v1/2024.acl-long.342", pages = "6296--6311" }
h
barwiki-dumps
huggingface.co
Updated Aug 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bavarian NLP (2025). barwiki-dumps [Dataset]. https://huggingface.co/datasets/bavarian-nlp/barwiki-dumps
Explore at:
Dataset updated
Aug 9, 2025
Dataset authored and provided by
Bavarian NLP
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
🥨 Bavarian Wikipedia Dumps

This repo hosts backups of the Bavarian Wikipedia Dumps. More precisely, various *-pages-articles.xml.bz2 dumps are hosted here, that include articles, templates, media/file descriptions, and primary meta-pages. These dumps can be used to e.g. construct a plaintext Wikipedia dump using wikiextractor. Recent dumps will be added to this repo on a regular basis.
Citations with contexts in Wikipedia
figshare.com
html
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aaron Halfaker; Meen Chul Kim; Andrea Forte; Dario Taraborelli (2023). Citations with contexts in Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5588842.v1
Explore at:
htmlAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5588842.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Aaron Halfaker; Meen Chul Kim; Andrea Forte; Dario Taraborelli
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset represents structured metadata and contextual information about references added to Wikipedia articles in a JSON format. Each record represents an individual Wikipedia article revision with all the tags parsed, as stored in Wikipedia's XML dumps, including information about: 1) the context(s) in which the reference occurs within the article – such as the surrounding text, parent section title, and section level – 2) structured data and bibliographic metadata included within the reference itself (such as: any citation template used, external links, any known persistent identifiers) 3) additional data/metadata about the reference itself (the reference name, its raw content, and if applicable, revision ID associated with reference addition/deletion/change)The data is available as a set of compressed JSON files, extracted from the July 1, 2017 XML dump of English Wikipedia. Other languages may be added to this dataset in the future.The JSON schema and Python parsing libraries used to generate the data are in the references.
E
Data from: W2C – Web to Corpus – tool
live.european-language-grid.eu
Updated Dec 19, 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2011). W2C – Web to Corpus – tool [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/18153
Explore at:
Dataset updated
Dec 19, 2011
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
A tool used to build multilingual corpora from wikipedia. Download the web pages, convert them to plain text, identify language, etc.

A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
z
Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories
zenodo.org
explore.openaire.eu
bin, zip
Updated Aug 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
José Frederico Rodrigues; Carla Teixeira Lopes; Carla Teixeira Lopes; Henrique Lopes Cardoso; José Frederico Rodrigues; Henrique Lopes Cardoso (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Dataset]. http://doi.org/10.25747/4vc9-zs43
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.25747/4vc9-zs43
Dataset updated
Aug 9, 2024
Dataset provided by
INESC TEC
Authors
José Frederico Rodrigues; Carla Teixeira Lopes; Carla Teixeira Lopes; Henrique Lopes Cardoso; José Frederico Rodrigues; Henrique Lopes Cardoso
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset (categorized_dataset folder) contains 9 files in .csv format, each a collection of 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/) for a given category. Included categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science and Time. This dataset was created to understand how effective an open-source large language model (Llama3) is in assessing the readability of texts and simplifying text across multiple domains. The dataset was collected using Wikipedia API.
Wikipedia English Official Offline Edition 2014-07-07
academictorrents.com
bittorrent
Updated Aug 5, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wikipedia (2014). Wikipedia English Official Offline Edition 2014-07-07 [Dataset]. https://academictorrents.com/details/e18b8cce7d9cb2726f5f40dcb857111ec573cad4
Explore at:
bittorrent(11031162019)Available download formats
Dataset updated
Aug 5, 2014
Dataset authored and provided by
Wikipedia//www.wikipedia.org/
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Description
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
A Comprehensive Dataset of Classified Citations with Identifiers from...
zenodo.org
zip
Updated Jul 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza (2023). A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023) [Dataset]. http://doi.org/10.5281/zenodo.8107239
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8107239
Dataset updated
Jul 5, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset of 40.664.485 citations extracted from English Wikipedia February 2023 dump (https://dumps.wikimedia.org/enwiki/20230220/).

Version 1: en_citations.zip is a dataset of extracted citations

Version 2: en_final.zip is the same dataset with classified citations augmented with identifiers

The fields are as follows:

type_of_citation - Wikipedia template type used to define the citation, e.g., 'cite journal', 'cite news', etc.

page_title - title of the Wikipedia article from which the citation was extracted.

Title - source title, e.g., title of the book, newspaper article, etc.

URL - link to the source, e.g., webpage where news article was published, description of the book at the publisher's website, online library webpage, etc.

tld - top link domain extracted from the URL, e.g., 'bbc' for https://www.bbc.co.uk/...

Authors - list of article or book authors, if available.

ID_list - list of publication identifiers mentioned in the citation, e.g., DOI, ISBN, etc.

citations - citation text as used in Wikipedia code

actual_label - 'book', 'journal', 'news', or 'other' label assigned based on the analysis of citation identifiers or top link domain.

acquired_ID_list - identifiers located via Google Books and Crossref APIs for citations which are likely to refer to books or journals, i.e., defined using 'cite book', 'cite journal', 'cite encyclopedia', and 'cite proceedings' templates.

The total number of news: 9.926.598

The total number of books: 2.994.601

The total number of journals: 2.052.172

Augmented with IDs via lookup 929.601 (out of 2.445.913 book, journal, encyclopedia, and proceedings template citations not classified as books or journals via given identifiers).

The source code to extract citations can be found here: https://github.com/albatros13/wikicite.

The code is a fork of the earlier project on Wikipedia citation extraction: https://github.com/Harshdeep1996/cite-classifications-wiki.

Facebook

Twitter

Click to copy link

Link copied

Cite

Plaintext Wikipedia dump 2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3074cb26-6a0d-5803-8520-d0050a22c66e

Plaintext Wikipedia dump 2018 - Dataset - B2FIND

Explore at:

Description

Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast). Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

Clear search

Close search

Google apps

Main menu

Plaintext Wikipedia dump 2018 - Dataset - B2FIND

Plaintext Wikipedia dump 2018

simple-wikipedia

A Wikipedia dataset of 5 categories

wikipedia-22-12-simple-embeddings

thai_wikipedia_clean_20230101

Wikipedia Plaintext (2023-07-01) cut

Dataset

Contents

External References of English Wikipedia (ref-wiki-en)

Extended Wikipedia Multimodal Dataset

Wikipedia Featured Articles multimodal dataset

Overview

Dataset structure

text.JSON Schema

meta.JSON Schema

Collection method

Sentence/Table Pair Data from Wikipedia for Pre-training with...

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

A word2vec model file built from the French Wikipedia XML Dump using gensim....

Pairwise Multi-Class Document Classification for Semantic Relations between...

Wikipedia Text Segmentation - Dataset - B2FIND

WikiReaD (Wikipedia Readability Dataset)

barwiki-dumps

Citations with contexts in Wikipedia

Data from: W2C – Web to Corpus – tool

Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

Wikipedia English Official Offline Edition 2014-07-07

A Comprehensive Dataset of Classified Citations with Identifiers from...

Plaintext Wikipedia dump 2018 - Dataset - B2FIND