25 datasets found

Wikipedia Corpus (2023-03-01)
kaggle.com
zip
Updated Jan 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marcell Emmer (2024). Wikipedia Corpus (2023-03-01) [Dataset]. https://www.kaggle.com/datasets/emmermarcell/wikipedia-corpus-2023-03-01
Explore at:
zip(7253680490 bytes)Available download formats
Dataset updated
Jan 24, 2024
Authors
Marcell Emmer
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
A corpus created from the Hugging Face Wikipedia dataset (https://huggingface.co/datasets/wikipedia). The preprocessing and the creation of this corpus are done using the text_to_sentences method of Blingfire. The details can be found in the following notebook:

https://www.kaggle.com/code/emmermarcell/create-a-wikipedia-corpus
T
wikipedia
tensorflow.org
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
Explore at:
Description
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikipedia', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
Plain text Wikipedia (SimpleEnglish)
kaggle.com
zip
Updated Apr 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ffatty (2024). Plain text Wikipedia (SimpleEnglish) [Dataset]. https://www.kaggle.com/datasets/ffatty/plain-text-wikipedia-simpleenglish
Explore at:
zip(133738695 bytes)Available download formats
Dataset updated
Apr 1, 2024
Authors
Ffatty
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Plain text Wikipedia (SimpleEnglish)

Unsupervised text corpus of all 249,396 articles in the Simple English Wikipedia.

31M tokens

196,000 words

<400 MB uncompressed

Extracted from wikipedia dumps using an open source tool.

Data is remarkably clean and uniform.

There is also a dataset available for the much more massive Full English Wikipedia, generated in the same manner as this: https://www.kaggle.com/datasets/ffatty/plaintext-wikipedia-full-english

Format:

Each article's title appears before the content.

Articles are plain text; they are stripped of all Wiki formatting syntax, including font styles, citations, links, etc.

Articles are concatenated into txt files of ≤ 1MB each.

Sometimes, related articles are found next to each other (see excerpt below). This is probably because they were created or edited around the same time.

Random example excerpt:

(a portion of 1 file; 4 articles, concatenated in place)

Nicotine Nicotine is a drug in tobacco cigarettes, cigars, pipe tobacco, chewing tobacco, vaping liquids and some e-cigarettes. Nicotine is an addictive stimulant that causes the heart to beat faster and makes blood pressure rise. Addiction Addiction is when the body or mind badly wants or needs something in order to work right. When you have an addiction to something it is called being "addicted" or being an "addict". People can be addicted to drugs, cigarettes, alcohol, caffeine, and many other things. Bishop Bishop is a type of clergy in some Christian churches. The bishop is the leader of the Christians and the Christian priests in each diocese. The diocese which a bishop governs is called a bishopric. A bishop may be given the rank of archbishop in an archdiocese. Christian priests in some denominations must be made priests by bishops. Some Christian movements have neither bishops nor priests: Quakers are one example. In the Catholic church, the Pope is chosen by all the cardinals. Tray A tray is a shallow container designed for carrying things. Trays are flat, but with raised edges to stop things from sliding off of them.
Wikipedia Plaintext (2023-07-01)
kaggle.com
Updated Jul 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JJ (2023). Wikipedia Plaintext (2023-07-01) [Dataset]. https://www.kaggle.com/datasets/jjinho/wikipedia-20230701
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 17, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
JJ
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
While other great datasets containing Wikipedia exist, the latest one dates from 2020, and so this is an updated version that contains 6,286,775 articles, titles, text, and categories from the July 1st, 2023 Wikipedia dump.

Articles are sorted in alphanumeric order and separated into parquet files corresponding to the first character of the article title. The data is partitioned into parquet files named a-z, number (titles that began with numbers), and other (titles that began with symbols).

The best place to see it in action is: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam

If you find this dataset helpful, please upvote!
h
simple-wikipedia
huggingface.co
Updated Aug 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 17, 2023
Authors
Rahul Aralikatte
Description
simple-wikipedia

Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
Raw Wikipedia updates 2024
kaggle.com
zip
Updated May 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sheema Zain (2024). Raw Wikipedia updates 2024 [Dataset]. https://www.kaggle.com/datasets/sheemazain/raw-wikipedia-updates-2024
Explore at:
zip(8575597 bytes)Available download formats
Dataset updated
May 24, 2024
Authors
Sheema Zain
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Visit the Wikimedia Dumps page:

Go to the Wikimedia Dumps page. Here you can find the latest dumps for various Wikimedia projects, including Wikipedia.

Select the desired Wikipedia version:

Choose the language version of Wikipedia you are interested in. For example, for English Wikipedia, select the enwiki directory.

Download the latest dump:

Inside the directory for your selected language, you will find several types of dumps. The most commonly used dumps for raw Wikipedia content are:

enwiki-latest-pages-articles.xml.bz2: Contains the current versions of article content.

enwiki-latest-pages-meta-current.xml.bz2: Contains current versions of article content, including page metadata.

Click on the file you want to download.

Handling large files:

These files are typically very large (several gigabytes). Ensure you have sufficient storage and bandwidth to download them.

You may need tools like bzip2 to decompress .bz2 files.

Parsing the dump:

Once you have the dump file, you will need to parse it. The dumps are in XML format, which can be processed using various programming languages and tools. Python, for instance, has libraries like xml.etree.ElementTree for XML parsing.

Alternatively, you can use specialized tools like WikiExtractor, which is a Python script designed to extract and clean text from Wikipedia XML dumps.

Example of Download and Parsing

Here's an example of how you might use Python to download and parse a Wikipedia dump:

Download using Python:

import requests url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2' response = requests.get(url, stream=True) with open('enwiki-latest-pages-articles.xml.bz2', 'wb') as file: for chunk in response.iter_content(chunk_size=1024): if chunk: file.write(chunk) 2. Decompress and parse using WikiExtractor: ```bash # First, ensure you have WikiExtractor installed pip install wikiextractor # Run WikiExtractor to process the dump wikiextractor enwiki-latest-pages-articles.xml.bz2

Important Notes - Ensure you have appropriate storage and processing power to handle large datasets. - Parsing and processing Wikipedia dumps can be resource-intensive, so plan accordingly. - Always check the licensing and usage terms for Wikipedia content to ensure compliance.
E
Plaintext Wikipedia dump 2018
live.european-language-grid.eu
binary format
Updated Feb 24, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
Explore at:
binary formatAvailable download formats
Dataset updated
Feb 24, 2018
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

The script which can be used to get new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded fast).
Also, the format of the dumps changes time to time, so the script will probably eventually stop working one day.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine, I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Explore at:
zip(4293465577 bytes)Available download formats
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields: - name - title of the article. - identifier - ID of the article. - image - main image representing the article's subject. - description - one-sentence description of the article for quick reference. - abstract - lead section, summarizing what the article is about. - infoboxes - parsed information from the side panel (infobox) on the Wikipedia article. - sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes - Compressed: 2GB - Uncompressed: 11GB

Infoboxes + sections + short description - Size of compressed file: 4.12 GB - Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown: - total # of articles analyzed: 6,940,949 - # people found with QID: 1,778,226 - # people found with Category: 158,996 - people found with Biography Project: 76,150 - Total # of people articles found: 2,013,372 - Total # people articles with infoboxes: 1,559,985 End stats - Total number of people articles in this dataset: 1,559,985 - that have a short description: 1,416,701 - that have an infobox: 1,559,985 - that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
E
External References of English Wikipedia (ref-wiki-en)
live.european-language-grid.eu
data.niaid.nih.gov
txt
Updated Mar 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). External References of English Wikipedia (ref-wiki-en) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7625
Explore at:
txtAvailable download formats
Dataset updated
Mar 27, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:
32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.These URLs were filtered to remove file extensions for unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs. The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then randomly chooses amongst the URLs for each host.The content of these webpages were indexed in Apache Solr by Nutch. From Solr we extracted a JSON dump of the content.Many URLs offer a redirect; unfortunately Nutch does not index redirect information. This means that connecting the Wikipedia article (with the pre-direct link) to the downloaded webpage (at the post-redirect link) was complicated. However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) from their original Wikipedia article(s).We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.
The ref-en-wiki corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We thus also collect a smaller complete corpus for the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1000 items. We then scraped the external reference URLs for the Wikipedia article corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:
tstamp: When the webpage was accessedhost: The domain (FQDN post-redirect) from which the webpage was retrieved.title: The title (meta) of the documenturl: The URL (post-redirect) of the webpageQ: The Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage.content: A plain-text encoding of the content of the webpage.
Below we provide an abbreviated example of a line from the corpus:{""tstamp"":""2019-09-26T01:22:43.621Z"",""host"":""geology.isu.edu"",""title"":""Digital Geology of Idaho - Basin And Range"",""url"":""http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm"",""Q"":[810178],""content"":""Digital Geology of Idaho - Basin And Range 1 - Idaho Basement Rock 2 - Belt Supergroup 3 - Rifting & Passive Margin 4 - Accreted Terranes 5 - Thrust Belt 6 - Idaho Batholith 7 - North Idaho & Mining 8 - Challis Volcanics 9 - Basin and Range 10 - Columbia River Basalts 11 - SRP & Yellowstone 12 - Pleistocene Glaciation 13 - Palouse & Lake Missoula 14 - Lake Bonneville Flood 15 - Snake River Plain Aquifer Basin and Range Province - Teritiary Extension General geology of the Basin and Range Province Mechanisms of Basin and Range faulting Idaho Basin and Range south of the Snake River Plain Idaho Basin and Range north of the Snake River Plain Local areas of active and recent Basin & Range faulting: Borah Peak PDF Slideshows: North of SRP , South of SRP , Borah Earthquake Flythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River Vocabulary Words thrust fault Basin and Range Snake River Plain half-graben transfer zone Fly-throughs General geology of the Basin and Range Province The Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ...""},A summary of the files we make available:
ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
Further details can be found in the publication:
Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.
Further material relating to this publication (including code for a proof-of-concept interface) is also available.
Kensho Derived Wikimedia Dataset
kaggle.com
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
Explore at:
zip(8760044227 bytes)Available download formats
Dataset updated
Jan 24, 2020
Authors
Kensho R&D
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Kensho Derived Wikimedia Dataset

Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F972e4157b97efe8c2c5ea17c983b1504%2Fkdwd_header_logos_2.jpg?generation=1580510520532141&alt=media" alt="">

This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.

Example Notebooks

Dive right in by checking out some of our example notebooks:

Introduction to the KDWD Wikipedia sample

Introduction to the KDWD Wikidata sample

Entity aliases and disambiguation candidates from anchor link statistics

Updates / Changelog

initial release 2020-01-31

File Summary

Wikipedia

page.csv (page metadata and Wikipedia-to-Wikidata mapping)

link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)

Wikidata

item.csv (item labels and descriptions in English)

item_aliases.csv (item aliases in English)

property.csv (property labels and descriptions in English)

property_aliases.csv (property aliases in English)

statements.csv (truthy qpq statements)

Three Layers of Data

The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4301984%2F19663d43bade0e92f578255f6e0d9dcd%2Fkensho_wiki_triple_layer.svg?generation=1580347573004185&alt=media" alt="">

Wikipedia Sample

The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

are in the (Main/Article) namespace

are not redirect pages

are not disambiguation pages

a...
📖 Wikipedia Articles in PlainText
kaggle.com
zip
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BwandoWando (2023). 📖 Wikipedia Articles in PlainText [Dataset]. https://www.kaggle.com/datasets/bwandowando/wikipedia-index-and-plaintext-20230801/data
Explore at:
zip(6778044866 bytes)Available download formats
Dataset updated
Dec 16, 2023
Authors
BwandoWando
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
[Update]

Updated dataset with https://dumps.wikimedia.org/enwiki/20231001/ dataset

[Context]

I was inspired by Radek Osmulski's additional Kaggle LLM Science Exam datasets for the Kaggle - LLM Science Exam competition.

I am trying to replicate his dataset creation workflow.

His workflow consists of getting Science and Tech Wikipedia articles and submitting them to ChatGPT3.5 for the creation of additional training data, which is discussed in this Youtube Video.

https://www.youtube.com/watch?v=w4Js5My2KXw" alt="">

[Challenges]

There are challenges that I (and he also mentioned them) encountered revolves around using the Wikipedia's get random article API.

Some but not limited to

You can't control what article you will get, and the competition is about Science and Technology questions

When you get an article that has too little of text, ChatGPT will "Hallucinate" and will do its best to create a question about the little text data it has received

These issues has been discussed here . As a very important first step, is to download the latest complete dump of Wikipedia via the Wikimedia website, which can be found here https://dumps.wikimedia.org/enwiki/yyyymm01/

[Workflow]

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1842206%2F932674abed030512a05d794ae00cf40e%2Fworkflow.png?generation=1692165011555570&alt=media" alt="">

I used WikiExtractor to extract the articles from the 20GB compressed dump as 512MB JSON files. Afterwards, I converted them to compressed csv format (zip).

The same library has a lot of issues running under the Window environment. Issues revolving around encoding and forking/ multiprocessing were encountered. Running under Ubuntu made me finish the whole task

[Files]

There are 28 compressed files, a-z, numbers (0-9), and others (those that start with symbols).

[Cover]

Generated using https://hotpot.ai/
wikipedia-22-12-en-embeddings
huggingface.co
Updated Oct 16, 2006
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cohere (2006). wikipedia-22-12-en-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 16, 2006
Dataset authored and provided by
Coherehttps://cohere.com/
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Wikipedia (en) embedded with cohere.ai multilingual-22-12 encoder

We encoded Wikipedia (en) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

Embeddings

We compute for title+" "+text the embeddings using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings.
h
rag-mini-wikipedia
huggingface.co
Updated May 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 5, 2025
Dataset authored and provided by
RAG Datasets
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
In this huggingface discussion you can share what you used the dataset for. Derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download we generated our own subset using generate.py.
Microsoft Research WikiQA Corpus
kaggle.com
zip
Updated Jan 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saurabh Shahane (2021). Microsoft Research WikiQA Corpus [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/wikiqa-corpus
Explore at:
zip(7080215 bytes)Available download formats
Dataset updated
Jan 24, 2021
Authors
Saurabh Shahane
Description
Context

The WikiQA corpus is a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, dataset authors used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, authors used sentences in this section as the candidate answers. Source - https://msropendata.com/datasets/21032bb1-88bd-4656-9570-3172ae1757f0

Content

Dataset contains 3,047 questions and 29,258 sentences , where 1,473 sentences were labeled as answer sentences to their corresponding question.

Acknowledgements

Dataset Source - https://msropendata.com/datasets/21032bb1-88bd-4656-9570-3172ae1757f0

InProceedings{YangYihMeek:EMNLP2015:WikiQA, author = {Yang, Yi and Yih, Wen-tau and Meek, Christopher}, title = {{WikiQA}: {A} Challenge Dataset for Open-Domain Question Answering}, booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, month = {September}, year = {2015}, address = {Lisbon, Portugal}, publisher = {Association for Computational Linguistics} }

License - Open Use of Data Agreement v1.0
e
PAISÀ Corpus of Italian Web Text
clarin.eurac.edu
Updated Jun 5, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Verena Lyding; Egon Stemle; Claudia Borghetti; Marco Brunello; Sara Castagnoli; Felice Dell’Orletta; Henrik Dittmann; Alessandro Lenci; Vito Pirrelli (2013). PAISÀ Corpus of Italian Web Text [Dataset]. https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/3
Explore at:
Dataset updated
Jun 5, 2013
Authors
Verena Lyding; Egon Stemle; Claudia Borghetti; Marco Brunello; Sara Castagnoli; Felice Dell’Orletta; Henrik Dittmann; Alessandro Lenci; Vito Pirrelli
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been created in the context of the project PAISÀ.

All documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a black-list that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.

The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor.

Once all materials were downloaded, the collection was filtered discarding empty documents or documents containing less than 150 words.

The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
wiki_qa
huggingface.co
opendatalab.com
Updated Jun 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Microsoft (2024). wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 3, 2024
Dataset authored and provided by
Microsofthttp://microsoft.com/
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for "wiki_qa"

Dataset Summary

Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances default

Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
Wikipedia Article Ratings
kaggle.com
huggingface.co
zip
Updated May 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Akiki (2025). Wikipedia Article Ratings [Dataset]. https://www.kaggle.com/datasets/cakiki/wikipedia-article-ratings
Explore at:
zip(458435086 bytes)Available download formats
Dataset updated
May 23, 2025
Authors
Christopher Akiki
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
1-year dump of English Wikipedia article ratings. The dataset includes 47,207,448 records corresponding to 11,801,862 unique ratings posted between July 22, 2011 and July 22, 2012.

The Wikimedia Foundation has been experimenting with a feature to capture reader quality assessments of articles since September 2010. Article Feedback v4 (AFTv4) is a tool allowing readers to rate the quality of an article along 4 different dimensions. The tool has been deployed on the entire English Wikipedia (except for a small number of articles) since July 22, 2011. A new version of the tool, focused on feedback instead of ratings (AFTv5), has been tested in 2012 and deployed to a 10% random sample of articles from the English Wikipedia in July 2012.

Since launching the tool in September 2010, we've continually analyzed the results; see the Research reports, including specific analyses of the call to action and rater expertise.

As of AFTv5, all research reports are hosted on Meta.

This 1-year dump of anonymized rating data was originally made available for download from the DataHub. Real-time rating data can also be accessed via the toolserver.
h
large_spanish_corpus
huggingface.co
Updated Apr 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
José Cañete (2019). large_spanish_corpus [Dataset]. https://huggingface.co/datasets/josecannete/large_spanish_corpus
Explore at:
Dataset updated
Apr 20, 2019
Authors
José Cañete
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora spanning Wikipedia to European parliament notes. Each config contains the data corresponding to a different corpus. For example, "all_wiki" only includes examples from Spanish Wikipedia. By default, the config is set to "combined" which loads all the corpora; with this setting you can also specify the number of samples to return per corpus by configuring the "split" argument.
h
hind_encorp
huggingface.co
opendatalab.com
Updated Mar 22, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
Explore at:
Dataset updated
Mar 22, 2014
Authors
Pavel Rychlý
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally col- lected for the DARPA-TIDES surprise-language con- test in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual sub- corpora, including both written and (for some lan- guages) spoken data for fourteen South Asian lan- guages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and Agriculture domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen,2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominately short sto- ries and novels. There are seven Hindi texts in Inter- corp. Unfortunately, only for three of them the English translation is available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

TED talks 3 held in various languages, primarily English, are equipped with transcripts and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.

The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects starting from typesetting and punctuation over capi- talization, spelling, word choice to sentence structure. A little bit of control could be in principle obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entitity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
b
SpeakGer: A meta-data enriched speech corpus of German state and federal...
berd-platform.de
csv
Updated Jul 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kai-Robin Lange; Kai-Robin Lange; Carsten Jentsch; Carsten Jentsch (2025). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [Dataset]. http://doi.org/10.82939/g3225-rba63
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.82939/g3225-rba63
Dataset updated
Jul 25, 2025
Dataset provided by
BERD@NFDI
Authors
Kai-Robin Lange; Kai-Robin Lange; Carsten Jentsch; Carsten Jentsch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Germany
Description
A dataset of German parliament debates covering 74 years of plenary protocols across all 16 state parliaments of Germany as well as the German Bundestag. The debates are separated into individual speeches which are enriched with meta data identifying the speaker as a member of the parliament (mp).
When using this data set, please cite the original paper "Lange, K.-R., Jentsch, C. (2023). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments. Proceedings of the 3rd Workshop on Computational Linguistics for Political Text Analysis@KONVENS 2023.".
The meta data is separated into two different types: time-specific meta-data that contains only information for a legislative period but can change over time (e.g. the party or constituency of an mp) and meta-data that is considered fixed, such as the birth date or the name of a speaker. The former information are stored aong with the speeches as it is considered temporal information of that point in time, but are additionally stored in the file all_mps_mapping.csv if there is the need to double-check something. The rest of the meta-data are stored in the file all_mps_meta.csv. The meta-data from this file can be matched with a speech by comparing the speaker ID-variable "MPID". The speeches of each parliament are saved in a csv format. Along with the speeches, they contain the following meta-data:
Period: int. The period in which the speech took place
Session: int. The session in which the speech took place
Chair: boolean. The information if the speaker was the chair of the plenary session
Interjection: boolean. The information if the speech is a comment or an interjection from the crowd
Party: list (e.g. ["cdu"] or ["cdu", "fdp"] when having more than one speaker during an interjection). List of the party of the speaker or the parties whom the comment/interjection references
Consituency: string. The consituency of the speaker in the current legislative period
MPID: int. The ID of the speaker, which can be used to get more meta-data from the file all_mps_meta.csv
The file all_mps_meta.csv contains the following meta information:
MPID: int. The ID of the speaker, which can be used to match the mp with his/her speeches.
WikipediaLink: The Link to the mps Wikipedia page
WikiDataLink: The Link to the mps WikiData page
Name: string. The full name of the mp.
Last Name: string. The last name of the mp, found on WikiData. If no last name is given on WikiData, the full name was heuristically cut at the last space to get the information neccessary for splitting the speeches.
Born: string, format: YYYY-MM-DD. Birth date of the mp. If an exact birth date is found on WikiData, this exact date is used. Otherwise, a day in the year of birth given on Wikipedia is used.
SexOrGender: string. Information on the sex or gender of the mp. Disclaimer: This infomation was taken from WikiData, which does not seem to differentiate between sex or gender.
Occupation: list. Occupation(s) of the mp.
Religion: string. Religious believes of the mp.
AbgeordnetenwatchID: int. ID of the mp on the website Abgeordnetenwatch

Facebook

Twitter

Click to copy link

Link copied

Cite

Marcell Emmer (2024). Wikipedia Corpus (2023-03-01) [Dataset]. https://www.kaggle.com/datasets/emmermarcell/wikipedia-corpus-2023-03-01

Wikipedia Corpus (2023-03-01)

A sentence-by-sentence breakdown of the Wikipedia dataset

Explore at:

zip(7253680490 bytes)Available download formats

Dataset updated

Jan 24, 2024

Authors

Marcell Emmer

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

A corpus created from the Hugging Face Wikipedia dataset (https://huggingface.co/datasets/wikipedia). The preprocessing and the creation of this corpus are done using the text_to_sentences method of Blingfire. The details can be found in the following notebook:

https://www.kaggle.com/code/emmermarcell/create-a-wikipedia-corpus

Clear search

Close search

Google apps

Main menu

Wikipedia Corpus (2023-03-01)

wikipedia

Plain text Wikipedia (SimpleEnglish)

Plain text Wikipedia (SimpleEnglish)

Format:

Random example excerpt:

Wikipedia Plaintext (2023-07-01)

simple-wikipedia

Raw Wikipedia updates 2024

Plaintext Wikipedia dump 2018

English Wikipedia People Dataset

Summary

Data Structure

Stats

Maintenance and Support

Initial Data Collection and Normalization

Who are the source language producers?

Attribution

External References of English Wikipedia (ref-wiki-en)

Kensho Derived Wikimedia Dataset

Kensho Derived Wikimedia Dataset

Example Notebooks

Updates / Changelog

File Summary

Three Layers of Data

Wikipedia Sample

📖 Wikipedia Articles in PlainText

[Update]

[Context]

[Challenges]

[Workflow]

[Files]

[Cover]

wikipedia-22-12-en-embeddings

rag-mini-wikipedia

Microsoft Research WikiQA Corpus

Context

Content

Acknowledgements

License - Open Use of Data Agreement v1.0

PAISÀ Corpus of Italian Web Text

wiki_qa

Wikipedia Article Ratings

large_spanish_corpus

hind_encorp

SpeakGer: A meta-data enriched speech corpus of German state and federal...

Wikipedia Corpus (2023-03-01)

A sentence-by-sentence breakdown of the Wikipedia dataset