Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
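As a rough illustration of how one of these per-language splits can be consumed, a minimal loading sketch follows; the dataset identifier ("wikipedia"), the config name "20220301.en", and the "title"/"text" columns are typical for this card but should be verified on the dataset page:
from datasets import load_dataset

# Load one language split of the cleaned Wikipedia dump; the dump date and
# language are encoded in the config name (an English example here).
wiki = load_dataset("wikipedia", "20220301.en", split="train")

article = wiki[0]
print(article["title"])
print(article["text"][:300])  # cleaned article body with markup stripped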
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
Wikipedia
Source: https://huggingface.co/datasets/wikipedia
Num examples: 1,281,412
Language: Vietnamese
from datasets import load_dataset

# Load the Vietnamese Wikipedia articles.
dataset = load_dataset("tdtunlp/wikipedia_vi")
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains a pair of equivalent sentences, formatted as a dictionary whose "set" key holds the list of sentences: {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
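A minimal sketch of reading these pairs, assuming a standard "train" split:
from datasets import load_dataset

pairs = load_dataset("embedding-data/simple-wiki", split="train")

# Each example is a dict; the "set" key holds a list of equivalent sentences.
example = pairs[0]["set"]
print(example[0])
print(example[1])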
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
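A minimal semantic-search sketch over these precomputed vectors; the "emb"/"title"/"text" column names, the "train" split, and the Cohere client call are assumptions to be checked against the dataset page and the current Cohere SDK, and an API key is required:
import numpy as np
import cohere
from datasets import load_dataset

# Stream a small slice of the precomputed document embeddings.
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)
subset = [d for _, d in zip(range(1000), docs)]
doc_embs = np.array([d["emb"] for d in subset])

# Embed the query with the same multilingual-22-12 model (hypothetical key).
co = cohere.Client("YOUR_API_KEY")
query_emb = np.array(co.embed(texts=["Who founded Wikipedia?"], model="multilingual-22-12").embeddings[0])

# Rank documents by dot product with the query embedding.
best = subset[int(np.argmax(doc_embs @ query_emb))]
print(best["title"], best["text"][:200])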
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
wikipedia persons masked: a filtered version of the Wikipedia dataset containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a mask token.
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is <mask>?"
Languages
English only.
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
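A minimal fill-mask sketch in the spirit of this dataset; the example sentence and the model are illustrative, and the dataset's own mask marker may need to be mapped to the model's mask token (here "[MASK]"):
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# A masked-person sentence in the style of the dataset.
masked = "[MASK] was a German-born theoretical physicist who developed the theory of relativity."
for prediction in fill(masked)[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))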
GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Embedding model used to embed each article.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Stream the English Wikipedia dump from 2023-11-01.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September 2017. The dataset differs from the regular Wikipedia dump, and from the datasets that can be created with gensim, in that it contains the extracted summaries rather than the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
Multilingual Embeddings for Wikipedia in 300+ Languages
This dataset contains the wikimedia/wikipedia dump from 2023-11-01 in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
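A hedged retrieval-augmented sketch using these precomputed chunk embeddings; the "simple" config name, the "emb"/"title"/"text" columns, and the Cohere Embed V3 call (with input_type="search_query") should be verified against the dataset page and the current Cohere SDK, and an API key is required:
import numpy as np
import cohere
from datasets import load_dataset

# Stream a small slice of one language config of the chunked, embedded corpus.
docs = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", "simple", split="train", streaming=True)
chunks = [d for _, d in zip(range(2000), docs)]
chunk_embs = np.array([d["emb"] for d in chunks])

# Embed the question as a search query with the same Embed V3 model (hypothetical key).
co = cohere.Client("YOUR_API_KEY")
question = "When was Wikipedia launched?"
q = co.embed(texts=[question], model="embed-multilingual-v3.0", input_type="search_query")
q_emb = np.array(q.embeddings[0])

# Take the top-3 chunks by dot product and stuff them into a RAG prompt.
top = np.argsort(chunk_embs @ q_emb)[::-1][:3]
context = "\n\n".join(chunks[i]["text"] for i in top)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)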
tcltcl/small-simple-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Taiwan Traditional Chinese Wikipedia (zh-tw Wikipedia)
A nearly complete collection of the Taiwan Traditional Chinese (zh-tw) text of 2,533,212 Chinese Wikipedia articles as of May 2023, gathered between May 1 and May 7, 2023. Each article is one row and includes both the original HTML and an auto-converted Markdown version, the latter post-processed with vinta/pangu.py. The content was retrieved via Wikipedia's action=query & prop=extracts API and is identical to the Taiwan Traditional rendering of the Wikipedia site, with no mixing of Traditional and Simplified characters. For development… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/zh-tw-wikipedia.
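The exact column names for the HTML and Markdown renditions are not stated above, so a minimal sketch would first inspect the schema; the "train" split is an assumption:
from datasets import load_dataset

# Stream one page and look at its fields.
pages = load_dataset("zetavg/zh-tw-wikipedia", split="train", streaming=True)
page = next(iter(pages))
print(page.keys())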
Dataset Card for Wikipedia Sentences (English)
This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.
Dataset Details
Columns: "sentence" Column types: str Examples:{ 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
upprize/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-te_wikipedia
wikipedia
Dataset uid: wikipedia
Description
Homepage
Licensing
Speaker Locations
Sizes
3.2299 % of total
4.2071 % of en
5.6773 % of ar
3.3416 % of fr
5.2815 % of es
12.4852 % of ca
0.4288 % of zh
0.4286 % of zh
5.4743 % of indic-bn
8.9062 % of indic-ta
21.3313 % of indic-te
4.4845 % of pt
4.0493 % of indic-hi
11.3163 % of indic-ml
22.5300 % of indic-ur
4.4902 % of vi
16.9916 % of… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-te_wikipedia.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for llm-book/ner-wikipedia-dataset
This is the "Wikipediaを用いた日本語の固有表現抽出データセット" (a Japanese named entity extraction dataset built from Wikipedia), Version 2.0, created by Stockmark Inc. and used in the book 大規模言語モデル入門 (Introduction to Large Language Models). It uses the dataset published in the GitHub repository stockmarkteam/ner-wikipedia-dataset.
Citation
@inproceedings{omi-2021-wikipedia,
  title = "Wikipediaを用いた日本語の固有表現抽出のデータセットの構築",
  author = "近江 崇宏",
  booktitle = "言語処理学会第27回年次大会",
  year = "2021",
  url = "https://anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P2-7.pdf",
}
Licence… See the full description on the dataset page: https://huggingface.co/datasets/llm-book/ner-wikipedia-dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple disjoint text blocks of 100 words, which serve as passages.
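A hedged retrieval sketch; the dataset identifier ("wiki_dpr" with the "psgs_w100.nq.no_index" config) and its "embeddings"/"text" columns are assumptions about which Hub dataset this card refers to, faiss must be installed, and the full 21M-passage index is large, so a subset may be more practical for experimentation:
import torch
from datasets import load_dataset
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Passages with precomputed DPR passage embeddings; building the FAISS index
# over all 21M rows needs substantial memory and disk.
passages = load_dataset("wiki_dpr", "psgs_w100.nq.no_index", split="train")
passages.add_faiss_index(column="embeddings")

tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Encode the question with the DPR question encoder.
with torch.no_grad():
    inputs = tok("who wrote the wealth of nations", return_tensors="pt")
    q_emb = encoder(**inputs).pooler_output[0].numpy()

scores, hits = passages.get_nearest_examples("embeddings", q_emb, k=5)
for text in hits["text"]:
    print(text[:120])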
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation". GitHub repo. Original paper.
@inproceedings{zheng-etal-2020-diffks,
  title = "{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
  author = "Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
  booktitle = "Findings of EMNLP",
  year = "2020"
}
The Wikipedia Corpus in MedRAG
This HF dataset contains the chunked snippets from the Wikipedia corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).
News
(02/26/2024) The "id" column has been reformatted. A new "wiki_id" column is added.
Dataset Details
Dataset Descriptions
As a large-scale open-source encyclopedia, Wikipedia is frequently used as a corpus in information retrieval tasks. We select Wikipedia as one… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/wikipedia.
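A minimal sketch for inspecting this corpus; the "train" split is an assumption, while the "id" and "wiki_id" columns come from the note above:
from datasets import load_dataset

# Stream the chunked snippets and look at one record's schema.
corpus = load_dataset("MedRAG/wikipedia", split="train", streaming=True)
chunk = next(iter(corpus))
print(chunk.keys())
print(chunk["id"], chunk["wiki_id"])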