100+ datasets found

wikipedia
huggingface.co
tensorflow.org
Updated Feb 21, 2023
Cite
Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
Dataset updated
Feb 21, 2023
Dataset authored and provided by
Online Language Modelling
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
rag-mini-wikipedia
huggingface.co
Updated May 5, 2025
Cite
RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 5, 2025
Dataset authored and provided by
RAG Datasets
License
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
simple-wiki
huggingface.co
Cite
Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
Explore at:
Croissant
Dataset authored and provided by
Embedding Training Data
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for "simple-wiki"

Dataset Summary

This dataset contains pairs of equivalent sentences obtained from Wikipedia.

Supported Tasks

Sentence Transformers training; useful for semantic search and sentence similarity.

Languages

English.

Dataset Structure

Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
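The record layout described above (a dictionary with a "set" key holding a list of equivalent sentences) can be turned into training pairs along these lines. This is a minimal sketch: the sample sentences are invented for illustration, and the helper name `iter_pairs` is hypothetical rather than part of the dataset or any library.

```python
import json
from itertools import combinations

# Hypothetical sample records in the described {"set": [...]} shape.
# The sentences are invented for illustration, not taken from the dataset.
SAMPLE_LINES = [
    '{"set": ["The cat sat on the mat.", "A cat was sitting on the mat."]}',
    '{"set": ["Paris is the capital of France.", '
    '"The French capital is Paris.", "The capital of France is Paris."]}',
]


def iter_pairs(lines):
    """Yield every unordered sentence pair from each record's "set" list."""
    for line in lines:
        record = json.loads(line)
        # A set of n equivalent sentences yields n * (n - 1) / 2 pairs.
        yield from combinations(record["set"], 2)


pairs = list(iter_pairs(SAMPLE_LINES))
for first, second in pairs:
    print(first, "<->", second)
```

Expanding each set into all pairwise combinations is one possible design choice; a training setup could equally sample a single pair per set.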
wikitext
huggingface.co
Cite
Salesforce, wikitext [Dataset]. https://huggingface.co/datasets/Salesforce/wikitext
Explore at:
Croissant
Dataset provided by
Salesforce Inc http://salesforce.com/
Authors
Salesforce
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for "wikitext"

Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
structured-wikipedia
huggingface.co
Updated Sep 16, 2024
Cite
Wikimedia (2024). structured-wikipedia [Dataset]. https://huggingface.co/datasets/wikimedia/structured-wikipedia
Dataset updated
Sep 16, 2024
Dataset provided by
Wikimedia Foundation http://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for Wikimedia Structured Wikipedia

Dataset Description Dataset Summary

Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
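Reading "JSONL compressed as zip" can be sketched with the Python standard library alone. The example below builds a tiny in-memory archive so it runs without downloading anything. The member name `sample.jsonl` and the `name` field are assumptions made for illustration, since the actual schema is not shown in the description above.

```python
import io
import json
import zipfile


def iter_jsonl_from_zip(zip_source):
    """Yield one parsed JSON object per non-empty line of every .jsonl member."""
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.endswith(".jsonl"):
                continue
            with archive.open(member) as raw:
                # Decode the binary stream lazily, one line (one article) at a time.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    if line.strip():
                        yield json.loads(line)


# Build a small in-memory archive as a stand-in for a downloaded file.
# The "name" field is hypothetical, not the dataset's documented schema.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "sample.jsonl",
        '{"name": "Article A"}\n{"name": "Article B"}\n',
    )
buffer.seek(0)

articles = list(iter_jsonl_from_zip(buffer))
print([article["name"] for article in articles])
```

Streaming line by line avoids holding a whole language edition in memory, which matters when each line is a full article.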
wiki_qa
huggingface.co
opendatalab.com
Cite
Microsoft, wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
Explore at:
Croissant
Dataset authored and provided by
Microsoft http://microsoft.com/
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "wiki_qa"

Dataset Summary

Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure Data Instances default

Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
wikipedia-22-12-simple-embeddings
huggingface.co
opendatalab.com
Updated Mar 29, 2023
Cite
Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
Explore at:
Croissant
Dataset updated
Mar 29, 2023
Dataset authored and provided by
Cohere https://cohere.com/
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

Embeddings

We compute for title+" "+text the embeddings using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
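The title+" "+text construction mentioned above, and the kind of semantic search such embeddings enable, can be illustrated with a toy example. Everything numeric here is an assumption: the two-dimensional vectors are hand-written stand-ins for real model embeddings, the documents are invented, and ranking by dot product is one common similarity choice rather than something the description confirms.

```python
def build_embedding_input(title, text):
    """Concatenate title and text with a single space, as the description states."""
    return title + " " + text


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(query_vec, doc_vecs):
    """Return document indices ordered from highest to lowest score."""
    scores = [dot(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Invented documents; a real pipeline would send `inputs` to an embedding model.
docs = [
    {"title": "Cat", "text": "A small domesticated animal."},
    {"title": "Paris", "text": "The capital of France."},
]
inputs = [build_embedding_input(d["title"], d["text"]) for d in docs]

# Toy vectors standing in for model output (assumption, not real embeddings).
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0]]
toy_query_vec = [0.2, 0.9]

ranking = rank_by_dot_product(toy_query_vec, toy_doc_vecs)
print(inputs)
print(ranking)
```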
wikipedia-persons-masked
huggingface.co
Updated May 23, 2009
Cite
Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL (2009). wikipedia-persons-masked [Dataset]. https://huggingface.co/datasets/rcds/wikipedia-persons-masked
Explore at:
Croissant
Dataset updated
May 23, 2009
Dataset authored and provided by
Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
wikipedia persons masked: A filtered version of the wikipedia dataset, with only pages of people

Dataset Summary

Contains ~70k pages from wikipedia, each describing a person. For each page, the person described in the text is masked with a

Supported Tasks and Leaderboards

The dataset supports the tasks of fill-mask, but can also be used for other tasks such as question answering, e.g. "Who is

Languages

English only

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
wikipedia-small-3000-embedded
huggingface.co
Updated Apr 6, 2024
Cite
Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
Explore at:
Croissant
Dataset updated
Apr 6, 2024
Authors
Hafedh Hichri
License
https://choosealicense.com/licenses/gfdl/
Description
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# load dataset in streaming mode (no download and it's fast)
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# select 3000 samples
from tqdm import tqdm
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
clean-wikipedia
huggingface.co
Updated Oct 31, 2008
Cite
FineData (2008). clean-wikipedia [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/clean-wikipedia
Dataset updated
Oct 31, 2008
Dataset authored and provided by
FineData
Description
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
simple-wikipedia
huggingface.co
Updated Aug 17, 2023
Cite
Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
Explore at:
Croissant
Dataset updated
Aug 17, 2023
Authors
Rahul Aralikatte
Description
simple-wikipedia

Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
wikipedia-summary-dataset
huggingface.co
Updated Sep 15, 2017
Cite
Jordan Clive (2017). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
Explore at:
Croissant
Dataset updated
Sep 15, 2017
Authors
Jordan Clive
Description
Dataset Summary

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
Wiki-UQA
huggingface.co
Updated Jul 13, 2024
Cite
UQA (2024). Wiki-UQA [Dataset]. https://huggingface.co/datasets/uqa/Wiki-UQA
Explore at:
Croissant
Dataset updated
Jul 13, 2024
Dataset authored and provided by
UQA
Description
uqa/Wiki-UQA dataset hosted on Hugging Face and contributed by the HF Datasets community
wikipedia-en-sentences
huggingface.co
Updated Jul 15, 2013
Cite
Sentence Transformers (2013). wikipedia-en-sentences [Dataset]. https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences
Explore at:
Croissant
Dataset updated
Jul 15, 2013
Dataset authored and provided by
Sentence Transformers
Description
Dataset Card for Wikipedia Sentences (English)

This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.

Dataset Details

Columns: "sentence"
Column types: str
Examples: { 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.
wikisource
huggingface.co
Updated Feb 18, 2024
Cite
Wikimedia (2024). wikisource [Dataset]. https://huggingface.co/datasets/wikimedia/wikisource
Explore at:
Croissant
Dataset updated
Feb 18, 2024
Dataset provided by
Wikimedia Foundation http://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset Card for Wikimedia Wikisource

Dataset Summary

Wikisource dataset containing cleaned articles of all languages. The dataset is built from the Wikisource dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikisource text with cleaning to strip markdown and unwanted sections (references, etc.). All language subsets have already been processed for recent dump, and you… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/wikisource.
wiki_auto
huggingface.co
opendatalab.com
Updated May 29, 2024
Cite
Chao Jiang (2024). wiki_auto [Dataset]. https://huggingface.co/datasets/chaojiang06/wiki_auto
Dataset updated
May 29, 2024
Authors
Chao Jiang
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. The authors first crowd-sourced a set of manual alignments between sentences in a subset of the Simple English Wikipedia and their corresponding versions in English Wikipedia (this corresponds to the manual config), then trained a neural CRF system to predict these alignments. The trained model was then applied to the other articles in Simple English Wikipedia with an English counterpart to create a larger corpus of aligned sentences (corresponding to the auto, auto_acl, auto_full_no_split, and auto_full_with_split configs here).
wikipedia-2023-11-embed-multilingual-v3
huggingface.co
Updated Nov 1, 2023
Cite
Cohere (2023). wikipedia-2023-11-embed-multilingual-v3 [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
Explore at:
Croissant
Dataset updated
Nov 1, 2023
Dataset authored and provided by
Cohere
Description
Multilingual Embeddings for Wikipedia in 300+ Languages

This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total, it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
wiki_movies
huggingface.co
Updated Dec 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AI at Meta (2020). wiki_movies [Dataset]. https://huggingface.co/datasets/facebook/wiki_movies
Dataset updated
Dec 14, 2020
Dataset authored and provided by
AI at Meta
License
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
The WikiMovies dataset consists of roughly 100k (templated) questions over 75k entities based on questions with answers in the open movie database (OMDb).
wiki_dpr
huggingface.co
Updated May 29, 2024
Cite
AI at Meta (2024). wiki_dpr [Dataset]. https://huggingface.co/datasets/facebook/wiki_dpr
Dataset updated
May 29, 2024
Dataset authored and provided by
AI at Meta
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This is the wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from wikipedia along with their DPR embeddings. The wikipedia articles were split into multiple, disjoint text blocks of 100 words as passages.
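Splitting an article into disjoint 100-word blocks, as described above, might look roughly like the following. This sketch assumes plain whitespace tokenization and keeps a shorter final block; the original preprocessing may tokenize or handle the remainder differently, and the function name `split_into_passages` is hypothetical.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping blocks of words.

    Words are separated on whitespace (an assumption); the last block may be
    shorter than `words_per_passage`.
    """
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# Synthetic 250-word "article" used only to demonstrate the block sizes.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(passage.split()) for passage in passages])
```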
goodwiki
huggingface.co
opendatalab.com
Updated Sep 4, 2023
Cite
Euirim Choi (2023). goodwiki [Dataset]. https://huggingface.co/datasets/euirim/goodwiki
Explore at:
Croissant
Dataset updated
Sep 4, 2023
Authors
Euirim Choi
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
GoodWiki Dataset

GoodWiki is a 179 million token dataset of English Wikipedia articles collected on September 4, 2023, that have been marked as Good or Featured by Wikipedia editors. The dataset provides these articles in GitHub-flavored Markdown format, preserving layout features like lists, code blocks, math, and block quotes, unlike many other public Wikipedia datasets. Articles are accompanied by a short description of the page as well as any associated categories. Thanks to a… See the full description on the dataset page: https://huggingface.co/datasets/euirim/goodwiki.
