License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles in all languages. The dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).
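A minimal loading sketch follows, assuming this entry corresponds to the wikimedia/wikipedia dataset on Hugging Face and that its per-language subsets use dated config names such as "20231101.en"; both the repo id and the config name are assumptions, not stated above.

```python
from datasets import load_dataset

# Assumed repo id and config name; adjust to the actual dataset entry and dump date.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Each example is one cleaned article.
article = next(iter(wiki))
print(article["title"])
print(article["text"][:300])
```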
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is an updated and better-extracted version of the wikimedia/wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025 covering 325 languages. This dataset:

- fully renders templates, as it was extracted from HTML rather than markdown dumps
- removes redirects, disambiguation pages, and other non-main-article pages
- includes detailed metadata such as page ID, title, last modified date, wikidata ID, version, and a markdown version of the text
- preserves… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finewiki.
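As a quick way to see the metadata described above, the sketch below streams a single record and lists its fields; the language config name ("en") is an assumption, so check the dataset page for the exact subset names.

```python
from datasets import load_dataset

# "en" as the language config is an assumption; see the dataset page for the real config names.
finewiki = load_dataset("HuggingFaceFW/finewiki", "en", split="train", streaming=True)

record = next(iter(finewiki))
# List the available fields (page ID, title, last modified date, wikidata ID, markdown, ...)
# rather than hard-coding key names that are not spelled out above.
print(sorted(record.keys()))
```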
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. It derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
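A hedged loading sketch, assuming a default configuration with a train split and title/summary style columns; the column names "title" and "summary" are guesses for illustration, not taken from the card text above.

```python
from datasets import load_dataset

summaries = load_dataset("jordiclive/wikipedia-summary-dataset", split="train", streaming=True)

# "title" and "summary" are assumed column names; .get() avoids a KeyError if they differ.
for i, row in enumerate(summaries):
    print(row.get("title"), "->", str(row.get("summary"))[:120])
    if i == 2:
        break
```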
Taiwan Traditional Chinese Wikipedia (zh-tw Wikipedia)
A nearly-complete collection of 2,533,212 Traditional Chinese (zh-tw) Wikipedia pages as of May 2023, one article per row, gathered between May 1, 2023, and May 7, 2023 via the Wikipedia action=query & prop=extracts API. Includes both the original HTML format and an auto-converted Markdown version, which has been processed using vinta/pangu.py. The content matches the Taiwan Traditional Chinese version of the Wikipedia site, with no mixing of Traditional and Simplified characters. For development… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/zh-tw-wikipedia.
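To illustrate the normalization attributed to vinta/pangu.py above, the short sketch below applies its spacing rule, which inserts spaces between CJK characters and adjacent Latin letters or digits; the sample sentence is made up for illustration.

```python
import pangu  # the library behind the auto-converted Markdown's spacing (pip install pangu)

sample = "2023年5月的Wikipedia條目，含HTML與Markdown兩種格式。"
print(pangu.spacing_text(sample))
# roughly: "2023 年 5 月的 Wikipedia 條目，含 HTML 與 Markdown 兩種格式。"
```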
License: other, https://choosealicense.com/licenses/other/
Dataset Card for "wiki_qa"
Dataset Summary
Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
default
Size of downloaded dataset files: 7.10 MB. Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
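A small usage sketch, keeping only the question/sentence pairs labeled as correct answers; the column names follow the standard wiki_qa schema ("question", "answer", "label"), so verify them on the dataset page if anything has changed.

```python
from datasets import load_dataset

wiki_qa = load_dataset("microsoft/wiki_qa", split="train")

# Keep pairs annotated as correct answers (label == 1).
positives = wiki_qa.filter(lambda example: example["label"] == 1)

print(len(wiki_qa), "pairs,", len(positives), "labeled as correct answers")
print(positives[0]["question"], "->", positives[0]["answer"])
```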
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
wikipedia persons masked: a filtered version of the Wikipedia dataset, containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is
Languages
English only
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
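Since the card text above truncates before naming the dataset's mask token, the sketch below demonstrates the fill-mask task itself with a generic model and its own mask token on a hand-written sentence; it illustrates the task, not this dataset's exact format.

```python
from transformers import pipeline

# Generic fill-mask model; the sentence is invented for illustration.
fill = pipeline("fill-mask", model="distilroberta-base")

sentence = f"{fill.tokenizer.mask_token} was a German-born theoretical physicist."
for prediction in fill(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```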
simple-wikipedia
Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (en) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (en) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings.
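A rough similarity-search sketch over the precomputed vectors; the embedding field is assumed to be named "emb", and a stored passage vector stands in for a real query (which would normally be embedded with the same multilingual-22-12 model via the Cohere API).

```python
import itertools
import numpy as np
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-en-embeddings", split="train", streaming=True)
sample = list(itertools.islice(docs, 50))

# "emb" is the assumed name of the embedding field; verify it on the dataset page.
embeddings = np.array([row["emb"] for row in sample], dtype=np.float32)
query_vec = embeddings[0]              # stand-in query vector
scores = embeddings @ query_vec        # dot-product scores

for idx in np.argsort(-scores)[:3]:
    print(round(float(scores[idx]), 3), sample[idx]["title"])
```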
Dataset Card for "wiki40b"
Dataset Summary
Cleaned-up text from 40+ Wikipedia language editions for pages that correspond to entities. The dataset has train/dev/test splits per language. It is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
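A short loading sketch for the per-language splits described above, using "en" as the language config; the field names follow the usual wiki40b schema (wikidata_id, text), so double-check them against the dataset page.

```python
from datasets import load_dataset

wiki40b = load_dataset("google/wiki40b", "en", split="validation", streaming=True)

example = next(iter(wiki40b))
print(example["wikidata_id"])
print(example["text"][:200])   # article text with structural markers such as _START_ARTICLE_
```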
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
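A minimal language-modeling sketch; the config name "wikitext-2-raw-v1" mirrors the upstream WikiText naming and is an assumption for this mirror, so adjust it if the repo exposes different configs.

```python
from datasets import load_dataset

# Config name is an assumption carried over from the upstream wikitext dataset.
wikitext = load_dataset("mindchain/wikitext2", "wikitext-2-raw-v1", split="train")

# Concatenate non-empty lines into one long training text.
text = "\n".join(line for line in wikitext["text"] if line.strip())
print(f"{len(text.split()):,} whitespace-separated tokens")
```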
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
OpenAI, all-MiniLM-L6-v2, GTE-small embeddings for Wikipedia Simple English
Texts and OpenAI embeddings were generated by Stephan Sturges; big thanks for sharing this dataset. Here we added embeddings for all-MiniLM-L6-v2 and GTE-small: 224,482 vectors in total for each model.
Notes
These are the embeddings and corresponding simplified articles from the Wikipedia "Simple English" dump. Please see Wikipedia's licensing for usage information:… See the full description on the dataset page: https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings.
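A hedged retrieval sketch using the same all-MiniLM-L6-v2 model named above to embed a query and score it against stored vectors; the embedding column name ("minilm_embedding") and the presence of a "title" field are guesses, so check the dataset page for the real schema first.

```python
import itertools
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = model.encode("Solar System", normalize_embeddings=True)

rows = list(itertools.islice(
    load_dataset("Supabase/wikipedia-en-embeddings", split="train", streaming=True), 100))

# "minilm_embedding" is an assumed column name, not taken from the card text above.
vectors = np.array([row["minilm_embedding"] for row in rows], dtype=np.float32)
scores = vectors @ query

best = int(np.argmax(scores))
print(float(scores[best]), rows[best].get("title", "<no title field>"))
```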
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is based on the Chinese Wikipedia dump archive from July 20th, 2023. As a data-centric effort, the dataset retains only 254,574 high-quality entries. Specifically:

- Entries of special types such as Template, Category, Wikipedia, File, Topic, Portal, MediaWiki, Draft, and Help were filtered out.
- Some lower-quality entries were filtered out using heuristics and in-house NLU models.
- Some entries with sensitive or controversial content were filtered out.
- Traditional characters were converted to Simplified Chinese and wording was adjusted to match Mainland China usage.

See the full description on the dataset page: https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered.
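A quick check of what the retained entries look like, assuming a single train split; nothing about the schema is stated above, so the sketch only reports counts and column names.

```python
from datasets import load_dataset

zhwiki = load_dataset("pleisto/wikipedia-cn-20230720-filtered", split="train")
print(len(zhwiki), "entries")
print(zhwiki.column_names)
```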
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
quangduc/wikipedia-style dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Wikitext-fr language modeling dataset consists of over 70 million tokens extracted from the set of French Wikipedia articles classified as "quality articles" or "good articles". The aim is to replicate the English benchmark.
License: BSD 3-Clause, https://choosealicense.com/licenses/bsd-3-clause/
Loading dataset without vector embeddings
You can load the raw dataset without vectors, like this:

```python
from datasets import load_dataset

dataset = load_dataset("weaviate/wiki-sample", split="train", streaming=True)
```
Loading dataset with vector embeddings
You can also load the dataset with vectors, like this:

```python
from datasets import load_dataset

dataset = load_dataset("weaviate/wiki-sample", "weaviate-snowflake-arctic-v2", split="train", streaming=True)
```
for item in dataset:… See the full description on the dataset page: https://huggingface.co/datasets/weaviate/wiki-sample.
Dataset Card for Simple Wiki
This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "text", "simplified". Column types: str, str. Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American transgressional… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.
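A hedged training sketch for the "used directly with Sentence Transformers" claim above, treating each (text, simplified) row as an (anchor, positive) pair; the "pair" config name comes from the subset listed above, while the base model and loss are illustrative choices for the v3-style trainer, not prescribed by the card.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

train_dataset = load_dataset("sentence-transformers/simple-wiki", "pair", split="train")

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # illustrative base model
loss = losses.MultipleNegativesRankingLoss(model)   # (text, simplified) as positive pairs

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```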
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for Wikimedia Wikisource
Dataset Summary
Wikisource dataset containing cleaned articles in all languages. The dataset is built from the Wikisource dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikisource text, cleaned to strip markdown and unwanted sections (references, etc.). All language subsets have already been processed for the most recent dump, and you… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/wikisource.
Dataset Card for Wikipedia Sentences (English)
This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.
Dataset Details
Columns: "sentence" Column types: str Examples:{ 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.