100+ datasets found
  1. wikipedia

    • huggingface.co
    • tensorflow.org
    Updated Feb 21, 2023
    + more versions
    Cite
    Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
    Dataset updated
    Feb 21, 2023
    Dataset authored and provided by
    Online Language Modelling
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
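
    A minimal loading sketch for this kind of per-language Wikipedia build is shown below; the language/date arguments, the snapshot date, the trust_remote_code flag, and any extra parsing dependencies (e.g. mwparserfromhell) are assumptions to verify against the dataset page.

        # Hypothetical invocation: check the dataset page for the exact arguments and an
        # available dump date; older snapshots may no longer be downloadable.
        from datasets import load_dataset

        ds = load_dataset("olm/wikipedia", language="en", date="20230301", trust_remote_code=True)
        print(ds["train"][0]["text"][:200])  # "text" is assumed to hold the cleaned article body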

  2. finewiki

    • huggingface.co
    Updated Oct 21, 2025
    + more versions
    Cite
    FineData (2025). finewiki [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/finewiki
    Dataset updated
    Oct 21, 2025
    Dataset authored and provided by
    FineData
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is an updated and better-extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025 covering 325 languages. This dataset:

    - fully renders templates, as it was extracted from HTML rather than markdown dumps
    - removes redirects, disambiguation pages, and other non-main-article pages
    - includes detailed metadata such as page ID, title, last modified date, wikidata ID, version, and a markdown version of the text
    - preserves… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/finewiki.
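
    Since the card highlights per-language subsets and rich metadata, a streaming inspection sketch may be useful; the subset name "en" and the "text" field are assumptions, so check the dataset page for the exact configuration and column names.

        # Hypothetical subset name; the dataset page lists the exact per-language configs.
        from datasets import load_dataset

        ds = load_dataset("HuggingFaceFW/finewiki", "en", split="train", streaming=True)
        for example in ds:
            print(sorted(example.keys()))   # page ID, title, last modified date, wikidata ID, ...
            print(example["text"][:200])    # assumed field name for the article body
            break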

  3. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  4. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
    Cite
    Jordan Clive (2017). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
    Dataset updated
    Sep 15, 2017
    Authors
    Jordan Clive
    Description

    Dataset Summary

    This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.

  5. zh-tw-wikipedia

    • huggingface.co
    Updated May 5, 2023
    Cite
    Pokai Chang (2023). zh-tw-wikipedia [Dataset]. https://huggingface.co/datasets/zetavg/zh-tw-wikipedia
    Dataset updated
    May 5, 2023
    Authors
    Pokai Chang
    Description

    Taiwan Traditional Chinese Wikipedia (zh-tw Wikipedia)

    A nearly-complete collection of 2,533,212 Traditional Chinese (zh-tw) Wikipedia pages as of May 2023, gathered between May 1, 2023, and May 7, 2023. Each article is one row and includes both the original HTML format and an auto-converted Markdown version, which has been processed using vinta/pangu.py. The pages were retrieved via Wikipedia's action=query & prop=extracts API, and the content matches the Taiwan Traditional version of the Wikipedia site, with no mixing of Simplified and Traditional characters. For development… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/zh-tw-wikipedia.

  6. wiki_qa

    • huggingface.co
    • opendatalab.com
    Updated Jun 3, 2024
    Cite
    Microsoft (2024). wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "wiki_qa"

      Dataset Summary

    Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

      Supported Tasks and Leaderboards

    More Information Needed

      Languages

    More Information Needed

      Dataset Structure

      Data Instances

      default

    Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
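
    The corpus is organized as question/candidate-sentence pairs, so a quick look at one labeled pair is a natural first step; the field names below follow the canonical wiki_qa schema and should be verified against the dataset page.

        from datasets import load_dataset

        ds = load_dataset("microsoft/wiki_qa", split="train")
        example = ds[0]
        print(example["question"])
        # label == 1 means the candidate sentence answers the question (assumed schema)
        print(example["answer"], example["label"])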

  7. wikipedia-persons-masked

    • huggingface.co
    Updated May 23, 2009
    Cite
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL (2009). wikipedia-persons-masked [Dataset]. https://huggingface.co/datasets/rcds/wikipedia-persons-masked
    Dataset updated
    May 23, 2009
    Dataset authored and provided by
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    wikipedia persons masked: A filtered version of the wikipedia dataset, with only pages of people

      Dataset Summary

    Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a

      Supported Tasks and Leaderboards

    The dataset supports the task of fill-mask, but can also be used for other tasks such as question answering, e.g. "Who is

      Languages

    English only

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
    
  8. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
    Cite
    Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
    Dataset updated
    Aug 17, 2023
    Authors
    Rahul Aralikatte
    Description

    simple-wikipedia

    Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.

  9. wikipedia-22-12-en-embeddings

    • huggingface.co
    Updated Oct 16, 2006
    + more versions
    Cite
    Cohere (2006). wikipedia-22-12-en-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
    Dataset updated
    Oct 16, 2006
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wikipedia (en) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (en) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings

    We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings.
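
    Because the vectors are precomputed, a small dot-product search over a streamed slice is enough to try the dataset; the "emb" and "text" field names are assumptions, and the query vector must come from the same multilingual-22-12 model (e.g. via Cohere's embed API).

        import numpy as np
        from datasets import load_dataset

        # Stream a small slice and stack the stored vectors (field names assumed).
        docs = list(load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                                 split="train", streaming=True).take(1000))
        doc_embs = np.array([d["emb"] for d in docs], dtype=np.float32)

        def search(query_emb, top_k=3):
            scores = doc_embs @ np.asarray(query_emb, dtype=np.float32)  # dot-product scoring
            best = np.argsort(-scores)[:top_k]
            return [(float(scores[i]), docs[i]["text"]) for i in best]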

  10. wiki40b

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Jun 3, 2024
    + more versions
    Cite
    Google (2024). wiki40b [Dataset]. https://huggingface.co/datasets/google/wiki40b
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    Description

    Dataset Card for "wiki40b"

      Dataset Summary

    Cleaned-up text for 40+ Wikipedia language editions of pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.

  11. wikitext2

    • huggingface.co
    • opendatalab.com
    Updated Oct 21, 2023
    + more versions
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Dataset updated
    Oct 21, 2023
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  12. wikipedia-en-embeddings

    • huggingface.co
    Updated Aug 3, 2023
    Cite
    Supabase (2023). wikipedia-en-embeddings [Dataset]. https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings
    Dataset updated
    Aug 3, 2023
    Dataset authored and provided by
    Supabase
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    OpenAI, all-MiniLM-L6-v2, GTE-small embeddings for Wikipedia Simple English

    Texts and OpenAI embeddings were generated by Stephan Sturges; big thanks for sharing this dataset. Here we added embeddings for all-MiniLM-L6-v2 and GTE-small, for a total of 224,482 vectors per model.

      Notes

    These are the embeddings and corresponding simplified articles from the Wikipedia "Simple English" dump. Please see Wikipedia's licensing for usage information:… See the full description on the dataset page: https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings.

  13. wikipedia-cn-20230720-filtered

    • huggingface.co
    • opendatalab.com
    Updated Jul 20, 2023
    Cite
    Pleisto Inc (2023). wikipedia-cn-20230720-filtered [Dataset]. https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered
    Dataset updated
    Jul 20, 2023
    Dataset authored and provided by
    Pleisto Inc
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is based on the Chinese Wikipedia dump archive from July 20th, 2023. As a data-centric effort, the dataset retains 254,574 high-quality entries. Specifically:

    - Entries of special types such as Template, Category, Wikipedia, File, Topic, Portal, MediaWiki, Draft, and Help were filtered out.
    - Heuristic methods and an in-house NLU model were used to filter out some lower-quality entries.
    - Some entries with sensitive or controversial content were filtered out.
    - Simplified/Traditional conversion and vocabulary normalization were applied so the text follows mainland Chinese usage.

    See the full description on the dataset page: https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered.

  14. wikipedia-style

    • huggingface.co
    Cite
    quangduc, wikipedia-style [Dataset]. https://huggingface.co/datasets/quangduc/wikipedia-style
    Authors
    quangduc
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The quangduc/wikipedia-style dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  15. clean-wikipedia

    • huggingface.co
    Updated Oct 31, 2008
    Cite
    FineData (2008). clean-wikipedia [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/clean-wikipedia
    Dataset updated
    Oct 31, 2008
    Dataset authored and provided by
    FineData
    Description

    Please see 🌐 FineWiki instead

  16. wikitext_fr

    • huggingface.co
    Updated May 31, 2024
    Cite
    Antoine SIMOULIN (2024). wikitext_fr [Dataset]. https://huggingface.co/datasets/asi/wikitext_fr
    Dataset updated
    May 31, 2024
    Authors
    Antoine SIMOULIN
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Wikitext-fr language modeling dataset consists of over 70 million tokens extracted from the set of French Wikipedia articles classified as "quality articles" or "good articles". The aim is to replicate the English benchmark.

  17. wiki-sample

    • huggingface.co
    Updated Dec 19, 2022
    Cite
    Weaviate (2022). wiki-sample [Dataset]. https://huggingface.co/datasets/weaviate/wiki-sample
    Dataset updated
    Dec 19, 2022
    Dataset authored and provided by
    Weaviate
    License

    https://choosealicense.com/licenses/bsd-3-clause/

    Description

    Loading dataset without vector embeddings

    You can load the raw dataset without vectors, like this:

        from datasets import load_dataset
        dataset = load_dataset("weaviate/wiki-sample", split="train", streaming=True)

      Loading dataset with vector embeddings

    You can also load the dataset with vectors, like this:

        from datasets import load_dataset
        dataset = load_dataset("weaviate/wiki-sample", "weaviate-snowflake-arctic-v2", split="train", streaming=True)

    for item in dataset:… See the full description on the dataset page: https://huggingface.co/datasets/weaviate/wiki-sample.

  18. simple-wiki

    • huggingface.co
    Cite
    Sentence Transformers, simple-wiki [Dataset]. https://huggingface.co/datasets/sentence-transformers/simple-wiki
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Simple Wiki

    This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets

      pair subset

    Columns: "text", "simplified"
    Column types: str, str
    Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American transgressional… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.
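
    Since the card says the pairs can be fed directly to Sentence Transformers, a fine-tuning sketch might look like the following; the "pair" subset name, base model, and loss are illustrative choices rather than instructions from the card.

        from datasets import load_dataset
        from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

        # (text, simplified) pairs trained with an in-batch-negatives objective.
        train_ds = load_dataset("sentence-transformers/simple-wiki", "pair", split="train")
        model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        loss = losses.MultipleNegativesRankingLoss(model)

        trainer = SentenceTransformerTrainer(model=model, train_dataset=train_ds, loss=loss)
        trainer.train()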

  19. wikisource

    • huggingface.co
    Updated Aug 27, 2025
    Cite
    Wikimedia (2025). wikisource [Dataset]. https://huggingface.co/datasets/wikimedia/wikisource
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for Wikimedia Wikisource

      Dataset Summary

    Wikisource dataset containing cleaned articles of all languages. The dataset is built from the Wikisource dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikisource text with cleaning to strip markdown and unwanted sections (references, etc.). All language subsets have already been processed for a recent dump, and you… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/wikisource.

  20. wikipedia-en-sentences

    • huggingface.co
    Updated Jul 15, 2013
    Cite
    Sentence Transformers (2013). wikipedia-en-sentences [Dataset]. https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences
    Dataset updated
    Jul 15, 2013
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Wikipedia Sentences (English)

    This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.

      Dataset Details

    Columns: "sentence"
    Column types: str
    Examples: { 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.
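
    As the card mentions knowledge distillation, one possible setup is to train a small student to reproduce a larger teacher's embeddings on these sentences; the model names, subset size, and MSE objective below are illustrative assumptions, not a recipe from the dataset page.

        from datasets import load_dataset
        from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

        sentences = load_dataset("sentence-transformers/wikipedia-en-sentences",
                                 split="train").select(range(10_000))

        # Both models below produce 384-dimensional embeddings, so the MSE targets line up.
        teacher = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
        student = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2")

        # MSE distillation expects the teacher embeddings in a "label" column.
        train_ds = sentences.map(
            lambda batch: {"label": teacher.encode(batch["sentence"])},
            batched=True, batch_size=256,
        )

        trainer = SentenceTransformerTrainer(
            model=student, train_dataset=train_ds, loss=losses.MSELoss(student)
        )
        trainer.train()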
