100+ datasets found
  1. wikipedia

    • huggingface.co
    • tensorflow.org
    Updated Feb 21, 2023
    + more versions
    Cite
    Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
    Cited by 3 scholarly articles (view in Google Scholar)
    Dataset authored and provided by
    Online Language Modelling
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
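    The cleaning step the description mentions can be illustrated with a minimal stdlib sketch. This is an assumption-laden illustration, not the dataset's actual pipeline; the regex, the section names, and the function name are all made up for the example:

    ```python
    import re

    # Illustrative only: drop trailing sections such as "References" or
    # "External links" from a plain-text article, the kind of cleanup the
    # dataset card describes. The real pipeline operates on wiki markup.
    SECTION_RE = re.compile(r"\n(References|External links|See also)\n.*$", re.S)

    def strip_unwanted_sections(article: str) -> str:
        """Return the article text with trailing unwanted sections removed."""
        return SECTION_RE.sub("", article)

    article = "Python\nPython is a language.\nReferences\n[1] Some cite."
    print(strip_unwanted_sections(article))  # -> "Python\nPython is a language."
    ```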

  2. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; the authors generated their own subset using generate.py. A Hugging Face discussion thread is available where you can share what you used the dataset for.

  3. wikipedia_vi

    • huggingface.co
    Updated Mar 31, 2023
    Cite
    VietGPT (2023). wikipedia_vi [Dataset]. https://huggingface.co/datasets/vietgpt/wikipedia_vi
    Dataset authored and provided by
    VietGPT
    Description

    Wikipedia

    Source: https://huggingface.co/datasets/wikipedia
    Num examples: 1,281,412
    Language: Vietnamese

    from datasets import load_dataset

    dataset = load_dataset("tdtunlp/wikipedia_vi")

  4. simple-wiki

    • huggingface.co
    Cite
    Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "simple-wiki"

      Dataset Summary

    This dataset contains pairs of equivalent sentences obtained from Wikipedia.

      Supported Tasks

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages

    English.

      Dataset Structure

    Each example contains a pair of equivalent sentences and is formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
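    The {"set": [...]} structure described above can be expanded into sentence pairs for training with a few lines of stdlib Python. A sketch with made-up example sentences, not code from the dataset card:

    ```python
    from itertools import combinations

    # Hypothetical rows mimicking the {"set": [...]} structure of the dataset.
    examples = [
        {"set": ["The cat sat on the mat.", "A cat was sitting on the mat."]},
        {"set": ["Paris is the capital of France.",
                 "France's capital city is Paris.",
                 "The French capital is Paris."]},
    ]

    def to_pairs(example):
        """Expand one {"set": [...]} row into all equivalent-sentence pairs."""
        return list(combinations(example["set"], 2))

    pairs = [p for ex in examples for p in to_pairs(ex)]
    print(len(pairs))  # 1 pair from the first row + 3 from the second = 4
    ```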

  5. wikipedia-22-12-simple-embeddings

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    + more versions
    Cite
    Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings

    We compute the embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
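    Once passages and their precomputed embeddings are loaded, semantic search reduces to a nearest-neighbor lookup. A minimal sketch with toy 3-dimensional vectors (real embeddings from this dataset are far higher-dimensional, and the query would have to be embedded with the same Cohere model):

    ```python
    import math

    # Toy stand-ins for the dataset's precomputed passage embeddings.
    passages = {
        "Paris is the capital of France.": [0.9, 0.1, 0.0],
        "The mitochondrion produces ATP.": [0.0, 0.2, 0.9],
    }
    # Toy stand-in for an embedded query such as "What is France's capital?".
    query_emb = [0.8, 0.2, 0.1]

    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Rank passages by similarity to the query and take the best match.
    best = max(passages, key=lambda text: cosine(query_emb, passages[text]))
    print(best)  # -> "Paris is the capital of France."
    ```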

  6. wikipedia-persons-masked

    • huggingface.co
    Updated May 23, 2009
    Cite
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL (2009). wikipedia-persons-masked [Dataset]. https://huggingface.co/datasets/rcds/wikipedia-persons-masked
    Dataset authored and provided by
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    wikipedia persons masked: a filtered version of the wikipedia dataset containing only pages about people

      Dataset Summary

    Contains ~70k pages from wikipedia, each describing a person. For each page, the person described in the text is masked with a

      Supported Tasks and Leaderboards

    The dataset supports the task of fill-mask, but can also be used for other tasks such as question answering, e.g. "Who is

      Languages

    English only.

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
    
  7. wikipedia-small-3000-embedded

    • huggingface.co
    Updated Apr 6, 2024
    Cite
    Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Authors
    Hafedh Hichri
    License

    GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

    from datasets import load_dataset, Dataset
    from sentence_transformers import SentenceTransformer
    from tqdm import tqdm

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    # load the dataset in streaming mode (no download and it's fast)
    dataset = load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )

    # select 3000 samples
    data = Dataset.from_dict({})
    for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.

  8. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
    Cite
    Jordan Clive (2017). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
    Authors
    Jordan Clive
    Description

    Dataset Summary

    This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.

  9. wikipedia-2023-11-embed-multilingual-v3

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Cohere (2023). wikipedia-2023-11-embed-multilingual-v3 [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    Description

    Multilingual Embeddings for Wikipedia in 300+ Languages

    This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.

  10. small-simple-wikipedia

    • huggingface.co
    Updated Feb 24, 2025
    Cite
    Gb (2025). small-simple-wikipedia [Dataset]. https://huggingface.co/datasets/tcltcl/small-simple-wikipedia
    Authors
    Gb
    Description

    tcltcl/small-simple-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. clean-wikipedia

    • huggingface.co
    Updated Sep 18, 2025
    + more versions
    Cite
    FineData (2025). clean-wikipedia [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/clean-wikipedia
    Dataset authored and provided by
    FineData
    Description

    HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. zh-tw-wikipedia

    • huggingface.co
    Updated May 5, 2023
    Cite
    Pokai Chang (2023). zh-tw-wikipedia [Dataset]. https://huggingface.co/datasets/zetavg/zh-tw-wikipedia
    Authors
    Pokai Chang
    Description

    Taiwan Traditional Chinese Wikipedia (zh-tw Wikipedia)

    A nearly-complete collection of 2,533,212 Traditional Chinese (zh-tw) Wikipedia pages as of May 2023, gathered between May 1 and May 7, 2023 via the Wikipedia action=query & prop=extracts API. Each article is one row and includes both the original HTML format and an auto-converted Markdown version, which has been processed using vinta/pangu.py. The content matches the Taiwan Traditional Chinese version of the Wikipedia site, with no mixing of Simplified and Traditional characters. For development… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/zh-tw-wikipedia.

  13. wikipedia-en-sentences

    • huggingface.co
    Updated Jul 15, 2013
    Cite
    Sentence Transformers (2013). wikipedia-en-sentences [Dataset]. https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Wikipedia Sentences (English)

    This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.

      Dataset Details

    Columns: "sentence"
    Column types: str
    Examples: { 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.

  14. wikipedia

    • huggingface.co
    Updated Aug 31, 2025
    + more versions
    Cite
    UpPrize Tech (2025). wikipedia [Dataset]. https://huggingface.co/datasets/upprize/wikipedia
    Authors
    UpPrize Tech
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    upprize/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. roots_indic-te_wikipedia

    • huggingface.co
    Updated Aug 9, 2023
    + more versions
    Cite
    BigScience Data (2023). roots_indic-te_wikipedia [Dataset]. https://huggingface.co/datasets/bigscience-data/roots_indic-te_wikipedia
    Dataset authored and provided by
    BigScience Data
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    ROOTS Subset: roots_indic-te_wikipedia

      wikipedia

    Dataset uid: wikipedia

      Sizes

    3.2299 % of total
    4.2071 % of en
    5.6773 % of ar
    3.3416 % of fr
    5.2815 % of es
    12.4852 % of ca
    0.4288 % of zh
    0.4286 % of zh
    5.4743 % of indic-bn
    8.9062 % of indic-ta
    21.3313 % of indic-te
    4.4845 % of pt
    4.0493 % of indic-hi
    11.3163 % of indic-ml
    22.5300 % of indic-ur
    4.4902 % of vi
    16.9916 % of… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-te_wikipedia.

  16. ner-wikipedia-dataset

    • huggingface.co
    Updated Jul 25, 2023
    Cite
    大規模言語モデル入門 (2023). ner-wikipedia-dataset [Dataset]. https://huggingface.co/datasets/llm-book/ner-wikipedia-dataset
    Dataset authored and provided by
    大規模言語モデル入門 (Introduction to Large Language Models)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for llm-book/ner-wikipedia-dataset

    This is the "Japanese named entity extraction dataset built from Wikipedia" (Version 2.0), created by Stockmark Inc. and used in the book 大規模言語モデル入門 (Introduction to Large Language Models). It uses the dataset published in the GitHub repository stockmarkteam/ner-wikipedia-dataset.

      Citation

    @inproceedings{omi-2021-wikipedia,
      title = "Wikipediaを用いた日本語の固有表現抽出のデータセットの構築",
      author = "近江 崇宏",
      booktitle = "言語処理学会第27回年次大会",
      year = "2021",
      url = "https://anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P2-7.pdf",
    }

      Licence… See the full description on the dataset page: https://huggingface.co/datasets/llm-book/ner-wikipedia-dataset.
    
  17. wiki_dpr

    • huggingface.co
    Updated May 29, 2024
    Cite
    AI at Meta (2024). wiki_dpr [Dataset]. https://huggingface.co/datasets/facebook/wiki_dpr
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from wikipedia along with their DPR embeddings. The wikipedia articles were split into multiple, disjoint text blocks of 100 words as passages.
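    The 100-word disjoint passage scheme described above can be sketched in a few lines of stdlib Python. An illustration under stated assumptions, not DPR's actual preprocessing code:

    ```python
    def split_into_passages(text: str, words_per_block: int = 100):
        """Split an article into disjoint blocks of at most `words_per_block`
        words, mirroring the 100-word passage scheme the description mentions."""
        words = text.split()
        return [
            " ".join(words[i:i + words_per_block])
            for i in range(0, len(words), words_per_block)
        ]

    # A toy 250-word "article" yields two full passages and one remainder.
    article = " ".join(f"w{i}" for i in range(250))
    passages = split_into_passages(article)
    print([len(p.split()) for p in passages])  # -> [100, 100, 50]
    ```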

  18. wizard_of_wikipedia

    • huggingface.co
    Updated Jul 16, 2023
    Cite
    Chujie Zheng (2023). wizard_of_wikipedia [Dataset]. https://huggingface.co/datasets/chujiezheng/wizard_of_wikipedia
    Authors
    Chujie Zheng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation" (GitHub repo; original paper).

    @inproceedings{zheng-etal-2020-diffks,
      title = "{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
      author = "Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
      booktitle = "Findings of EMNLP",
      year = "2020"
    }

  19. wikipedia

    • huggingface.co
    Updated Feb 26, 2024
    Cite
    MedRAG (2024). wikipedia [Dataset]. https://huggingface.co/datasets/MedRAG/wikipedia
    Authors
    MedRAG
    Description

    The Wikipedia Corpus in MedRAG

    This HF dataset contains the chunked snippets from the Wikipedia corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).

      News

    (02/26/2024) The "id" column has been reformatted. A new "wiki_id" column is added.

      Dataset Details

      Dataset Descriptions

    As a large-scale open-source encyclopedia, Wikipedia is frequently used as a corpus in information retrieval tasks. We select Wikipedia as one… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/wikipedia.

  20. kilt_wikipedia

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2019
    Cite
    AI at Meta (2019). kilt_wikipedia [Dataset]. https://huggingface.co/datasets/facebook/kilt_wikipedia
    Dataset authored and provided by
    AI at Meta
    Description

    KILT-Wikipedia: Wikipedia pre-processed for KILT.
