100+ datasets found
  1. wikipedia
     Sources: huggingface.co, tensorflow.org
     Updated: Feb 21, 2023
     Cite: Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
     Provided by: Online Language Modelling
     License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/ (license information derived automatically)
     Description: Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).
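     Example (a minimal sketch, not from the dataset card): loading one language split of olm/wikipedia with the Hugging Face datasets library. The date value below is an assumed snapshot identifier and must match one actually published on the dataset page.

         from datasets import load_dataset  # pip install datasets

         # Stream the English split so the full dump is not downloaded.
         # "language" and "date" are config parameters of this repo; the date
         # below is an assumption, replace it with a published snapshot.
         # Newer datasets versions may also require trust_remote_code=True.
         wiki = load_dataset(
             "olm/wikipedia",
             language="en",
             date="20221101",
             split="train",
             streaming=True,
         )

         for article in wiki.take(3):
             print(article["title"])  # each example is one cleaned article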

  2. wikipedia
     Source: huggingface.co
     Updated: Apr 17, 2024
     Cite: Hafedh Hichri (2024). wikipedia [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia
     Authors: Hafedh Hichri
     Description: not-lain/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community.

  3. clean-wikipedia
     Source: huggingface.co
     Updated: Oct 31, 2008
     Cite: clean-wikipedia [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/clean-wikipedia
     Provided by: HuggingFaceFW
     Description: HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community.

  4. kilt_wikipedia
     Sources: huggingface.co, opendatalab.com
     Cite: AI at Meta, kilt_wikipedia [Dataset]. https://huggingface.co/datasets/facebook/kilt_wikipedia
     Provided by: AI at Meta
     Description: KILT-Wikipedia: Wikipedia pre-processed for KILT.

  5. wikitext2
     Sources: huggingface.co, paperswithcode.com, +1 more
     Cite: wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
     Authors: Jan Karsten Kuhnke
     License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/ (license information derived automatically)
     Description: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
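     Example (a minimal sketch): loading this WikiText-2 mirror for language-modelling experiments. The config name "wikitext-2-raw-v1" is assumed from the upstream wikitext dataset and may differ, or be unnecessary, for this mirror.

         from datasets import load_dataset  # pip install datasets

         # Config name is an assumption; drop or adjust it if the mirror
         # exposes a different configuration.
         wikitext = load_dataset("mindchain/wikitext2", "wikitext-2-raw-v1")

         print(wikitext)                     # train / validation / test splits
         print(wikitext["test"][0]["text"])  # raw text lines, many of them empty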

  6. wikipedia-human-retrieval-ja
     Source: huggingface.co
     Updated: Jan 15, 2024
     Cite: Baobab, Inc. (2024). wikipedia-human-retrieval-ja [Dataset]. https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja
     Provided by: Baobab, Inc.
     License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0 (license information derived automatically)
     Description: Japanese Wikipedia Human Retrieval dataset. This is a Japanese question-answering dataset with retrieval over Wikipedia articles performed by trained human workers. Contributors: Yusuke Oda defined the dataset specification, data structure, and the scheme of data collection; Baobab, Inc. operated data collection, data checking, and formatting. Each entry represents a single QA session: given a question sentence, the responsible worker tried… See the full description on the dataset page: https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja.
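     Example (a minimal sketch): because the record structure is only partially described above, the safest first step is to load a split and inspect the fields directly. The split name "train" is an assumption.

         from datasets import load_dataset  # pip install datasets

         qa = load_dataset("baobab-trees/wikipedia-human-retrieval-ja", split="train")

         print(qa.column_names)  # fields of a QA session (question, retrieved evidence, ...)
         print(qa[0])            # one full session as collected by a human worker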

  7. wikipedia-data-en-2023-11
     Source: huggingface.co
     Updated: Dec 1, 2023
     Cite: Mixedbread (2023). wikipedia-data-en-2023-11 [Dataset]. https://huggingface.co/datasets/mixedbread-ai/wikipedia-data-en-2023-11
     Provided by: Mixedbread
     Description: mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community.

  8. wikipedia-22-12-de-embeddings
     Source: huggingface.co
     Updated: Apr 20, 2023
     Cite: Cohere (2023). wikipedia-22-12-de-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
     Provided by: Cohere (https://cohere.com/)
     License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0 (license information derived automatically)
     Description: Wikipedia (de) embedded with the cohere.ai multilingual-22-12 encoder. We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12. Embeddings: we compute embeddings of title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
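     Example (a minimal sketch): dot-product search over the precomputed passage embeddings. The column names ("emb", "title", "text") and the Cohere embed call are assumptions based on the wikipedia-22-12 family and the v4 Python SDK; verify both against the dataset card.

         import numpy as np
         import cohere                      # pip install cohere
         from datasets import load_dataset  # pip install datasets

         # Take a small slice of the precomputed passage embeddings for the demo.
         docs = load_dataset("Cohere/wikipedia-22-12-de-embeddings",
                             split="train", streaming=True)
         subset = list(docs.take(1000))
         doc_emb = np.array([d["emb"] for d in subset])   # "emb" is assumed

         # Embed the query with the same multilingual-22-12 model (API key placeholder).
         co = cohere.Client("YOUR_API_KEY")
         query = "Wer erfand den Buchdruck?"
         q_emb = np.array(co.embed(texts=[query], model="multilingual-22-12").embeddings[0])

         scores = doc_emb @ q_emb                         # dot-product relevance
         best = int(np.argmax(scores))
         print(subset[best]["title"], subset[best]["text"][:200])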

  9. simple-wiki
     Source: huggingface.co
     Updated: Jan 17, 2025
     Cite: Sentence Transformers (2025). simple-wiki [Dataset]. https://huggingface.co/datasets/sentence-transformers/simple-wiki
     Provided by: Sentence Transformers
     Description: This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models. Subsets: "pair" (columns "text" and "simplified", both str). Example: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.

  10. Cephalo-Wikipedia-Materials
     Source: huggingface.co
     Updated: Jun 2, 2024
     Cite: LAMM: MIT Laboratory for Atomistic and Molecular Mechanics (2024). Cephalo-Wikipedia-Materials [Dataset]. https://huggingface.co/datasets/lamm-mit/Cephalo-Wikipedia-Materials
     Provided by: LAMM: MIT Laboratory for Atomistic and Molecular Mechanics
     Description: This dataset is used to train the Cephalo models. Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs) designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks. A novel aspect of Cephalo's development is its dataset generation method: the extraction process employs advanced algorithms to accurately detect and separate images and their corresponding textual… See the full description on the dataset page: https://huggingface.co/datasets/lamm-mit/Cephalo-Wikipedia-Materials.

  11. wikipedia-filtered
     Source: huggingface.co
     Updated: Dec 8, 2016
     Cite: Camille Harris (2016). wikipedia-filtered [Dataset]. https://huggingface.co/datasets/charris/wikipedia-filtered
     Authors: Camille Harris
     Description: charris/wikipedia-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community.

  12. wikipedia-nq-corpus
     Source: huggingface.co
     Updated: Dec 23, 2021
     Cite: Tevatron (2021). wikipedia-nq-corpus [Dataset]. https://huggingface.co/datasets/Tevatron/wikipedia-nq-corpus
     Provided by: Tevatron
     Description: Tevatron/wikipedia-nq-corpus dataset hosted on Hugging Face and contributed by the HF Datasets community.

  13. wiki_toxic
     Sources: huggingface.co, opendatalab.com
     Cite: wiki_toxic [Dataset]. https://huggingface.co/datasets/OxAISH-AL-LLM/wiki_toxic
     Provided by: OxAI Safety Hub Active Learning with Large Language Models Labs Team
     License: CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/
     Description: Jigsaw Toxic Comment Challenge dataset. This dataset was the basis of a Kaggle competition run by Jigsaw.
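     Example (a minimal sketch): loading the comments for a toxicity classifier. The field names "comment_text" and "label" are assumptions; print the column names first to confirm them.

         from collections import Counter
         from datasets import load_dataset  # pip install datasets

         toxic = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="train")

         print(toxic.column_names)             # confirm the actual field names
         print(Counter(toxic["label"]))        # class balance (assumed binary labels)
         print(toxic[0]["comment_text"][:200])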

  14. svm-wikipedia-01
     Source: huggingface.co
     Updated: Nov 10, 2024
     Cite: Wangqian (2024). svm-wikipedia-01 [Dataset]. https://huggingface.co/datasets/ieeeeeH/svm-wikipedia-01
     Authors: Wangqian
     License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0 (license information derived automatically)
     Description: ieeeeeH/svm-wikipedia-01 dataset hosted on Hugging Face and contributed by the HF Datasets community.

  15. wikipedia-curated
     Source: huggingface.co
     Cite: wikipedia-curated [Dataset]. https://huggingface.co/datasets/Tevatron/wikipedia-curated
     Provided by: Tevatron
     Description: Tevatron/wikipedia-curated dataset hosted on Hugging Face and contributed by the HF Datasets community.

  16. truncated-american-wikipedia
     Source: huggingface.co
     Updated: Feb 24, 2025
     Cite: Gb (2025). truncated-american-wikipedia [Dataset]. https://huggingface.co/datasets/tcltcl/truncated-american-wikipedia
     Authors: Gb
     Area covered: United States
     Description: tcltcl/truncated-american-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community.

  17. wikipedia-english
     Source: huggingface.co
     Updated: Mar 18, 2025
     Cite: Oriental Lab (2025). wikipedia-english [Dataset]. https://huggingface.co/datasets/oriental-lab/wikipedia-english
     Provided by: Oriental Lab
     Description: oriental-lab/wikipedia-english dataset hosted on Hugging Face and contributed by the HF Datasets community.

  18. blink-pretrain-wikipedia
     Source: huggingface.co
     Updated: Aug 17, 2024
     Cite: Shoumik Gandre (2024). blink-pretrain-wikipedia [Dataset]. https://huggingface.co/datasets/shomez/blink-pretrain-wikipedia
     Authors: Shoumik Gandre
     Description: shomez/blink-pretrain-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community.

  19. WIT Dataset
     Sources: paperswithcode.com, huggingface.co
     Updated: Jun 14, 2023
     Cite: Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork (2023). WIT Dataset [Dataset]. https://paperswithcode.com/dataset/wit
     Authors: Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork
     Description: The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
     Key advantages of WIT: it is the largest multimodal dataset (at the time of writing) by number of image-text examples; it is massively multilingual (the first of its kind), covering over 100 languages; it collects a diverse set of concepts and real-world entities; and it brings forth challenging real-world test sets.
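     Example (a minimal sketch): streaming a few WIT examples from a Hugging Face mirror. The repo id "wikimedia/wit_base" is an assumption; this entry only links the Papers with Code page, so substitute whichever mirror the dataset page points to.

         from datasets import load_dataset  # pip install datasets

         wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

         for example in wit.take(2):
             # Inspect the available fields rather than assuming them.
             print(sorted(example.keys()))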

  20. wikipedia-en-20241020
     Source: huggingface.co
     Updated: Oct 20, 2024
     Cite: Bigge (2024). wikipedia-en-20241020 [Dataset]. https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020
     Authors: Bigge
     Description: burgerbee/wikipedia-en-20241020 dataset hosted on Hugging Face and contributed by the HF Datasets community.
