100+ datasets found
  1. wikipedia

    • huggingface.co
    • tensorflow.org
    Updated Feb 21, 2023
    Cite
    Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
    Explore at:
    Dataset updated
    Feb 21, 2023
    Dataset authored and provided by
    Online Language Modelling
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
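
    A minimal loading sketch with the Hugging Face datasets library, assuming the loader follows the standard wikipedia-script convention of language and dump-date arguments; the date value below is illustrative, not taken from the card:

    from datasets import load_dataset

    # Load one language split. "language" and "date" values are assumptions;
    # trust_remote_code is needed for script-based datasets in recent versions.
    wiki = load_dataset("olm/wikipedia", language="en", date="20230301", trust_remote_code=True)
    print(wiki["train"][0]["text"][:500])  # first 500 characters of one cleaned article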

  2. wikitext2

    • huggingface.co
    • paperswithcode.com
    Cite
    wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Explore at:
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  3. simple-wiki

    • huggingface.co
    Updated Jan 17, 2025
    Cite
    Sentence Transformers (2025). simple-wiki [Dataset]. https://huggingface.co/datasets/sentence-transformers/simple-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2025
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Simple Wiki

    This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets

      pair subset

    Columns: "text", "simplified"
    Column types: str, str
    Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.

  4. VCR-wiki-en-hard-test

    • huggingface.co
    Updated Jul 31, 2024
    Cite
    VCR (2024). VCR-wiki-en-hard-test [Dataset]. https://huggingface.co/datasets/vcr-org/VCR-wiki-en-hard-test
    Explore at:
    Croissant
    Dataset updated
    Jul 31, 2024
    Dataset authored and provided by
    VCR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The VCR-Wiki Dataset for Visual Caption Restoration (VCR)

    🏠 Paper | 👩🏻‍💻 GitHub | 🤗 Huggingface Datasets | 📏 Evaluation with lmms-eval

    This is the official Hugging Face dataset for VCR-Wiki, a dataset for the Visual Caption Restoration (VCR) task. VCR is designed to measure vision-language models' capability to accurately restore partially obscured texts using pixel-level hints within images. Text-based processing becomes ineffective in VCR, as accurate text restoration… See the full description on the dataset page: https://huggingface.co/datasets/vcr-org/VCR-wiki-en-hard-test.

  5. factoid-wiki

    • huggingface.co
    Cite
    Tong Chen, factoid-wiki [Dataset]. https://huggingface.co/datasets/chentong00/factoid-wiki
    Explore at:
    Croissant
    Authors
    Tong Chen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    chentong00/factoid-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. WikiConvert

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated May 2, 2023
    Cite
    University of Southern California (2023). WikiConvert [Dataset]. https://opendatalab.com/OpenDataLab/WikiConvert
    Explore at:
    Available download formats: zip (220780275 bytes)
    Dataset updated
    May 2, 2023
    Dataset provided by
    University of Southern California
    Institute of Information Science
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Wiki-Convert is a dataset of 900,000+ sentences with precise number annotations from English Wikipedia. It relies on Wiki contributors' annotations in the form of the {{Convert}} template.

  7. WikiBio GPT-3 Hallucination Dataset

    • paperswithcode.com
    Updated Mar 14, 2023
    Cite
    Potsawee Manakul; Adian Liusie; Mark J. F. Gales (2023). WikiBio GPT-3 Hallucination Dataset [Dataset]. https://paperswithcode.com/dataset/wikibio-gpt-3-hallucination-dataset
    Explore at:
    Dataset updated
    Mar 14, 2023
    Authors
    Potsawee Manakul; Adian Liusie; Mark J. F. Gales
    Description

    The WikiBio GPT-3 Hallucination Dataset is a benchmark dataset used for hallucination detection. It is based on Wikipedia biographies (WikiBio) and is specifically designed to evaluate the factuality of text generated by large language models like GPT-3¹². Here are some key details about this dataset:

    Dataset Source: Wikipedia biographies (WikiBio)
    Task: Text classification
    Language: English
    Size Categories: Less than 1,000 samples
    License: Creative Commons Attribution-ShareAlike 3.0 (cc-by-sa-3.0)

    (1) potsawee/wiki_bio_gpt3_hallucination · Datasets at Hugging Face. https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. (2) [2303.08896] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. https://arxiv.org/abs/2303.08896. (3) AI trust and Fujitsu's AI trust technologies for conversational generative AI: Fujitsu. https://www.fujitsu.com/jp/about/research/article/202312-ai-trust-technologies.html. (4) README.md · potsawee/wiki_bio_gpt3_hallucination at main - Hugging Face. https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination/blob/main/README.md. (5) SelfCheckGPT repository. https://github.com/potsawee/selfcheckgpt.

  8. Wiki-40B

    • opendatalab.com
    • tensorflow.org
    zip
    Updated Jan 2, 2024
    Cite
    Google Research (2024). Wiki-40B [Dataset]. https://opendatalab.com/OpenDataLab/Wiki-40B
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 2, 2024
    Dataset provided by
    Google: http://google.com/
    Description

    Cleaned-up text for 40+ Wikipedia language editions, restricted to pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity, and the full Wikipedia article after page processing that removes non-content sections and structured objects.

  9. WIT Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Jun 14, 2023
    Cite
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork (2023). WIT Dataset [Dataset]. https://paperswithcode.com/dataset/wit
    Explore at:
    Dataset updated
    Jun 14, 2023
    Authors
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork
    Description

    Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

    Key Advantages

    A few unique advantages of WIT:

    • The largest multimodal dataset (at the time of writing) by the number of image-text examples.
    • Massively multilingual (the first of its kind), with coverage of 100+ languages.
    • A diverse collection of concepts and real-world entities.
    • Challenging real-world test sets.

  10. wiki-ss-corpus

    • huggingface.co
    Updated Mar 22, 2025
    Cite
    wiki-ss-corpus [Dataset]. https://huggingface.co/datasets/Tevatron/wiki-ss-corpus
    Explore at:
    Croissant
    Dataset updated
    Mar 22, 2025
    Dataset authored and provided by
    Tevatron
    Description

    Tevatron/wiki-ss-corpus dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. wikipedia-human-retrieval-ja

    • huggingface.co
    Updated Jan 15, 2024
    Cite
    Baobab, Inc. (2024). wikipedia-human-retrieval-ja [Dataset]. https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja
    Explore at:
    Croissant
    Dataset updated
    Jan 15, 2024
    Dataset authored and provided by
    Baobab, Inc.
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Japanese Wikipedia Human Retrieval dataset

    This is a Japanese question answering dataset with retrieval over Wikipedia articles by trained human workers.

      Contributors
    

    Yusuke Oda defined the dataset specification, data structure, and the scheme of data collection. Baobab, Inc. carried out data collection, data checking, and formatting.

      About the dataset
    

    Each entry represents a single QA session: given a question sentence, the responsible worker tried… See the full description on the dataset page: https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja.

  12. ESM-2 embeddings for TCR-Epitope Binding Affinity Prediction Task

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 17, 2024
    Cite
    Tony Reina; Tony Reina (2024). ESM-2 embeddings for TCR-Epitope Binding Affinity Prediction Task [Dataset]. http://doi.org/10.5281/zenodo.11894560
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Tony Reina; Tony Reina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.

    A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (or rather, its partners in the immune system) can kill the cancer cell.

    [HuggingFace](https://huggingface.co/facebook/esm2_t36_3B_UR50D) provides a "one-stop shop" to train and deploy AI models. In this case, we use Facebook's open-source [Evolutionary Scale Model (ESM-2)](https://github.com/facebookresearch/esm). These embeddings turn the protein sequences into vectors of numbers that the computer can use in a mathematical model.
    To load them into Python, use the pandas library:
    import pandas as pd

    # Each pickle file holds a pandas DataFrame with the sequence, SMILES,
    # embedding, and label columns described below.
    train_data = pd.read_pickle("train_data.pkl")
    validation_data = pd.read_pickle("validation_data.pkl")
    test_data = pd.read_pickle("test_data.pkl")

    The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.

    The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.

    The tcr column is the CDR3 hypervariable loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.

    The label column is whether the two proteins bind. 0 = No. 1 = Yes.

    The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
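
    As a rough illustration of that last point, here is a hedged sketch of a baseline classifier built on those two embedding columns; scikit-learn, the concatenation of the vectors, and logistic regression are illustrative choices, not part of the dataset:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    train_data = pd.read_pickle("train_data.pkl")
    test_data = pd.read_pickle("test_data.pkl")

    def to_features(df: pd.DataFrame) -> np.ndarray:
        # Stack each embedding column into a matrix and concatenate TCR and epitope features.
        return np.hstack([np.vstack(df["tcr_vector"]), np.vstack(df["epitope_vector"])])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(to_features(train_data), train_data["label"])
    probs = clf.predict_proba(to_features(test_data))[:, 1]
    print("Test AUROC:", roc_auc_score(test_data["label"], probs))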

    From the TDC website:

    T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.

    Weber et al.

    Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJdb database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed representing the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.

    Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.

    Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.

    References:

    1. Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
    2. Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
    3. Dines, Jennifer N., et al. “The immunerace study: A prospective multicohort study of immune response action to covid-19 events with the immunecode™ open access database.” medRxiv (2020).

    Dataset License: CC BY 4.0.

    Contributed by: Anna Weber and Jannis Born.

    The Facebook ESM-2 model has the MIT license and was published in:
    * Zeming Lin et al, Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023). DOI: 10.1126/science.ade2574 https://www.science.org/doi/10.1126/science.ade2574
    HuggingFace has several versions of the trained model.
    Checkpoint name        Number of layers  Number of parameters
    esm2_t48_15B_UR50D     48                15B
    esm2_t36_3B_UR50D      36                3B
    esm2_t33_650M_UR50D    33                650M
    esm2_t30_150M_UR50D    30                150M
    esm2_t12_35M_UR50D     12                35M
    esm2_t6_8M_UR50D       6                 8M
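
    Since the table above lists several checkpoint sizes, here is a hedged sketch of computing a per-sequence embedding with the smallest one via the transformers library; the mean-pooling step and the example sequence are illustrative choices, not the exact procedure used to build this dataset:

    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "facebook/esm2_t6_8M_UR50D"   # smallest checkpoint in the table above
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    seq = "CASSLGQAYEQYF"                # illustrative CDR3-style amino-acid sequence
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # shape: (1, tokens, hidden_dim)
    embedding = hidden.mean(dim=1).squeeze(0)         # one vector per sequence
    print(embedding.shape)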

  13. wiki_toxic

    • huggingface.co
    • opendatalab.com
    Cite
    wiki_toxic [Dataset]. https://huggingface.co/datasets/OxAISH-AL-LLM/wiki_toxic
    Explore at:
    Croissant
    Dataset authored and provided by
    OxAI Safety Hub Active Learning with Large Language Models Labs Team
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Jigsaw Toxic Comment Challenge dataset. This dataset was the basis of a Kaggle competition run by Jigsaw.

  14. kilt_wikipedia

    • huggingface.co
    • opendatalab.com
    Cite
    AI at Meta, kilt_wikipedia [Dataset]. https://huggingface.co/datasets/facebook/kilt_wikipedia
    Explore at:
    Dataset authored and provided by
    AI at Meta
    Description

    KILT-Wikipedia: Wikipedia pre-processed for KILT.

  15. mlk-wiki

    • huggingface.co
    Updated Jun 14, 2024
    Cite
    Florian Pelerin (2024). mlk-wiki [Dataset]. https://huggingface.co/datasets/flpelerin/mlk-wiki
    Explore at:
    Croissant
    Dataset updated
    Jun 14, 2024
    Authors
    Florian Pelerin
    Description

    flpelerin/mlk-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. FEVER Dataset

    • paperswithcode.com
    Updated Mar 16, 2018
    Cite
    James Thorne; Andreas Vlachos; Christos Christodoulopoulos; Arpit Mittal (2018). FEVER Dataset [Dataset]. https://paperswithcode.com/dataset/fever
    Explore at:
    Dataset updated
    Mar 16, 2018
    Authors
    James Thorne; Andreas Vlachos; Christos Christodoulopoulos; Arpit Mittal
    Description

    FEVER is a publicly available dataset for fact extraction and verification against textual sources.

    It consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, systems and annotators need to also return the combination of sentences forming the necessary evidence supporting or refuting the claim.

    The claims were generated by human annotators extracting claims from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering. The verification of each claim was conducted in a separate annotation process by annotators who were aware of the page but not the sentence from which the original claim was extracted; thus, in 31.75% of the claims more than one sentence was considered appropriate evidence. Claims require composition of evidence from multiple sentences in 16.82% of cases. Furthermore, in 12.15% of the claims, this evidence was taken from multiple pages.

  17. wikipedia-22-12-de-embeddings

    • huggingface.co
    Updated Apr 20, 2023
    Cite
    Cohere (2023). wikipedia-22-12-de-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
    Explore at:
    Croissant
    Dataset updated
    Apr 20, 2023
    Dataset authored and provided by
    Cohere: https://cohere.com/
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wikipedia (de) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings
    

    We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
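
    A hedged sketch of dot-product search over these precomputed embeddings; the "emb"/"title"/"text" column names, the streamed sample size, and the query-embedding call are assumptions based on how the related Cohere/wikipedia-22-12 releases are typically used:

    import numpy as np
    import cohere
    from datasets import load_dataset

    docs_stream = load_dataset("Cohere/wikipedia-22-12-de-embeddings", split="train", streaming=True)
    docs = [d for _, d in zip(range(10_000), docs_stream)]   # small sample for the demo
    doc_embs = np.array([d["emb"] for d in docs])

    co = cohere.Client("YOUR_API_KEY")  # placeholder key
    query = "Wer hat die Relativitätstheorie entwickelt?"
    query_emb = np.array(co.embed(texts=[query], model="multilingual-22-12").embeddings[0])

    scores = doc_embs @ query_emb                            # dot-product similarity
    best = int(scores.argmax())
    print(docs[best]["title"], docs[best]["text"][:200])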

  18. SQuAD Dataset

    • paperswithcode.com
    Updated May 16, 2021
    Cite
    Pranav Rajpurkar; Jian Zhang; Konstantin Lopyrev; Percy Liang (2021). SQuAD Dataset [Dataset]. https://paperswithcode.com/dataset/squad
    Explore at:
    Dataset updated
    May 16, 2021
    Authors
    Pranav Rajpurkar; Jian Zhang; Konstantin Lopyrev; Percy Liang
    Description

    The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to each question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.

  19. wiktionary

    • huggingface.co
    Updated Jan 14, 2025
    Cite
    OpenLLM France (2025). wiktionary [Dataset]. https://huggingface.co/datasets/OpenLLM-France/wiktionary
    Explore at:
    Croissant
    Dataset updated
    Jan 14, 2025
    Dataset authored and provided by
    OpenLLM France
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Plain text of French Wiktionary

    Contents: Dataset Description · Size · Example use (python) · Data fields · Notes on data formatting · License · Acknowledgements · Citation

      Dataset Description
    

    This dataset is a plain text version of pages from wiktionary.org in the French language. The text contains no HTML tags or wiki templates; it just includes markdown syntax for headers, lists and tables. See Notes on data formatting for more details. It was created by LINAGORA and OpenLLM France… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/wiktionary.
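
    A hedged loading sketch with the Hugging Face datasets library; the default configuration and the "text" column name are assumptions, so check the card's "Example use (python)" section for the canonical snippet:

    from datasets import load_dataset

    # Stream a few entries instead of downloading the full corpus.
    wikt = load_dataset("OpenLLM-France/wiktionary", split="train", streaming=True)
    for entry in wikt.take(3):
        print(entry["text"][:200])   # markdown-style plain text, per the card above
        print("-" * 40)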

  20. wiki-kaz-v1

    • huggingface.co
    Updated Nov 22, 2024
    Cite
    Aspandiyar Nurimanov (2024). wiki-kaz-v1 [Dataset]. https://huggingface.co/datasets/asspunchman/wiki-kaz-v1
    Explore at:
    Croissant
    Dataset updated
    Nov 22, 2024
    Authors
    Aspandiyar Nurimanov
    Description

    asspunchman/wiki-kaz-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
