13 datasets found
  1. structured-wikipedia

    • huggingface.co
    Updated Sep 16, 2024
    Cite
    Wikimedia (2024). structured-wikipedia [Dataset]. https://huggingface.co/datasets/wikimedia/structured-wikipedia
    Explore at:
    222 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Sep 16, 2024
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Wikimedia Structured Wikipedia

      Dataset Description

      Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
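
    As a rough sketch of how a JSONL-in-zip release like this could be consumed, assuming a locally downloaded archive (the archive name and the printed keys are illustrative assumptions, not taken from the dataset card):

    ```python
    import json
    import zipfile

    # Hypothetical local archive name; the actual release is split across several files.
    ARCHIVE = "enwiki_articles.jsonl.zip"

    with zipfile.ZipFile(ARCHIVE) as zf:
        for member in zf.namelist():
            with zf.open(member) as fh:
                for raw_line in fh:
                    article = json.loads(raw_line)   # one full article per JSON line
                    # Top-level keys depend on the release schema; inspect them first.
                    print(sorted(article.keys()))
                    break
            break
    ```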

  2. Wikipedia-Articles

    • huggingface.co
    Cite
    Bright Data, Wikipedia-Articles [Dataset]. https://huggingface.co/datasets/BrightData/Wikipedia-Articles
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Bright Data
    License

    Other (see https://choosealicense.com/licenses/other/)

    Description

    Dataset Card for "BrightData/Wikipedia-Articles"

      Dataset Summary
    

    Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
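
    A minimal, hedged sketch of inspecting the record schema with the Hugging Face datasets library in streaming mode (the split name and exact column names are assumptions; the card only lists timestamp, URLs, titles, text, images, "see also" references, external links, and table of contents among the 10 fields):

    ```python
    from datasets import load_dataset

    # Stream so the full ~1.23M-record corpus is not downloaded up front.
    articles = load_dataset("BrightData/Wikipedia-Articles", split="train", streaming=True)

    for record in articles:
        # Exact column names are not given in this listing; list them at runtime.
        print(sorted(record.keys()))
        break
    ```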

  3. Wikipedia corpus for synthetic data made for Handwritten Text Recognition...

    • zenodo.org
    txt, zip
    Updated Jul 3, 2025
    Cite
    Thomas Constum (2025). Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition [Dataset]. http://doi.org/10.1007/s10032-024-00511-9
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jul 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas Constum
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains the corpus necessary for the synthetic data generation of DANIEL, which is available on GitHub and is described in the paper DANIEL: a fast document attention network for information extraction and labelling of handwritten documents, authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

    The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

    The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.

    Contents of the archive:

    • wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.

    • wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.

    • wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.

    • wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.

    Data format for corpora in Hugging Face datasets structure:

    Each record in the datasets follows the dictionary structure below:

    {
    "id": "

  4. wiki40b

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Jun 3, 2024
    + more versions
    Cite
    Google (2024). wiki40b [Dataset]. https://huggingface.co/datasets/google/wiki40b
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    Description

    Dataset Card for "wiki40b"

      Dataset Summary
    

    Cleaned-up text for 40+ Wikipedia language editions of pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
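
    A hedged sketch of loading one language configuration with the Hugging Face datasets library (whether the Hub copy loads directly this way, and the exact column names, should be verified against the dataset page):

    ```python
    from datasets import load_dataset

    # Per-language configs ("en", "fr", ...) with train/validation/test splits.
    wiki = load_dataset("google/wiki40b", "en", split="validation", streaming=True)

    for example in wiki:
        # Field names are assumptions based on the summary above (Wikidata id + article text).
        print(example.get("wikidata_id"), len(example.get("text", "")))
        break
    ```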

  5. Han Instruct Dataset

    • zenodo.org
    csv
    Updated Apr 6, 2024
    + more versions
    Cite
    Wannaphong Phatthiyaphaibun (2024). Han Instruct Dataset [Dataset]. http://doi.org/10.5281/zenodo.10935822
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wannaphong Phatthiyaphaibun
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    🪿 Han (ห่าน or goose) Instruct Dataset is a Thai instruction dataset by PyThaiNLP. It collects Thai instruction-following data from many sources.

    Many questions were collected from the Reference desk at Thai Wikipedia.

    Data sources:

    Supported Tasks and Leaderboards

    • ChatBot
    • Instruction Following

    Languages

    Thai

    Dataset Structure

    Data Fields

    • inputs: Question
    • targets: Answer
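
    A small sketch of reading the CSV release and pairing the two fields above into prompt/response examples (the local file name is hypothetical; the column names come from the field list above):

    ```python
    import csv

    # Hypothetical local path for the CSV file downloaded from the Zenodo record.
    with open("han_instruct.csv", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            prompt, response = row["inputs"], row["targets"]   # fields listed above
            print(prompt[:60], "->", response[:60])
            break
    ```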

    Considerations for Using the Data

    The dataset may be biased by its human annotators. You should review the dataset and select or remove instructions before training a model; use it at your own risk.

    Licensing Information

    CC-BY-SA 4.0

  6. DBPedia_Classes

    • huggingface.co
    Updated Jun 8, 2016
    Cite
    Willem (2016). DBPedia_Classes [Dataset]. https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 8, 2016
    Authors
    Willem
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    About Dataset DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of the data (after cleaning, kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 Wikipedia articles. There are 3 levels, with 9, 70 and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher… See the full description on the dataset page: https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes.
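
    A hedged sketch of peeking at the records to see how the three class levels are exposed (the split name is an assumption, and the listing does not give the column names for the 9/70/219-class levels):

    ```python
    from datasets import load_dataset

    dbpedia = load_dataset("DeveloperOats/DBPedia_Classes", split="train", streaming=True)

    for row in dbpedia:
        # Print the whole first record to discover the actual text and label columns.
        print(row)
        break
    ```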

  7. enwiki_structured_content

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    Lung-Chuan Chen (2025). enwiki_structured_content [Dataset]. https://huggingface.co/datasets/Blaze7451/enwiki_structured_content
    Explore at:
    Dataset updated
    Jul 15, 2025
    Authors
    Lung-Chuan Chen
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for enwiki_structured_content

      Dataset Description
    

    This dataset is derived from the early official Wikipedia release, downloaded from the en subset of Wikipedia Structured Contents. Articles were converted to Markdown.

  8. RAG_docset_wiki

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Aziz Dhouib (2025). RAG_docset_wiki [Dataset]. https://huggingface.co/datasets/azizdh00/RAG_docset_wiki
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Aziz Dhouib
    Description

    Dataset Description

    This dataset contains 348,854 Wikipedia articles.

      Dataset Structure
    

    The dataset follows a simple structure with two fields:

    • text: The content of the Wikipedia article
    • source: The source identifier (e.g., "Wikipedia:Albedo")

      Format
    

    The dataset is provided in JSONL format, where each line contains a JSON object with the above fields. Example: { "text": "Albedo is the fraction of sunlight that is reflected by a surface...", "source":… See the full description on the dataset page: https://huggingface.co/datasets/azizdh00/RAG_docset_wiki.
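
    A minimal sketch of parsing the JSONL file described above (the local file name is hypothetical; the "text" and "source" fields come from the structure section):

    ```python
    import json

    # Hypothetical local file name for the downloaded JSONL release.
    with open("rag_docset_wiki.jsonl", encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            print(doc["source"], len(doc["text"]))   # e.g. "Wikipedia:Albedo" and article length
            break
    ```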

  9. yue-wiki-pl-bert

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    hon9kon9ize (2025). yue-wiki-pl-bert [Dataset]. https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert
    Explore at:
    Dataset updated
    Apr 6, 2025
    Dataset authored and provided by
    hon9kon9ize
    Description

    Yue-Wiki-PL-BERT Dataset

      Overview
    

    This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset is created by hon9kon9ize and contains approximately 176,177 rows of training data.

      Description
    

    The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:

    id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.

  10. megawika-2

    • huggingface.co
    Updated Aug 7, 2025
    Cite
    Center for Language and Speech Processing @ JHU (2025). megawika-2 [Dataset]. https://huggingface.co/datasets/jhu-clsp/megawika-2
    Explore at:
    Dataset updated
    Aug 7, 2025
    Dataset authored and provided by
    Center for Language and Speech Processing @ JHU
    Description

    MegaWika 2

    MegaWika 2 is an improved multi- and cross-lingual text dataset containing a structured view of Wikipedia, eventually covering 50 languages, including cleanly extracted content from all cited web sources. The initial data release is based on Wikipedia dumps from May 1, 2024. In total, the data contains about 77 million articles and 71 million scraped web citations. The English collection, the largest, contains about 10 million articles and 24 million scraped web… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/megawika-2.

  11. TiEBe

    • huggingface.co
    Cite
    Timely Events Benchmark, TiEBe [Dataset]. https://huggingface.co/datasets/TimelyEventsBenchmark/TiEBe
    Explore at:
    Dataset authored and provided by
    Timely Events Benchmark
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for TiEBe

      Dataset Summary
    

    TiEBe (Timely Events Benchmark) is a large-scale dataset designed to assess the factual recall and regional knowledge representation of large language models (LLMs) concerning significant global and regional events. It contains over 23,000 question–answer pairs covering more than 10 years (Jan 2015 - Apr 2025) of events, across 23 geographic regions and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to… See the full description on the dataset page: https://huggingface.co/datasets/TimelyEventsBenchmark/TiEBe.
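
    A rough sketch of an evaluation loop over the question-answer pairs (the split and the "question"/"answer" column names are assumptions not confirmed by the card, and the model function is a stand-in):

    ```python
    from datasets import load_dataset

    tiebe = load_dataset("TimelyEventsBenchmark/TiEBe", split="train", streaming=True)

    def my_model(prompt: str) -> str:
        """Stand-in for a real LLM call."""
        return "stub answer"

    for i, row in enumerate(tiebe):
        question, reference = row.get("question"), row.get("answer")
        print(question, "->", my_model(question), "| reference:", reference)
        if i >= 2:   # inspect only a handful of items
            break
    ```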

  12. finewiki

    • huggingface.co
    Cite
    LeMoussel, finewiki [Dataset]. https://huggingface.co/datasets/LeMoussel/finewiki
    Explore at:
    Authors
    LeMoussel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📚 FineWiki

      Dataset Overview
    

    FineWiki is a high-quality French-language dataset designed for pretraining and NLP tasks. It is derived from the French edition of Wikipedia using the Wikipedia Structured Contents dataset released by the Wikimedia Foundation on Kaggle. Each entry is a structured JSON line representing a full Wikipedia article, parsed and cleaned from HTML snapshots provided by Wikimedia Enterprise. The dataset has been carefully filtered and… See the full description on the dataset page: https://huggingface.co/datasets/LeMoussel/finewiki.

  13. Tamil_Thaai_Vaazhthu

    • huggingface.co
    Updated Feb 27, 2025
    Cite
    Selvakumar Duraipandian (2025). Tamil_Thaai_Vaazhthu [Dataset]. https://huggingface.co/datasets/Selvakumarduraipandian/Tamil_Thaai_Vaazhthu
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 27, 2025
    Authors
    Selvakumar Duraipandian
    Description

    தமிழ்த்தாய் வாழ்த்து (Tamil Thaai Vaazhthu) Dataset

      Overview
    

    This dataset contains the lyrics of Tamil Thaai Vaazhthu, the official state song of Tamil Nadu, as well as the Tamil Thaai Vaazhthu version used in Puducherry. The data has been sourced from Wikipedia and formatted into a structured dataset for research and educational purposes.

      Contents
    

    The dataset consists of two rows:

    • Tamil Nadu's Tamil Thaai Vaazhthu - The official state song of Tamil Nadu.
    • Puducherry's Tamil Thaai… See the full description on the dataset page: https://huggingface.co/datasets/Selvakumarduraipandian/Tamil_Thaai_Vaazhthu.

