Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
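Since the summary above says each JSON line holds one full article and the files ship as JSONL compressed as zip, here is a minimal sketch of streaming articles out of one such archive. The archive name, member layout, and the "name" field are placeholders, not the published schema; check the dataset page for the real file names and fields.

```python
import json
import zipfile

# Minimal sketch: stream articles from one zipped JSONL file.
# "enwiki_part_0.zip" is a placeholder name, not an actual file in the release.
with zipfile.ZipFile("enwiki_part_0.zip") as archive:
    member = archive.namelist()[0]          # assume one JSONL file per archive
    with archive.open(member) as jsonl:
        for raw_line in jsonl:
            article = json.loads(raw_line)  # one full article per line
            print(article.get("name"))      # field name is an assumption
            break
```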
License: other (https://choosealicense.com/licenses/other/)
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records across 10 data fields, updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article title, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains the corpora required for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper DANIEL: a fast document attention network for information extraction and labelling of handwritten documents, authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).
The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.
The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
Contents of the archive:
wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
Data format for the corpora stored in the Hugging Face datasets structure:
Each record in the datasets follows the dictionary structure below:
{
"id": "
Dataset Card for "wiki40b"
Dataset Summary
Cleaned-up text from 40+ Wikipedia language editions for pages corresponding to entities. The dataset has train/dev/test splits per language. It is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
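As a rough illustration of the summary above (per-language splits, a Wikidata id, and cleaned article text per example), here is a loading sketch; the "en" config and the field names are taken from the dataset card and may differ from the hosted version.

```python
from datasets import load_dataset

# Minimal sketch for pulling one language edition of wiki40b.
wiki40b_en = load_dataset("google/wiki40b", "en", split="train")

example = wiki40b_en[0]
print(example["wikidata_id"])  # Wikidata id of the entity (per the summary)
print(example["text"][:200])   # cleaned full-article text (field name assumed)
```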
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🪿 Han (ห่าน, "goose") Instruct Dataset is a Thai instruction dataset by PyThaiNLP. It collects Thai instruction-following examples from many sources.
Many questions were collected from the Reference desk at Thai Wikipedia.
Data sources:
Thai
The dataset may contain biases introduced by human annotators. Review the dataset and select or remove instructions before training a model; use it at your own risk.
CC-BY-SA 4.0
License: CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
About Dataset
DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of that data (after cleaning; kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 Wikipedia articles. There are 3 levels, with 9, 70, and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher… See the full description on the dataset page: https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for enwiki_structured_content
Dataset Description
This dataset is derived from the early official Wikipedia release, downloaded from the en subset of Wikipedia Structured Contents. Articles were converted to Markdown.
This dataset contains 348,854 Wikipedia articles.
Dataset Structure
The dataset follows a simple structure with two fields:
text: The content of the Wikipedia article
source: The source identifier (e.g., "Wikipedia:Albedo")
Format
The dataset is provided in JSONL format, where each line contains a JSON object with the above fields. Example: { "text": "Albedo is the fraction of sunlight that is reflected by a surface...", "source":… See the full description on the dataset page: https://huggingface.co/datasets/azizdh00/RAG_docset_wiki.
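Given the two documented fields and the JSONL layout, a minimal reading sketch follows; the local file name is a placeholder.

```python
import json

# Minimal sketch: "rag_docset_wiki.jsonl" is a placeholder file name;
# the two field names come from the structure described above.
with open("rag_docset_wiki.jsonl", encoding="utf-8") as jsonl:
    for raw_line in jsonl:
        record = json.loads(raw_line)
        print(record["source"], record["text"][:80])
```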
Yue-Wiki-PL-BERT Dataset
Overview
This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset was created by hon9kon9ize and contains approximately 176,177 rows of training data.
Description
The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:
id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.
MegaWika 2
MegaWika 2 is an improved multi- and cross-lingual text dataset containing a structured view of Wikipedia, eventually covering 50 languages, including cleanly extracted content from all cited web sources. The initial data release is based on Wikipedia dumps from May 1, 2024. In total, the data contains about 77 million articles and 71 million scraped web citations. The English collection, the largest, contains about 10 million articles and 24 million scraped web… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/megawika-2.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for TiEBe
Dataset Summary
TiEBe (Timely Events Benchmark) is a large-scale dataset designed to assess the factual recall and regional knowledge representation of large language models (LLMs) concerning significant global and regional events. It contains over 23,000 question–answer pairs covering more than 10 years (Jan 2015 - Apr 2025) of events, across 23 geographic regions and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to… See the full description on the dataset page: https://huggingface.co/datasets/TimelyEventsBenchmark/TiEBe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📚 FineWiki
Dataset Overview
FineWiki is a high-quality French-language dataset designed for pretraining and NLP tasks. It is derived from the French edition of Wikipedia using the Wikipedia Structured Contents dataset released by the Wikimedia Foundation on Kaggle. Each entry is a structured JSON line representing a full Wikipedia article, parsed and cleaned from HTML snapshots provided by Wikimedia Enterprise. The dataset has been carefully filtered and… See the full description on the dataset page: https://huggingface.co/datasets/LeMoussel/finewiki.
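A hedged sketch of streaming the dataset from the Hub; the split name "train" is an assumption, and column names are inspected rather than assumed since the card does not list them.

```python
from datasets import load_dataset

# Minimal sketch: stream FineWiki instead of downloading it in full.
# The "train" split name is an assumption.
finewiki = load_dataset("LeMoussel/finewiki", split="train", streaming=True)

first_article = next(iter(finewiki))
print(sorted(first_article.keys()))  # inspect the actual column names
```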
தமிழ்த்தாய் வாழ்த்து Dataset
Overview
This dataset contains the lyrics of Tamil Thaai Vaazhthu, the official state song of Tamil Nadu, as well as the Tamil Thaai Vaazhthu version used in Puducherry. The data has been sourced from Wikipedia and formatted into a structured dataset for research and educational purposes.
Contents
The dataset consists of two rows:
Tamil Nadu's Tamil Thaai Vaazhthu - The official state song of Tamil Nadu. Puducherry's Tamil Thaai… See the full description on the dataset page: https://huggingface.co/datasets/Selvakumarduraipandian/Tamil_Thaai_Vaazhthu.