Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
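Since the summary above says each JSON line holds one full article and the files ship as JSONL compressed as zip, here is a minimal sketch of streaming articles out of one such archive. The archive name, member layout, and the "name" field are placeholders, not the published schema; check the dataset page for the real file names and fields.

```python
import json
import zipfile

# Minimal sketch: stream articles from one zipped JSONL file.
# "enwiki_part_0.zip" is a placeholder name, not an actual file in the release.
with zipfile.ZipFile("enwiki_part_0.zip") as archive:
    member = archive.namelist()[0]          # assume one JSONL file per archive
    with archive.open(member) as jsonl:
        for raw_line in jsonl:
            article = json.loads(raw_line)  # one full article per line
            print(article.get("name"))      # field name is an assumption
            break
```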
License: other (https://choosealicense.com/licenses/other/)
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records across 10 data fields, updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article title, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains the corpora required for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper DANIEL: a fast document attention network for information extraction and labelling of handwritten documents, authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).
The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.
The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
Contents of the archive:
wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
Data format for the corpora stored in the Hugging Face datasets structure:
Each record in the datasets follows the dictionary structure below:
{
"id": "
Dataset Card for "wiki40b"
Dataset Summary
Cleaned-up text from 40+ Wikipedia language editions for pages corresponding to entities. The dataset has train/dev/test splits per language. It is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
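As a rough illustration of the summary above (per-language splits, a Wikidata id, and cleaned article text per example), here is a loading sketch; the "en" config and the field names are taken from the dataset card and may differ from the hosted version.

```python
from datasets import load_dataset

# Minimal sketch for pulling one language edition of wiki40b.
wiki40b_en = load_dataset("google/wiki40b", "en", split="train")

example = wiki40b_en[0]
print(example["wikidata_id"])  # Wikidata id of the entity (per the summary)
print(example["text"][:200])   # cleaned full-article text (field name assumed)
```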
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🪿 Han (ห่าน, "goose") Instruct Dataset is a Thai instruction dataset by PyThaiNLP. It collects Thai instruction-following examples from many sources.
Many questions were collected from the Reference desk at Thai Wikipedia.
Data sources:
Thai
The dataset may contain biases introduced by human annotators. Review the dataset and select or remove instructions before training a model; use it at your own risk.
CC-BY-SA 4.0
License: CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
About Dataset
DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of that data (after cleaning; kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 Wikipedia articles. There are 3 levels, with 9, 70, and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher… See the full description on the dataset page: https://huggingface.co/datasets/DeveloperOats/DBPedia_Classes.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for enwiki_structured_content
Dataset Description
This dataset is derived from the early official Wikipedia release, downloaded from the en subset of Wikipedia Structured Contents. Articles were converted to Markdown.
This dataset contains 348,854 Wikipedia articles.
Dataset Structure
The dataset follows a simple structure with two fields:
text: The content of the Wikipedia article
source: The source identifier (e.g., "Wikipedia:Albedo")
Format
The dataset is provided in JSONL format, where each line contains a JSON object with the above fields. Example: { "text": "Albedo is the fraction of sunlight that is reflected by a surface...", "source":… See the full description on the dataset page: https://huggingface.co/datasets/azizdh00/RAG_docset_wiki.
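Given the two documented fields and the JSONL layout, a minimal reading sketch follows; the local file name is a placeholder.

```python
import json

# Minimal sketch: "rag_docset_wiki.jsonl" is a placeholder file name;
# the two field names come from the structure described above.
with open("rag_docset_wiki.jsonl", encoding="utf-8") as jsonl:
    for raw_line in jsonl:
        record = json.loads(raw_line)
        print(record["source"], record["text"][:80])
```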
Yue-Wiki-PL-BERT Dataset
Overview
This dataset contains processed text data from Cantonese Wikipedia articles, specifically formatted for training or fine-tuning BERT-like models for Cantonese language processing. The dataset was created by hon9kon9ize and contains approximately 176,177 rows of training data.
Description
The Yue-Wiki-PL-BERT dataset is a structured collection of Cantonese text data extracted from Wikipedia, with each entry containing:
id: A… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/yue-wiki-pl-bert.
MegaWika 2
MegaWika 2 is an improved multi- and cross-lingual text dataset containing a structured view of Wikipedia, eventually covering 50 languages, including cleanly extracted content from all cited web sources. The initial data release is based on Wikipedia dumps from May 1, 2024. In total, the data contains about 77 million articles and 71 million scraped web citations. The English collection, the largest, contains about 10 million articles and 24 million scraped web… See the full description on the dataset page: https://huggingface.co/datasets/jhu-clsp/megawika-2.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for TiEBe
Dataset Summary
TiEBe (Timely Events Benchmark) is a large-scale dataset designed to assess the factual recall and regional knowledge representation of large language models (LLMs) concerning significant global and regional events. It contains over 23,000 question–answer pairs covering more than 10 years (Jan 2015 - Apr 2025) of events, across 23 geographic regions and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to… See the full description on the dataset page: https://huggingface.co/datasets/TimelyEventsBenchmark/TiEBe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📚 FineWiki
Dataset Overview
FineWiki is a high-quality French-language dataset designed for pretraining and NLP tasks. It is derived from the French edition of Wikipedia using the Wikipedia Structured Contents dataset released by the Wikimedia Foundation on Kaggle. Each entry is a structured JSON line representing a full Wikipedia article, parsed and cleaned from HTML snapshots provided by Wikimedia Enterprise. The dataset has been carefully filtered and… See the full description on the dataset page: https://huggingface.co/datasets/LeMoussel/finewiki.
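A hedged sketch of streaming the dataset from the Hub; the split name "train" is an assumption, and column names are inspected rather than assumed since the card does not list them.

```python
from datasets import load_dataset

# Minimal sketch: stream FineWiki instead of downloading it in full.
# The "train" split name is an assumption.
finewiki = load_dataset("LeMoussel/finewiki", split="train", streaming=True)

first_article = next(iter(finewiki))
print(sorted(first_article.keys()))  # inspect the actual column names
```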
தமிழ்த்தாய் வாழ்த்து Dataset
Overview
This dataset contains the lyrics of Tamil Thaai Vaazhthu, the official state song of Tamil Nadu, as well as the Tamil Thaai Vaazhthu version used in Puducherry. The data has been sourced from Wikipedia and formatted into a structured dataset for research and educational purposes.
Contents
The dataset consists of two rows:
Tamil Nadu's Tamil Thaai Vaazhthu - The official state song of Tamil Nadu. Puducherry's Tamil Thaai… See the full description on the dataset page: https://huggingface.co/datasets/Selvakumarduraipandian/Tamil_Thaai_Vaazhthu.