Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Summary: Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes. Inviting feedback.
This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).
Invitation for Feedback: The dataset is built as part of the Structured Contents initiative and is based on the Wikimedia Enterprise HTML snapshots. It is an early beta release intended to improve transparency in the development process and to request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates, follow the project's blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise's homepage on Meta-wiki, or on the discussion page for this dataset here on Kaggle.
The contents of this dataset of Wikipedia articles are collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.
The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG, and testing/benchmarking. We would love to hear more about your use cases.
Data Fields: The data fields are the same across all files. Noteworthy included fields:
- name - title of the article.
- identifier - ID of the article.
- url - URL of the article.
- version - metadata related to the latest specific revision of the article.
- version.editor - editor-specific signals that can help contextualize the revision.
- version.scores - assessments by ML models on the likelihood of a revision being reverted.
- main entity - Wikidata QID the article is related to.
- abstract - lead section, summarizing what the article is about.
- description - one-sentence description of the article for quick reference.
- image - main image representing the article's subject.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

Full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
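For illustration, here is a minimal sketch of reading a few of these fields from one of the JSON Lines files. The file name is a placeholder (not an actual file from the release), and the sketch assumes infoboxes and sections are returned as lists, per the field descriptions above:

```python
import json

# Placeholder path; the actual snapshot files are large JSONL files
path = "enwiki_structured_sample.jsonl"

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article.get("name"), "-", article.get("url"))
        print("Abstract:", (article.get("abstract") or "")[:200])
        # Infoboxes and sections are nested structures parsed from the article
        print("Infoboxes:", len(article.get("infoboxes") or []))
        print("Sections:", len(article.get("sections") or []))
        break  # just inspect the first article
```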
Curation Rationale: This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects more closely together. Even if Wikipedia is very structured to the human eye, it is a non-triv...
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people articles with infoboxes on English Wikipedia, output as JSON files (compressed as tar.gz).
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
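As a rough sketch of how the tar.gz archives of JSON Lines files described above might be inspected without fully extracting them, something like the following could work; the archive name is a placeholder, not an actual file from the release:

```python
import json
import tarfile

# Placeholder archive name; see the size/stats section below for the actual contents
with tarfile.open("people_with_infoboxes.tar.gz", "r:gz") as archive:
    for member in archive:
        if not member.isfile():
            continue
        f = archive.extractfile(member)
        # Each file is JSON Lines: one article per line
        for line in f:
            article = json.loads(line)
            if article.get("infoboxes"):
                print(article.get("name"), "-", article.get("description"))
            break  # only peek at the first article per file in this demo
        break  # and only at the first file
```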
Infoboxes
- Compressed: 2 GB
- Uncompressed: 11 GB

Infoboxes + sections + short description
- Compressed: 4.12 GB
- Uncompressed: 21.28 GB

Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # of people found with QID: 1,778,226
- # of people found with Category: 158,996
- # of people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # of people articles with infoboxes: 1,559,985

End stats:
- Total number of people articles in this dataset: 1,559,985
  - that have a short description: 1,416,701
  - that have an infobox: 1,559,985
  - that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia language edition (https://en.wikipedia.org/), written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset 1: Wikipedia Article Metadata and Content Distribution (2019–2023)
This dataset represents metadata and structural information extracted from Wikipedia articles across multiple language editions between January 2019 and December 2023. The data was collected through the Wikimedia REST API and Wikidata Query Service, focusing on high-level article characteristics such as content length, number of references, topic classification, and readership activity. Each row corresponds to a unique Wikipedia article identified by an article_id and includes metadata describing its topic category (e.g., Politics, Science, Culture), geographic focus, and quality assessment.
The dataset was designed to help quantify content inequality and topic bias across languages. For example, English and German editions tend to have more extensive coverage of scientific and technological topics, while Swahili and Arabic editions show higher representation of local cultural and geographical content but fewer high-quality (“Featured Article”) designations. Article-level metrics like word_count, references_count, and page_views were gathered to provide indicators of article depth, credibility, and public engagement. The last_edit_date variable helps capture how frequently articles are updated, indicating editorial activity over time.
Temporal coverage: 2019–2023
Data sources: Wikimedia REST API, Wikidata Query Service, Pageview Analytics
Primary purpose: To analyze disparities in article depth, topic diversity, and regional focus across Wikipedia's major language editions.
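A short example of the kind of analysis this dataset is meant to support, assuming the records are loaded into pandas. The file name and the language/topic_category column names are assumptions; only article_id, word_count, references_count, page_views, and last_edit_date are named in the description above:

```python
import pandas as pd

# Hypothetical export of the article metadata described above
df = pd.read_csv("wikipedia_article_metadata_2019_2023.csv")

# Average depth and engagement indicators per language edition
summary = (
    df.groupby("language")[["word_count", "references_count", "page_views"]]
    .mean()
    .sort_values("word_count", ascending=False)
)
print(summary)

# Share of articles per topic category within each language
topic_share = (
    df.groupby(["language", "topic_category"]).size()
    .groupby(level=0)
    .transform(lambda s: s / s.sum())
)
print(topic_share.head())
```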
Dataset 2: Wikipedia Editor Demographics and Contribution Data (2018–2023)
This dataset summarizes demographic and contribution patterns of active Wikipedia editors from 2018 to 2023, based on public edit histories available through the Wikimedia Dumps and MediaWiki API. Each record corresponds to a unique editor identified by editor_id, containing attributes such as country, primary language of editing, total edit counts, and dominant topic area.
Although Wikipedia does not directly record personal information, country and language data were inferred using IP-based geolocation for anonymous edits and user-declared data for registered contributors. The dataset was sampled to capture editors across seven major languages (English, French, Spanish, German, Swahili, Arabic, and Chinese). Demographic variables like gender and education_level are approximations derived from community surveys conducted by the Wikimedia Foundation in 2019 and 2021, used here to represent broad participation trends rather than individual identities.
This dataset provides insight into editorial imbalance, highlighting, for example, that editors from Europe and North America contribute disproportionately more to technical and scientific topics compared to those from Africa or South America. Fields such as total_edits, articles_edited, and avg_edit_size reflect productivity and depth of engagement, while active_since helps trace editor retention and historical participation.
Temporal coverage: 2018–2023
Data sources: Wikimedia Dumps, MediaWiki API, Wikimedia Community Surveys (2019, 2021)
Primary purpose: To analyze demographic participation gaps and editing activity distribution across languages and regions.
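Similarly, a hedged sketch of summarizing editor contributions by country and language, assuming the records are loaded into pandas. The file name, and the primary_language/dominant_topic column names, are assumptions; editor_id, country, total_edits, articles_edited, avg_edit_size, and active_since are the fields named above:

```python
import pandas as pd

# Hypothetical export of the editor records described above
editors = pd.read_csv("wikipedia_editor_contributions_2018_2023.csv")

# Total and average edit activity per country
by_country = editors.groupby("country").agg(
    editors=("editor_id", "nunique"),
    total_edits=("total_edits", "sum"),
    avg_edit_size=("avg_edit_size", "mean"),
)
print(by_country.sort_values("total_edits", ascending=False).head(10))

# Which topic areas dominate per primary editing language
topic_by_language = (
    editors.groupby(["primary_language", "dominant_topic"]).size().unstack(fill_value=0)
)
print(topic_by_language)
```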
Dataset 3: Wikipedia Language and Geographic Coverage Statistics (2023)
This dataset presents aggregated statistics at the language edition level, representing Wikipedia’s overall content and contributor structure as of December 2023. The data was compiled from the Wikimedia Statistics Portal and Meta-Wiki language reports, which provide high-level metrics such as total number of articles, average article length, number of active editors, and editing intensity per language.
Each entry represents one Wikipedia language edition, capturing its global footprint and coverage balance. The column coverage_score is a composite index derived from article volume, diversity of covered topics, and proportional representation of countries and regions. underrepresented_regions indicates the number of global regions (out of ten defined by the UN geoscheme) that have low coverage or minimal article representation in that language edition. The dataset allows researchers to identify which language Wikipedias most effectively cover global topics and which remain regionally or linguistically constrained.
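A short sketch of ranking language editions by the composite coverage_score and flagging those with many underrepresented regions; the file name and the language column are assumptions, while coverage_score and underrepresented_regions are the columns described above:

```python
import pandas as pd

# Hypothetical export of the per-language statistics described above
langs = pd.read_csv("wikipedia_language_coverage_2023.csv")

# Rank editions by the composite coverage index
ranked = langs.sort_values("coverage_score", ascending=False)
print(ranked[["language", "coverage_score", "underrepresented_regions"]].head(10))

# Editions with low coverage of at least half of the ten UN geoscheme regions
constrained = langs[langs["underrepresented_regions"] >= 5]
print(f"{len(constrained)} editions leave 5+ regions underrepresented")
```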
Dataset Card for Speech Wikimedia
Dataset Summary
The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial usage under CC and public-domain licenses. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file should have one or more transcriptions in different languages.
Transcription languages
English German… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
wikipedia persons masked: a filtered version of the Wikipedia dataset containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a
Supported Tasks and Leaderboards
The dataset supports the tasks of fill-mask, but can also be used for other tasks such as question answering, e.g. "Who is
Languages
English only
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
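A minimal, hedged sketch of loading this dataset with the Hugging Face datasets library and inspecting its columns; the default configuration and the "train" split name are assumptions, not taken from the truncated card above:

```python
from datasets import load_dataset

# Assumes the dataset loads with its default configuration and a "train" split
ds = load_dataset("rcds/wikipedia-persons-masked", split="train")

print(ds)               # number of rows and features
print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at the first masked person page
```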
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
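To illustrate the stated methodology (extracting the content added in each revision by diffing consecutive revisions), here is a small, self-contained sketch using Python's difflib; the revision texts are invented for the example and the real pipeline is documented in the linked wiki and paper:

```python
import difflib

def added_content(old_text: str, new_text: str) -> list[str]:
    """Return the lines that a revision added relative to the previous one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

# Invented example revisions of a talk page
rev1 = "== Sources ==\nThe article needs better sources."
rev2 = "== Sources ==\nThe article needs better sources.\nI added two citations today. ~~~~"

print(added_content(rev1, rev2))
# ['I added two citations today. ~~~~']
```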
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities, Countries, Universities, and Novels.
simple-wikipedia
Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy. The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference, and contains 30,272 articles gathered on 2015/02/05. The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
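For readers unfamiliar with the sampling strategy mentioned above, here is a generic illustration of a 90/10 stratified split using scikit-learn; the DataFrame is a placeholder standing in for the published files, with class labels matching English Wikipedia's assessment classes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame standing in for the articles with their quality labels
articles = pd.DataFrame({
    "revision_id": range(1000),
    "quality_class": (["FA", "GA", "B", "C", "Start", "Stub"] * 167)[:1000],
})

# 90% training / 10% test, preserving the class proportions in each split
train_df, test_df = train_test_split(
    articles,
    test_size=0.10,
    stratify=articles["quality_class"],
    random_state=42,
)
print(train_df["quality_class"].value_counts(normalize=True))
print(test_df["quality_class"].value_counts(normalize=True))
```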
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Visit the Wikimedia Dumps page: https://dumps.wikimedia.org/

Select the desired Wikipedia version: for English Wikipedia, open the enwiki directory.

Download the latest dump:
- enwiki-latest-pages-articles.xml.bz2: contains the current versions of article content.
- enwiki-latest-pages-meta-current.xml.bz2: contains current versions of article content, including page metadata.

Handling large files: use bzip2 to decompress .bz2 files.

Parsing the dump: use xml.etree.ElementTree for XML parsing, or WikiExtractor, a Python script designed to extract and clean text from Wikipedia XML dumps.

Example of Download and Parsing
Here's an example of how you might use Python to download and parse a Wikipedia dump:
1. Download the compressed dump with requests:

```python
import requests

url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'

# Stream the download so the multi-gigabyte archive is written to disk in chunks
response = requests.get(url, stream=True)
with open('enwiki-latest-pages-articles.xml.bz2', 'wb') as file:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # skip keep-alive chunks
            file.write(chunk)
```
2. Decompress and parse using WikiExtractor:

```bash
# First, ensure you have WikiExtractor installed
pip install wikiextractor

# Run WikiExtractor to process the dump
wikiextractor enwiki-latest-pages-articles.xml.bz2
```
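Alternatively, if you prefer the xml.etree.ElementTree route mentioned above, a minimal sketch of streaming pages straight from the compressed dump might look like this; note that the MediaWiki export namespace version can differ between dumps, so the hard-coded namespace here is an assumption:

```python
import bz2
import xml.etree.ElementTree as ET

# Namespace used by MediaWiki export XML; the exact version string may differ
# between dumps, so in practice you may want to detect it dynamically.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

count = 0
# iterparse streams the file, so the whole dump never has to fit in memory
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            text = elem.findtext(f'{NS}revision/{NS}text') or ''
            print(title, len(text))
            count += 1
            if count >= 5:  # stop after a few pages for this demo
                break
            elem.clear()  # free memory for pages we've already processed
```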
Important Notes
- Ensure you have appropriate storage and processing power to handle large datasets.
- Parsing and processing Wikipedia dumps can be resource-intensive, so plan accordingly.
- Always check the licensing and usage terms for Wikipedia content to ensure compliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, and particularly the sub-titles and paragraphs, are kept in these datasets.
Wines: The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.

Movies: The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.

Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are Grand Theft Auto - Mafia and Burnout Paradise - Forza Horizon 3.
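As an illustration of how these ground-truth lists might be used for evaluation (the function and the example items are hypothetical, not part of the datasets), one could compute a simple recall@k for a recommender's output against the expert annotations:

```python
def recall_at_k(recommended, ground_truth, k=10):
    """Fraction of ground-truth items that appear in the top-k recommendations."""
    top_k = set(recommended[:k])
    relevant = set(ground_truth)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

# Hypothetical example loosely based on the wine domain described above
recommended = ["Moët & Chandon", "Chardonnay", "Krug", "Bollinger"]
ground_truth = ["Moët & Chandon", "Dom Ruinart"]
print(recall_at_k(recommended, ground_truth, k=10))  # 0.5
```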
GFDL: https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

```python
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

dataset = load_dataset(
    "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
)

data = Dataset.from_dict({})
for i, entry in …
```

See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
📚 Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.
🔍 What’s Inside:
This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.
Each line in the JSONL file is a JSON object with the following fields:
- file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
- file_index: the numeric row index of the file
- name: the Wikipedia article title or identifier
- url: a link to the full article on Wikipedia
- description: a short description or abstract of the article (when available)
🛠 Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.
⚡️ Benefits: - Lightweight (~MBs vs. GBs) - Easy to load and search - Great for indexing, previewing, and subsetting the Wikimedia dataset - Saves time, bandwidth, and compute resources
📎 Example Usage (Python):

```python
import kagglehub
import json
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from datetime import datetime
import re

def read_jsonl(file_path, max_records=None):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(tqdm(f)):
            if max_records and i >= max_records:
                break
            data.append(json.loads(line))
    return data

file_path = kagglehub.dataset_download(
    "mehranism/wikimedia-structured-dataset-navigator-jsonl",
    path="wiki_structured_dataset_navigator.jsonl",
)
data = read_jsonl(file_path)
print(f"Successfully loaded {len(data)} records")

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("Columns in the dataset:")
for col in df.columns:
    print(f"- {col}")
```
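Continuing from the snippet above, here is one way you might search the index by keyword using the name, description, and file_name fields documented earlier; the keyword and the matching logic are just an illustration:

```python
keyword = "quantum"  # hypothetical search term

# Case-insensitive match against article names and descriptions
mask = (
    df["name"].str.contains(keyword, case=False, na=False)
    | df["description"].str.contains(keyword, case=False, na=False)
)
matches = df[mask]

# The file_name column tells you which source files to download
print(matches[["name", "file_name"]].head())
print("Files to fetch:", sorted(matches["file_name"].unique()))
```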
This dataset is perfect for developers working on:
- Retrieval-Augmented Generation (RAG)
- Large Language Model (LLM) fine-tuning
- Search and filtering pipelines
- Academic research on structured Wikipedia content
💡 Tip:
Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.
📃 Format:
- File: `wiki_structured_dataset_navigator.jsonl`
- Format: JSON Lines (1 object per line)
- Encoding: UTF-8
---
### **Tags**
wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning
CC0: Public Domain Dedication
(Recommended for open indexing tools with no sensitive data.)
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
A comprehensive Wikipedia dataset containing 15,320 pages with 5.6 million links, collected using breadth-first search crawling algorithm. This dataset includes complete page metadata, link relationships, and a network graph representation suitable for network analysis, graph algorithms, NLP research, and machine learning applications.
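A minimal sketch of how such a link network might be loaded for graph analysis; the file name and column names below are hypothetical, not taken from the dataset documentation:

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list with one row per link between two Wikipedia pages
edges = pd.read_csv("wikipedia_links.csv")  # columns assumed: source, target

G = nx.DiGraph()
G.add_edges_from(edges[["source", "target"]].itertuples(index=False, name=None))

print(G.number_of_nodes(), "pages,", G.number_of_edges(), "links")

# Example analysis: rank pages by PageRank
pagerank = nx.pagerank(G)
top = sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)[:10]
for page, score in top:
    print(f"{page}: {score:.5f}")
```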
Species pages extracted from the Spanish Wikipedia article XML dump from 2022-08-01. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Bilingual (EN-BS) corpus acquired from Wikipedia in the health and COVID-19 domain (2 May 2020).
With this feature, the user can extend CSV datasets with existing information in the Wikidata KG. The tool applies entity linking to all concepts in the same column and enables the user to use the extracted entities to extend the dataset.
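The description above doesn't specify the tool's internals, but as a rough sketch of the underlying idea (linking the values of a CSV column to Wikidata entities), one might call the public Wikidata search API like this; the column values are invented for the example:

```python
import requests

def link_to_wikidata(label, language="en"):
    """Return the top Wikidata entity match (QID, label, description) for a label."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": label,
            "language": language,
            "format": "json",
            "limit": 1,
        },
        timeout=10,
    )
    results = resp.json().get("search", [])
    return results[0] if results else None

# Hypothetical column values from a CSV dataset
for value in ["Sarajevo", "Wikipedia", "COVID-19"]:
    match = link_to_wikidata(value)
    if match:
        print(value, "->", match["id"], "-", match.get("description", ""))
```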
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset comprises raw data extracted from Wikipedia, encompassing various types of content including articles, metadata, and user interactions. The dataset is in its unprocessed form, providing an excellent opportunity for data enthusiasts and professionals to engage in data cleaning and preprocessing tasks. It is ideal for those looking to practice and enhance their data cleaning skills, as well as for researchers and developers who require a rich and diverse corpus for natural language processing (NLP) projects.