100+ datasets found
  1. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
  2. wikipedia

    • tensorflow.org
    • huggingface.co
  3. wikipedia_markdown

    • huggingface.co
    Updated Jan 20, 2025
    + more versions
  4. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
  5. Wikipedia Dataset

    • kaggle.com
    zip
    Updated Oct 13, 2025
  6. Turkish Wikipedia Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2024
  7. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
  8. A Wikipedia dataset of 5 categories

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
  9. Wikimedia - Datasets - OpenData.eol.org

    • opendata.eol.org
    Updated Oct 28, 2017
  10. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
  11. Dataset Wikipedia

    • figshare.com
    txt
    Updated Jul 9, 2021
  12. Data from: English Wikipedia - Species Pages

    • gbif.org
    • gbif-north-america.org
    Updated Aug 23, 2022
  13. Wikipedia Talk Corpus

    • figshare.com
    • kaggle.com
    application/x-gzip
    Updated Jan 23, 2017
  14. arabic wikipedia dump 2021

    • kaggle.com
    zip
    Updated Feb 25, 2021
  15. Raw Wikipedia

    • kaggle.com
    zip
    Updated May 21, 2024
  16. wit_base

    • huggingface.co
  17. Chinese Wikipedia 2024

    • kaggle.com
    zip
    Updated Dec 13, 2024
  18. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +3 more
    pdf, tsv
    Updated Jul 17, 2024
  19. Wiki-talk Datasets

    • zenodo.org
    • data.europa.eu
    application/gzip
    Updated Jan 24, 2020
  20. Wikidata Explorer Feature - Dataset - LDM

    • service.tib.eu
    Updated Jul 16, 2024
Cite
Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents

Wikipedia Structured Contents

Pre-parsed English and French Wikipedia Articles, Including Infoboxes

Available download formats: zip (25,121,685,657 bytes, ≈25 GB)
Dataset updated
Apr 11, 2025
Dataset provided by
Wikimedia Foundation (http://www.wikimedia.org/)
Authors
Wikimedia
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dataset Summary: An early beta release of pre-parsed English and French Wikipedia articles, including infoboxes. Feedback is invited.

This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article, stripped of extra markup and non-prose sections (references, etc.).
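Because each line of the dump is one complete article, the file can be streamed record by record without loading the whole dump into memory. A minimal sketch, assuming the extracted dump is a local JSON Lines file (the exact filenames inside the Kaggle zip are not specified here):

```python
import json

def iter_articles(path):
    """Yield one parsed article per line of a JSON Lines dump.

    Blank lines are skipped; each remaining line is assumed to be
    a single JSON object holding one full article.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage would look like `for article in iter_articles("enwiki.jsonl"): ...`, where `enwiki.jsonl` stands in for whatever filename the download actually contains.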

Invitation for Feedback: The dataset is built as part of the Structured Contents initiative and is based on the Wikimedia Enterprise HTML snapshots. It is an early beta release, intended to improve transparency in the development process and to gather feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes, and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates, follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

The contents of this dataset of Wikipedia articles are collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

Data Fields The fields are the same across all records. Noteworthy fields include:

    • name - title of the article
    • identifier - ID of the article
    • url - URL of the article
    • version - metadata for the latest revision of the article
    • version.editor - editor-specific signals that can help contextualize the revision
    • version.scores - assessments by ML models of the likelihood that a revision will be reverted
    • main entity - Wikidata QID the article is related to
    • abstract - lead section, summarizing what the article is about
    • description - one-sentence description of the article for quick reference
    • image - main image representing the article's subject
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article
    • sections - parsed sections of the article, including links

Note: excludes other media/images, lists, tables, and references or similar non-prose sections. The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/

Curation Rationale This dataset was created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects more closely together. Even if Wikipedia is very structured to the human eye, it is a non-triv...
