100+ datasets found
  1. wikipedia

    • tensorflow.org
    • huggingface.co
  2. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
  3. Wikipedia Plaintext (2023-07-01)

    • kaggle.com
    Updated Jul 17, 2023
  4. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
  5. Wikipedia Talk Corpus

    • figshare.com
    • kaggle.com
    application/x-gzip
    Updated Jan 23, 2017
  6. wikipedia-persons-masked

    • huggingface.co
    Updated May 23, 2009
  7. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
  8. Plain text Wikipedia (SimpleEnglish)

    • kaggle.com
    zip
    Updated Apr 1, 2024
  9. wiki40b

    • tensorflow.org
    • opendatalab.com
    • +1 more
    Updated Aug 30, 2023
    + more versions
  10. English Wikipedia Quality Asssessment Dataset

    • figshare.com
    application/bzip2
    Updated May 31, 2023
  11. Wikipedia Corpus - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 16, 2024
  12. Title and subtitles of Wikipedia articles

    • data.4tu.nl
    zip
    Updated Jun 6, 2017
  13. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
  14. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
  15. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2 more
    pdf, tsv
    Updated Jul 17, 2024
  16. Long document similarity datasets, Wikipedia excerptions for movies, video...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    csv
    Updated Apr 6, 2024
    + more versions
  17. wikipedia-22-12-simple-embeddings

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    + more versions
  18. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
  19. Armenian wikipedia (hywiki) XML dumps - Dataset - Data Catalog Armenia

    • data.opendata.am
    Updated Apr 6, 2023
  20. Wikipedia Dataset

    • universe.roboflow.com
    zip
    Updated Jul 10, 2025
Cite
wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia

wikipedia

39 scholarly articles cite this dataset (View in Google Scholar)
Description

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
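The kind of cleaning described above can be illustrated with a small, self-contained sketch. This is not the dataset's actual preprocessing code: the function name clean_article, the UNWANTED_SECTIONS list, and the regular expressions are all assumptions made for illustration, and they cover only a few common wiki markup forms.

```python
import re

# Hypothetical list of section titles to drop; the dataset's real list is not
# stated in the description above.
UNWANTED_SECTIONS = {"references", "external links", "see also"}


def clean_article(wikitext: str) -> str:
    """Toy cleaner: drop unwanted sections, then strip some wiki markup."""
    kept = []
    skipping = False
    for line in wikitext.splitlines():
        # Match headings such as "== Title ==" or "=== Title ===".
        heading = re.fullmatch(r"\s*(={2,})\s*(.*?)\s*\1\s*", line)
        if heading:
            # Each heading decides whether the lines after it are kept.
            skipping = heading.group(2).lower() in UNWANTED_SECTIONS
            if not skipping:
                kept.append(heading.group(2))
            continue
        if not skipping:
            kept.append(line)
    text = "\n".join(kept)
    # Remove self-closing and paired <ref> tags together with their content.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Turn [[target|label]] into label and [[target]] into target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop the quote runs used for bold and italic text.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


sample = """'''Ada Lovelace''' was an English [[mathematician]] and [[writer|author]].<ref>Some source</ref>

== Work ==
She wrote notes on the [[Analytical Engine]].

== References ==
* Some source
"""

print(clean_article(sample))
```

A production cleaner would more likely rely on a dedicated wikitext parser, since templates, tables, and nested markup are not handled by simple patterns like these.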

To use this dataset:

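import tensorflow_datasets as tfds

# Load the training split; the data is downloaded and prepared on first use.
ds = tfds.load('wikipedia', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)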

See the guide for more information on tensorflow_datasets.
