100+ datasets found
  1. wikidata-all

    • huggingface.co
    Updated Mar 13, 2024
    Cite: Wikimedia Movement (2024). wikidata-all [Dataset]. https://huggingface.co/datasets/Wikimedians/wikidata-all
    Dataset provided by: Wikimedia movement (https://wikimedia.org/)
    Authors: Wikimedia Movement
    License: CC0 1.0, https://choosealicense.com/licenses/cc0-1.0/

    Description

    Wikidata - All Entities

    This Hugging Face Data Set contains the entirety of Wikidata as of the date listed below. Wikidata is a freely licensed structured knowledge graph following the wiki model of user contributions. If you build on this data please consider contributing back to Wikidata. For more on the size and other statistics of Wikidata, see: Special:Statistics. Current Dump as of: 2024-03-04

      Original Source
    

    The data contained in this repository is retrieved… See the full description on the dataset page: https://huggingface.co/datasets/Wikimedians/wikidata-all.
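    To see what the repository ships before committing to a download of the full dump, one can list its files with the huggingface_hub client. This is a minimal sketch, not part of the dataset's documentation; the file names and formats are whatever the dataset page actually lists.

    # Sketch: list the files in the dataset repository and optionally fetch one.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = list_repo_files("Wikimedians/wikidata-all", repo_type="dataset")
    print(files[:10])

    # Fetch a single file once you know which one you need (the index here is illustrative):
    # local_path = hf_hub_download("Wikimedians/wikidata-all", filename=files[0], repo_type="dataset")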

  2. Wikipedia Plaintext (2023-07-01)

    • kaggle.com
    Updated Jul 17, 2023
    Cite: JJ (2023). Wikipedia Plaintext (2023-07-01) [Dataset]. https://www.kaggle.com/datasets/jjinho/wikipedia-20230701
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: JJ
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/

    Description

    While other great datasets containing Wikipedia exist, the latest one dates from 2020, and so this is an updated version that contains 6,286,775 articles, titles, text, and categories from the July 1st, 2023 Wikipedia dump.

    Articles are sorted in alphanumeric order and separated into parquet files corresponding to the first character of the article title. The data is partitioned into parquet files named a-z, number (titles that began with numbers), and other (titles that began with symbols).
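    Given that layout, a reader can load just the partition matching a title's first character. The sketch below uses pandas; the file name ("a.parquet") and column name ("title") are assumptions based on the description above, so check the dataset's file listing before relying on them.

    # Sketch: load one alphabetical partition of the dump with pandas.
    import pandas as pd

    df = pd.read_parquet("a.parquet")  # articles whose titles start with "a" (assumed file name)
    print(len(df), "articles in this partition")
    print(df["title"].head())          # column name assumed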

    The best place to see it in action is: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam

    If you find this dataset helpful, please upvote!

  3. wikipedia

    • huggingface.co
    Updated Feb 21, 2023
    Cite: Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
    Dataset authored and provided by: Online Language Modelling
    License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/

    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
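    The dataset exposes one split per language, built from a given dump date. The sketch below follows the usage pattern shown on the dataset page; the language/date arguments and the date value are assumptions to verify there, and newer versions of the datasets library may require trust_remote_code=True for script-based datasets.

    # Sketch: load one language split; the date is a placeholder, pick a real dump date.
    from datasets import load_dataset

    wiki = load_dataset("olm/wikipedia", language="en", date="20230301", split="train")
    print(wiki[0]["title"])
    print(wiki[0]["text"][:200])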

  4. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
    Cite: Jordan Clive (2017). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
    Authors: Jordan Clive
    Description

    Dataset Summary

    This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.

  5. Wikipedia Talk Corpus

    • figshare.com
    application/x-gzip
    Updated Jan 23, 2017
    Cite: Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
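    The diff-based extraction described above can be approximated with Python's standard difflib; the sketch below illustrates the idea and is not the authors' actual pipeline.

    # Sketch: keep only the lines added between two revisions of a talk page.
    import difflib

    def added_content(old_revision: str, new_revision: str) -> str:
        """Return the lines present in new_revision but not in old_revision."""
        diff = difflib.ndiff(old_revision.splitlines(), new_revision.splitlines())
        return "\n".join(line[2:] for line in diff if line.startswith("+ "))

    old = "== Heading ==\nFirst comment."
    new = "== Heading ==\nFirst comment.\nSecond comment, newly added."
    print(added_content(old, new))  # -> Second comment, newly added.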

  6. speech-wikimedia

    • huggingface.co
    Updated Aug 19, 2023
    Cite: MLCommons (2023). speech-wikimedia [Dataset]. https://huggingface.co/datasets/MLCommons/speech-wikimedia
    Dataset authored and provided by: MLCommons
    Description

    Dataset Card for Speech Wikimedia

      Dataset Summary
    

    The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial usage under CC and public-domain licenses. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file has one or more transcriptions in different languages.

      Transcription languages
    

    English German… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.

  7. Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

    • live.european-language-grid.eu
    csv
    Updated Apr 6, 2024
    Cite: (2024). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7843
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, in particular the sub-titles and paragraphs, is kept in these datasets.

    Wines: The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.

    Movies: The Wikipedia movies dataset consists of 100,385 articles describing different movies. A movie's article may contain text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.

    Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are Grand Theft Auto - Mafia and Burnout Paradise - Forza Horizon 3.

  8. English Wikipedia Quality Asssessment Dataset

    • figshare.com
    application/bzip2
    Updated May 31, 2023
    Cite: Morten Warncke-Wang (2023). English Wikipedia Quality Asssessment Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1375406.v2
    Dataset provided by: figshare
    Authors: Morten Warncke-Wang
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.

    The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05.

    The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets.

    For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
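    For readers who want a comparable split on their own labelled sample, a stratified 90%/10% split is straightforward with scikit-learn; this is a sketch of the general technique, not the exact procedure used to build these files.

    # Sketch: stratified 90/10 split over (article, quality class) pairs.
    from sklearn.model_selection import train_test_split

    articles = [f"article_{i}" for i in range(90)]        # toy stand-ins for article IDs
    labels = ["FA"] * 30 + ["GA"] * 30 + ["B"] * 30       # toy quality classes

    train_ids, test_ids, train_y, test_y = train_test_split(
        articles, labels, test_size=0.10, stratify=labels, random_state=42
    )
    print(len(train_ids), len(test_ids))  # 81 / 9, class proportions preserved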

  9. Wikipedia Corpus - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite: (2024). Wikipedia Corpus - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikipedia-corpus
    Description

    The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities, Countries, Universities, and Novels.

  10. A Wikipedia dataset of 5 categories

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite: Maitre, Julien (2020). A Wikipedia dataset of 5 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3260045
    Dataset provided by: L3i, La Rochelle University
    Authors: Maitre, Julien
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 different categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The dataset contains:

    Economy: 44,876 articles

    History: 92,041 articles

    Informatics: 25,408 articles

    Health: 22,143 articles

    Law: 9,964 articles

  11. Wikipedia Knowledge Graph dataset

    • zenodo.org
    pdf, tsv
    Updated Jul 17, 2024
    Cite: Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we have collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all in TSV format, built under a relational structure. The core of the dataset is the page file; around it there are 4 files with different entities related to Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as "intermediate tables", connecting the pages both with those entities and with other pages (the page_category, page_url, page_pub and page_link files).

    The document Dataset_summary includes a detailed description of the dataset.
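    As a rough illustration of how the relational files can be stitched together, the sketch below joins pages to their categories through the page_category bridge file with pandas; the join-key column names are assumptions, so consult Dataset_summary for the real schema.

    # Sketch: join pages to categories via the intermediate table (column names assumed).
    import pandas as pd

    pages = pd.read_csv("page.tsv", sep="\t")
    categories = pd.read_csv("category.tsv", sep="\t")
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    pages_with_categories = (
        page_category
        .merge(pages, on="page_id")           # assumed key
        .merge(categories, on="category_id")  # assumed key
    )
    print(pages_with_categories.head())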

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  12. Wikimedia - Datasets - OpenData.eol.org

    • opendata.eol.org
    Updated Oct 28, 2017
    Cite: eol.org (2017). Wikimedia - Datasets - OpenData.eol.org [Dataset]. https://opendata.eol.org/dataset/wikimedia
    Dataset provided by: Encyclopedia of Life (http://eol.org/)
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Wikimedia Commons is a media file repository making available public domain and freely-licensed educational media content (images, sound and video clips) to all. It acts as a common repository for the various projects of the Wikimedia Foundation, but you do not need to belong to one of those projects to use media hosted here. The repository is created and maintained not by paid-for artists but by volunteers. https://commons.wikimedia.org/wiki/Main_Page

  13. wikipedia-22-12-de-embeddings

    • huggingface.co
    Updated Aug 30, 2023
    Cite: Cohere (2023). wikipedia-22-12-de-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
    Dataset authored and provided by: Cohere (https://cohere.com/)
    License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0

    Description

    Wikipedia (de) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings
    

    We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
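    A brute-force dot-product search over the precomputed vectors looks roughly like the sketch below. The column names ("emb", "title") are assumptions based on Cohere's related wikipedia-22-12 embedding sets, and the query vector must come from the same multilingual-22-12 model via the Cohere API, which is omitted here.

    # Sketch: nearest-neighbour search by dot product over a small streamed sample.
    import numpy as np
    from datasets import load_dataset

    docs = load_dataset("Cohere/wikipedia-22-12-de-embeddings", split="train", streaming=True)
    sample = [d for _, d in zip(range(1000), docs)]   # small sample for the demo
    doc_vecs = np.array([d["emb"] for d in sample])   # column name assumed

    query_vec = np.random.rand(doc_vecs.shape[1])     # placeholder for a real query embedding
    scores = doc_vecs @ query_vec
    for idx in np.argsort(-scores)[:3]:
        print(sample[idx]["title"], float(scores[idx]))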

  14. Dataset Wikipedia

    • figshare.com
    txt
    Updated Jul 9, 2021
    Cite: Lucas Rizzo (2021). Dataset Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14939319.v1
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Lucas Rizzo
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Quantitative features extracted from Wikipedia dumps for the inference of computational trust. Dumps provided at: https://dumps.wikimedia.org/

    Files used:
    XML dump (Portuguese): ptwiki-20200820-stub-meta-history.xml
    XML dump (Italian): itwiki-20200801-stub-meta-history.xml

  15. Wikipedia Dataset

    • kaggle.com
    zip (44,391,875 bytes)
    Updated Sep 25, 2024
    Cite: JAYAPRAKASHPONDY (2024). Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/jayaprakashpondy/wikipedia-dataset
    Authors: JAYAPRAKASHPONDY
    License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by JAYAPRAKASHPONDY

    Released under CC0: Public Domain


  16. Data from: Spanish Wikipedia - Species Pages

    • gbif.org
    Updated Jun 12, 2025
    Cite: Markus Döring (2025). Spanish Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/pudryt
    Dataset provided by: Wikimedia Foundation (http://www.wikimedia.org/); Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors: Markus Döring
    Description

    Species pages extracted from the Spanish Wikipedia article XML dump from 2022-08-01. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  17. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
    Cite: Mengyi Wei (2023). Wikipedia data.tsv [Dataset]. http://doi.org/10.6084/m9.figshare.24278299.v1
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Mengyi Wei
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Using Wikipedia data to study AI ethics.

  18. Processed Wikipedia Dataset

    • zenodo.org
    bin
    Updated Jul 9, 2024
    Cite: Hao Nie (2024). Processed Wikipedia Dataset [Dataset]. http://doi.org/10.5281/zenodo.12683869
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Hao Nie
    License: MIT License, https://opensource.org/licenses/MIT
    Time period covered: Jul 2024
    Description

    We extract a subset of about 1,000,000 documents from the 2020 Wikipedia dump and extract their keywords. The file wiki_kws_dict.pkl is a map from each keyword to its total count across files and its query trend. The file wiki_doc_0.pkl contains the list of keywords for each document. Both files can be loaded with Python's pickle package.
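    Since both files are plain pickles, loading them is short; the exact structure of the loaded objects is inferred from the description above and may differ in detail.

    # Sketch: load the keyword map and the per-document keyword lists.
    import pickle

    with open("wiki_kws_dict.pkl", "rb") as f:
        keyword_map = pickle.load(f)      # keyword -> counts / query trend
    with open("wiki_doc_0.pkl", "rb") as f:
        doc_keywords = pickle.load(f)     # per-document keyword lists

    print(len(keyword_map), "keywords;", len(doc_keywords), "documents")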

  19. Wikidata Persons in Relevant Categories

    • opensanctions.org
    Updated Nov 15, 2025
    Cite: Wikidata (2025). Wikidata Persons in Relevant Categories [Dataset]. https://www.opensanctions.org/datasets/wd_categories/
    Dataset authored and provided by: Wikidata (https://wikidata.org/)
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/

    Description

    Category-based imports from Wikidata, the structured data version of Wikipedia.

  20. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite: (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print a few examples.
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
