100+ datasets found
  1. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Explore at:
    zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary: Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article, stripped of extra markup and non-prose sections (references, etc.).

    Invitation for Feedback: The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release intended to improve transparency in the development process and to request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates, follow the project’s blog and our Mediawiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles are collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields: The data fields are the same across all records. Noteworthy included fields:

    • name - title of the article.
    • identifier - ID of the article.
    • url - URL of the article.
    • version - metadata related to the latest specific revision of the article.
    • version.editor - editor-specific signals that can help contextualize the revision.
    • version.scores - assessments by ML models on the likelihood of a revision being reverted.
    • main entity - Wikidata QID the article is related to.
    • abstract - lead section, summarizing what the article is about.
    • description - one-sentence description of the article for quick reference.
    • image - main image representing the article's subject.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links.

    Note: excludes other media/images, lists, tables and references or similar non-prose sections. The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
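    Since each line of the dump is one JSON object, a record can be inspected with nothing beyond the standard library. A minimal sketch, assuming only the field names listed above (the sample record and its values are illustrative, not taken from the dataset):

```python
import json

# Illustrative record using field names from the data dictionary above;
# the values are made up for this example, not real dataset content.
sample_line = json.dumps({
    "name": "Earth",
    "identifier": 9228,
    "url": "https://en.wikipedia.org/wiki/Earth",
    "abstract": "Earth is the third planet from the Sun.",
    "description": "Third planet from the Sun",
    "infoboxes": [{"name": "Infobox planet"}],
    "sections": [{"name": "Etymology"}],
})

def parse_article(line):
    """Parse one JSONL record and pull out a few noteworthy fields."""
    article = json.loads(line)
    return {
        "title": article.get("name"),
        "summary": article.get("abstract"),
        "n_sections": len(article.get("sections", [])),
    }

parsed = parse_article(sample_line)
```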

    Curation Rationale: This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  2. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  3. wikipedia-summary-dataset

    • huggingface.co
    Updated Sep 15, 2017
    Cite
    Jordan Clive (2017). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 15, 2017
    Authors
    Jordan Clive
    Description

    Dataset Summary

    This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.

  4. Wikipedia Dataset

    • kaggle.com
    zip
    Updated Oct 13, 2025
    Cite
    Vincent Amonde (2025). Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/vincentdsc/wikipedia-dataset
    Explore at:
    zip (12363201 bytes)
    Dataset updated
    Oct 13, 2025
    Authors
    Vincent Amonde
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset 1: Wikipedia Article Metadata and Content Distribution (2019–2023)

    This dataset represents metadata and structural information extracted from Wikipedia articles across multiple language editions between January 2019 and December 2023. The data was collected through the Wikimedia REST API and Wikidata Query Service, focusing on high-level article characteristics such as content length, number of references, topic classification, and readership activity. Each row corresponds to a unique Wikipedia article identified by an article_id and includes metadata describing its topic category (e.g., Politics, Science, Culture), geographic focus, and quality assessment.

    The dataset was designed to help quantify content inequality and topic bias across languages. For example, English and German editions tend to have more extensive coverage of scientific and technological topics, while Swahili and Arabic editions show higher representation of local cultural and geographical content but fewer high-quality (“Featured Article”) designations. Article-level metrics like word_count, references_count, and page_views were gathered to provide indicators of article depth, credibility, and public engagement. The last_edit_date variable helps capture how frequently articles are updated, indicating editorial activity over time.

    Temporal coverage: 2019–2023 Data sources: Wikimedia REST API, Wikidata Query Service, Pageview Analytics Primary purpose: To analyze disparities in article depth, topic diversity, and regional focus across Wikipedia’s major language editions.
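    With columns like word_count and references_count, the per-language depth comparisons described above reduce to a groupby. A hedged sketch with toy rows (the language column name is an assumption; only the metric names are quoted in the description):

```python
import pandas as pd

# Toy rows using metric names quoted above; the "language" column name
# is an assumption, since the full schema is not shown here.
df = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "language": ["en", "en", "sw", "sw"],
    "word_count": [4200, 3100, 900, 1200],
    "references_count": [55, 40, 6, 9],
})

# Per-language medians as a rough indicator of article depth.
depth = df.groupby("language")[["word_count", "references_count"]].median()
```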

    Dataset 2: Wikipedia Editor Demographics and Contribution Data (2018–2023)

    This dataset summarizes demographic and contribution patterns of active Wikipedia editors from 2018 to 2023, based on public edit histories available through the Wikimedia Dumps and MediaWiki API. Each record corresponds to a unique editor identified by editor_id, containing attributes such as country, primary language of editing, total edit counts, and dominant topic area.

    Although Wikipedia does not directly record personal information, country and language data were inferred using IP-based geolocation for anonymous edits and user-declared data for registered contributors. The dataset was sampled to capture editors across seven major languages (English, French, Spanish, German, Swahili, Arabic, and Chinese). Demographic variables like gender and education_level are approximations derived from community surveys conducted by the Wikimedia Foundation in 2019 and 2021, used here to represent broad participation trends rather than individual identities.

    This dataset provides insight into editorial imbalance, highlighting, for example, that editors from Europe and North America contribute disproportionately more to technical and scientific topics compared to those from Africa or South America. Fields such as total_edits, articles_edited, and avg_edit_size reflect productivity and depth of engagement, while active_since helps trace editor retention and historical participation.

    Temporal coverage: 2018–2023 Data sources: Wikimedia Dumps, MediaWiki API, Wikimedia Community Surveys (2019, 2021) Primary purpose: To analyze demographic participation gaps and editing activity distribution across languages and regions.

    Dataset 3: Wikipedia Language and Geographic Coverage Statistics (2023)

    This dataset presents aggregated statistics at the language edition level, representing Wikipedia’s overall content and contributor structure as of December 2023. The data was compiled from the Wikimedia Statistics Portal and Meta-Wiki language reports, which provide high-level metrics such as total number of articles, average article length, number of active editors, and editing intensity per language.

    Each entry represents one Wikipedia language edition, capturing its global footprint and coverage balance. The column coverage_score is a composite index derived from article volume, diversity of covered topics, and proportional representation of countries and regions. underrepresented_regions indicates the number of global regions (out of ten defined by the UN geoscheme) that have low coverage or minimal article representation in that language edition. The dataset allows researchers to identify which language Wikipedias most effectively cover global topics and which remain regionally or linguistically constrained.

  5. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
    Cite
    Rahul (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2023
    Authors
    Rahul
    Description

    simple-wikipedia

    Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.

  6. Wikipedia Corpus - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 16, 2024
    Cite
    (2024). Wikipedia Corpus - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikipedia-corpus
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities, Countries, Universities, and Novels.

  7. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
    Cite
    Mengyi Wei (2023). Wikipedia data.tsv [Dataset]. http://doi.org/10.6084/m9.figshare.24278299.v1
    Explore at:
    txt
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mengyi Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Using Wikipedia data to study AI ethics.

  8. wit_base

    • huggingface.co
    Updated Jun 7, 2024
    Cite
    Wikimedia (2024). wit_base [Dataset]. https://huggingface.co/datasets/wikimedia/wit_base
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for WIT

      Dataset Summary
    

    Wikimedia's version of the Wikipedia-based Image Text (WIT) Dataset, a large multimodal multilingual dataset. From the official blog post:

    The core training data is taken from the Wikipedia Image-Text (WIT) Dataset, a large curated set of more than 37 million image-text associations extracted from Wikipedia articles in 108 languages that was recently released by Google Research. The WIT dataset offers extremely valuable data about the… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/wit_base.

  9. Wikipedia Talk Corpus

    • figshare.com
    • kaggle.com
    application/x-gzip
    Updated Jan 23, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
    Explore at:
    application/x-gzip
    Dataset updated
    Jan 23, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
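    The extraction step described above (diffing consecutive revisions and keeping only the text each revision added) can be sketched with difflib; the two revisions here are hypothetical:

```python
import difflib

# Two hypothetical revisions of a talk page.
old_rev = ["== Heading ==", "First comment."]
new_rev = ["== Heading ==", "First comment.", "A reply added in this revision."]

def added_lines(before, after):
    """Return lines present in `after` but not `before` (the diff's additions)."""
    return [line[2:] for line in difflib.ndiff(before, after)
            if line.startswith("+ ")]

additions = added_lines(old_rev, new_rev)
```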

  10. Dataset Wikipedia

    • figshare.com
    txt
    Updated Jul 9, 2021
    Cite
    Lucas Rizzo (2021). Dataset Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14939319.v1
    Explore at:
    txt
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Rizzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative features extracted from Wikipedia dumps for the inference of computational trust. Dumps provided at: https://dumps.wikimedia.org/

    Files used:

    • XML dump Portuguese: ptwiki-20200820-stub-meta-history.xml
    • XML dump Italian: itwiki-20200801-stub-meta-history.xml
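    Stub-meta-history dumps follow the MediaWiki XML export format, so revision-level features can be streamed with xml.etree.ElementTree.iterparse rather than loaded whole. A simplified sketch (real dumps carry an XML namespace and far more per-revision metadata; the toy document below is an assumption for illustration):

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Simplified stand-in for a stub-meta-history dump; real files use the
# MediaWiki export namespace and are multi-gigabyte streams.
stub_xml = """<mediawiki>
  <page>
    <title>Lisboa</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
</mediawiki>"""

revisions_per_page = {}
for _, elem in ET.iterparse(StringIO(stub_xml)):
    if elem.tag == "page":
        title = elem.findtext("title")
        revisions_per_page[title] = len(elem.findall("revision"))
        elem.clear()  # free memory, essential on full-size dumps
```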

  11. Wikipedia Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2024
    Cite
    JAYAPRAKASHPONDY (2024). Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/jayaprakashpondy/wikipedia-dataset
    Explore at:
    zip (44391875 bytes)
    Dataset updated
    Sep 25, 2024
    Authors
    JAYAPRAKASHPONDY
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by JAYAPRAKASHPONDY

    Released under CC0: Public Domain

    Contents

  12. Wizard of Wikipedia - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Nov 25, 2024
    Cite
    (2024). Wizard of Wikipedia - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wizard-of-wikipedia
    Explore at:
    Dataset updated
    Nov 25, 2024
    Description

    Wizard of Wikipedia is a recent, large-scale dataset of multi-turn knowledge-grounded dialogues between an “apprentice” and a “wizard”, who has access to information from Wikipedia documents.

  13. arabic wikipedia dump 2021

    • kaggle.com
    zip
    Updated Feb 25, 2021
    Cite
    Mohamed Fawzy (2021). arabic wikipedia dump 2021 [Dataset]. https://www.kaggle.com/datasets/z3rocool/arabic-wikipedia-dump-2021
    Explore at:
    zip (419487481 bytes)
    Dataset updated
    Feb 25, 2021
    Authors
    Mohamed Fawzy
    Description

    Dataset

    This dataset was created by Mohamed Fawzy

    Contents

  14. Raw Wikipedia

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Ismael (2024). Raw Wikipedia [Dataset]. https://www.kaggle.com/datasets/ismaeldwikat/wikipedia
    Explore at:
    zip (8575597 bytes)
    Dataset updated
    May 21, 2024
    Authors
    Ismael
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset comprises raw data extracted from Wikipedia, encompassing various types of content including articles, metadata, and user interactions. The dataset is in its unprocessed form, providing an excellent opportunity for data enthusiasts and professionals to engage in data cleaning and preprocessing tasks. It is ideal for those looking to practice and enhance their data cleaning skills, as well as for researchers and developers who require a rich and diverse corpus for natural language processing (NLP) projects.

  15. wikidata

    • kaggle.com
    zip
    Updated Apr 16, 2025
    Cite
    ABEL BIHINDA (2025). wikidata [Dataset]. https://www.kaggle.com/datasets/abelbihinda/wikidata
    Explore at:
    zip (5327 bytes)
    Dataset updated
    Apr 16, 2025
    Authors
    ABEL BIHINDA
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by ABEL BIHINDA

    Released under Apache 2.0

    Contents

  16. Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    csv
    Updated Apr 6, 2024
    + more versions
    Cite
    (2024). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7843
    Explore at:
    csv
    Dataset updated
    Apr 6, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

    Wines: The Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.

    Movies: The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.

    Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are: Grand Theft Auto - Mafia, Burnout Paradise - Forza Horizon 3.

  17. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +3 more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).
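    Given the relational layout described above, pages connect to the entity files through the "intermediate tables" with ordinary joins. A sketch with toy frames (the column names here are assumptions for illustration; the actual schema is described in the Dataset_summary document):

```python
import pandas as pd

# Toy frames mirroring the relational structure; column names are
# assumed for illustration, not taken from the real files.
page = pd.DataFrame({"page_id": [1, 2], "title": ["Graph", "Wiki"]})
category = pd.DataFrame({"category_id": [10], "category": ["Mathematics"]})
page_category = pd.DataFrame({"page_id": [1], "category_id": [10]})

# Join the intermediate table to both entity tables, star-schema style.
joined = (page_category
          .merge(page, on="page_id")
          .merge(category, on="category_id"))
```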

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  18. A Wikipedia dataset of 5 categories

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Julien Maitre (2020). A Wikipedia dataset of 5 categories [Dataset]. http://doi.org/10.5281/zenodo.3260046
    Explore at:
    zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Maitre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A subset of articles extracted from the French Wikipedia XML dump. Data published here include 5 different categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The dataset characteristics are:

    • Economy : 44'876 articles
    • History : 92'041 articles
    • Informatics : 25'408 articles
    • Health : 22'143 articles
    • Law : 9'964 articles
  19. Wiki-talk Datasets

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jun Sun; Jérôme Kunegis (2020). Wiki-talk Datasets [Dataset]. http://doi.org/10.5281/zenodo.49561
    Explore at:
    application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jun Sun; Jérôme Kunegis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.

    More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
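    A timestamped directed edge list like this is straightforward to aggregate; the sketch below assumes a plain "source target timestamp" line format (an assumption; the exact file layout is documented at the link above):

```python
from collections import defaultdict

# Hypothetical edges in "source target timestamp" form.
lines = [
    "12 34 1139924600",
    "12 56 1139924700",
    "34 12 1139925000",
]

out_degree = defaultdict(int)  # messages each user wrote
for line in lines:
    src, dst, ts = line.split()
    out_degree[src] += 1
```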

  20. Wikidata Explorer Feature - Dataset - LDM

    • service.tib.eu
    Updated Jul 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Wikidata Explorer Feature - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wikidata-explorer-feature
    Explore at:
    Dataset updated
    Jul 16, 2024
    Description

    With this feature, the user can extend CSV datasets with existing information in the Wikidata KG. The tool applies entity linking to all concepts in the same column and enables the user to use the extracted entities to extend the dataset.
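    The column-wise extension can be sketched as follows. The entity linker here is a stub, not the tool's actual API (a real implementation would query the Wikidata API); Q64 and Q90 are the genuine Wikidata QIDs for Berlin and Paris, but the lookup itself is mocked:

```python
import csv
from io import StringIO

def link_entity(label):
    """Stubbed entity linker; a real tool would query the Wikidata API."""
    known = {"Berlin": "Q64", "Paris": "Q90"}  # genuine QIDs, mocked lookup
    return known.get(label)

# Toy CSV with one column of concepts to link.
src = StringIO("city\nBerlin\nParis\n")
rows = list(csv.DictReader(src))
for row in rows:
    row["wikidata_qid"] = link_entity(row["city"])  # new, extended column
```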
