100+ datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    zip (4293465577 bytes)
    Available download formats
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes on English Wikipedia; it is output as JSON files compressed in a tar.gz archive.

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:

    • name: title of the article.
    • identifier: ID of the article.
    • image: main image representing the article's subject.
    • description: one-sentence description of the article for quick reference.
    • abstract: lead section, summarizing what the article is about.
    • infoboxes: parsed information from the side panel (infobox) on the Wikipedia article.
    • sections: parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
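
    As a rough illustration of how this archive might be read (a minimal sketch, not official tooling; it assumes the tar.gz members are newline-delimited JSON files whose objects carry the fields listed above):

    ```python
    import json
    import tarfile

    # Minimal sketch: iterate over the archive members and parse article records.
    # Assumes each member is a newline-delimited JSON file of article objects.
    with tarfile.open("wme_people_infobox.tar.gz", "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            fh = archive.extractfile(member)
            for line in fh:
                article = json.loads(line)
                # Fields described above: name, identifier, description, abstract, infoboxes, sections
                print(article.get("name"), article.get("description"))
            break  # look at the first file only in this sketch
    ```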

    Stats

    Infoboxes
    • Compressed: 2 GB
    • Uncompressed: 11 GB

    Infoboxes + sections + short description
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • People found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # people articles with infoboxes: 1,559,985

    End stats:
    • Total number of people articles in this dataset: 1,559,985
    • ...that have a short description: 1,416,701
    • ...that have an infobox: 1,559,985
    • ...that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information in it may be out of date. The dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikimedia projects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by that community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    tsv, pdf
    Available download formats
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in TSV format, and they have been built under a relational structure. The main one, which acts as the core of the dataset, is the page file; alongside it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with the latter and with one another (the page_category, page_url, page_pub and page_link files).
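
    A minimal sketch of how these relational tables might be joined (the file names follow the description above, but the join-key column names are assumptions; check the Dataset_summary document for the real schema):

    ```python
    import pandas as pd

    # Load the core table plus one entity table and its intermediate ("bridge") table.
    page = pd.read_csv("page.tsv", sep="\t")
    category = pd.read_csv("category.tsv", sep="\t")
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Attach categories to pages via the intermediate table.
    # Column names 'page_id' and 'category_id' are assumed for illustration only.
    pages_with_categories = (
        page.merge(page_category, on="page_id", how="left")
            .merge(category, on="category_id", how="left")
    )
    print(pages_with_categories.head())
    ```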

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  3. Wikipedia-Knowledge-2M

    • huggingface.co
    Updated Jul 1, 2024
    Cite
    Xinyu Chen (2024). Wikipedia-Knowledge-2M [Dataset]. https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 1, 2024
    Authors
    Xinyu Chen
    Description

    📃 Paper | 🤗 Hugging Face | ⭐ Github

      Dataset Overview
    

    In the table below, we provide a brief summary of the dataset statistics.

    Category                Size
    Total Sample            2019163
    Total Image             2019163
    Average Answer Length   84
    Maximum Answer Length   5851

      JSON Overview
    

    Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
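
    A minimal sketch of reading one record (the file name is an assumption; the keys 'id', 'image' and 'conversations' come from the description above):

    ```python
    import json

    # Load the JSON file and inspect the first record.
    with open("wikipedia_knowledge_2m.json", encoding="utf-8") as fh:
        records = json.load(fh)

    sample = records[0]
    print(sample["id"])            # unique identifier of this record
    print(sample["image"])         # image reference stored for this record
    for turn in sample["conversations"]:
        print(turn)                # the stored conversation turns
    ```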

  4. Archival Data for Page Protection: Another Missing Dimension of Wikipedia...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/P1VECE
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This dataset contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2015) "Page Protection: Another Missing Dimension of Wikipedia Research." In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846

    This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/

    Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be "protected" so that only administrators or logged-in editors with a history of good editing can edit, move, or create it. Protection might involve "full protection", where a page can only be edited by administrators (i.e., "sysops"), or "semi-protection", where a page can only be edited by accounts with a history of good edits (i.e., "autoconfirmed" users).

    Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the "Main Page" in English Wikipedia has been protected since February 2006, and all Featured Articles are protected at the time they appear on the site's main page. Millions of viewers may never edit Wikipedia because they never see an edit button.

    Despite its widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work. Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data. Much more detail can be found in our paper, Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate it if you cite the paper.

  5. Quality of Wikipedia articles by WikiRank

    • kaggle.com
    zip
    Updated Mar 18, 2025
    Cite
    Włodzimierz Lewoniewski (2025). Quality of Wikipedia articles by WikiRank [Dataset]. https://www.kaggle.com/datasets/lewoniewski/quality-of-wikipedia-articles-by-wikirank
    Explore at:
    zip (771671698 bytes)
    Available download formats
    Dataset updated
    Mar 18, 2025
    Authors
    Włodzimierz Lewoniewski
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Datasets with quality scores for 47 million Wikipedia articles across 55 language versions by WikiRank, as of 1 August 2024.

    Potential Applications:

    • Academic research: scholars can incorporate WikiRank scores into studies on information accuracy, digital literacy, collective intelligence, and crowd dynamics. This data can also inform sociological research into biases, representation, and content disparities across different languages and cultures.
    • Educational tools and platforms: educational institutions and learning platforms can integrate WikiRank scores to recommend reliable and high-quality articles, significantly aiding learners in sourcing accurate information.
    • AI and machine learning development: developers and data scientists can use WikiRank scores to train sophisticated NLP and content-generation models to recognize and produce high-quality, structured, and well-referenced content.
    • Content moderation and policy development: the Wikipedia community can use these metrics to enforce content quality policies more effectively.
    • Content strategy and editorial planning: media companies, publishers, and content strategists can employ these scores to identify high-performing content and detect topics needing deeper coverage or improvement.

    More information about the quality score can be found in the related scientific papers.

  6. Data for: Wikipedia as a gateway to biomedical research

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    application/gzip, txt
    Updated Sep 24, 2020
    Cite
    Joe Wass; Ryan Steinberg; Lauren Maggio (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. http://doi.org/10.5281/zenodo.831459
    Explore at:
    txt, application/gzip
    Available download formats
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joe Wass; Ryan Steinberg; Lauren Maggio
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.

    This dataset includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not each DOI is biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

    This dataset also includes a listing of Crossref DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data sets is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636

    An article based on this data was published in PLOS One:

    Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.

    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046

  7. Wikipedia SQLITE Portable DB, Huge 5M+ Rows

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
    Explore at:
    zip (6064169983 bytes)
    Available download formats
    Dataset updated
    Jun 29, 2024
    Authors
    christernyc
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

    I am not affiliated or partnered with Kensho in any way; I just really like the dataset because it gives my agents something easy to query.

    Key Features:

    • Contains over 5 million rows of data from English Wikipedia and Wikidata
    • Stored in a portable SQLite database format for easy integration and querying
    • Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
    • Ideal for NLP tasks, machine learning, data analysis, and research projects

    The database consists of four main tables:

    • items: Contains information about Wikipedia items, including labels and descriptions
    • properties: Stores details about Wikidata properties, such as labels and descriptions
    • pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts
    • link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

    This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

    Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.

    By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

    https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

    Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

    Usage with LIKE queries:

    ```python
    import asyncio

    import aiosqlite


    class KenshoDatasetQuery:
        def __init__(self, db_file):
            self.db_file = db_file

        async def __aenter__(self):
            # Open the SQLite connection when entering the async context.
            self.conn = await aiosqlite.connect(self.db_file)
            return self

        async def __aexit__(self, exc_type, exc_val, exc_tb):
            await self.conn.close()

        async def search_pages_by_title(self, title):
            # Join pages with their Wikidata item and link-annotated text.
            query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
            """
            async with self.conn.execute(query, (f"%{title}%",)) as cursor:
                return await cursor.fetchall()

        async def search_items_by_label_or_description(self, keyword):
            query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
            """
            async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
                return await cursor.fetchall()

        async def search_items_by_label(self, label):
            query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
            """
            async with self.conn.execute(query, (f"%{label}%",)) as cursor:
                return await cursor.fetchall()

        # async def search_properties_by_label_or_desc...
    ```
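
    A possible way to use the class above (a sketch only; the database file name is an assumption):

    ```python
    async def main():
        # LIKE-based substring search over page titles.
        async with KenshoDatasetQuery("kensho_wikipedia.db") as db:
            for row in await db.search_pages_by_title("Alan Turing"):
                print(row)

    asyncio.run(main())
    ```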
    
  8. Replication Data for: Taboo and Collaborative Knowledge Production: Evidence...

    • dataverse.harvard.edu
    Updated Oct 15, 2024
    Cite
    Kaylea Champion; Benjamin Mako Hill (2024). Replication Data for: Taboo and Collaborative Knowledge Production: Evidence from Wikipedia [Dataset]. http://doi.org/10.7910/DVN/5OKEEO
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kaylea Champion; Benjamin Mako Hill
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    By definition, people are reticent or even unwilling to talk about taboo subjects. Because subjects like sexuality, health, and violence are taboo in most cultures, important information on each can be difficult to obtain. Are peer produced knowledge bases like Wikipedia a promising approach for providing people with information on taboo subjects? With its reliance on volunteers who might also be averse to taboo, can the peer production model be relied on to produce high-quality information on taboo subjects? In this paper, we seek to understand the role of taboo in volunteer-produced knowledge bases. We do so by developing a novel computational approach to identify taboo subjects and by using this method to identify a set of articles on taboo subjects in English Wikipedia. We find that articles on taboo subjects are more popular than non-taboo articles and that they are frequently subject to vandalism. Despite frequent attacks, we also find that taboo articles are higher quality. We hypothesize that societal attitudes will lead contributors to taboo subjects to seek to be less identifiable. Although our results are consistent with this proposal in several ways, we surprisingly find that contributors make themselves more identifiable in others.

  9. A meta analysis of Wikipedia's coronavirus sources during the COVID-19...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    txt
    Updated Sep 8, 2022
    + more versions
    Cite
    (2022). A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7806
    Explore at:
    txt
    Available download formats
    Dataset updated
    Sep 8, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read - and in some cases also contribute - knowledge, information and data about the virus to an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how was the scientific research on this field represented on Wikipedia. Using citations as a readout, we try to map how COVID-19 related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and what the different sources that informed the COVID-19 content were, is key to understanding the digital knowledge ecosphere during the pandemic.

    To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we made a search with EuroPMC on Covid-19, SARS-CoV2 and SARS-nCoV19 (30'000 scientific papers, reviews and preprints) and a selection of scientific papers from 2019 onwards, which we compared to the Wikipedia citations extracted from the English Wikipedia dump of May 2020 (2'000'000 DOIs). This search led to 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs.

    Next, from our corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method allowed us to produce tables of annotated citations and extracted information for each Wikipedia article, such as books, websites and newspapers.

    Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code, and other bash/Python script utilities related to this project.
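
    The study's own extraction code lives in the WikiCitationHistoRy repository; as a rough illustration only, a common DOI-matching pattern looks like this (an approximation, not the authors' code):

    ```python
    import re

    # Approximate DOI pattern: registrant prefix "10.xxxx" followed by a suffix.
    DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

    wikitext = 'See doi:10.1371/journal.pone.0190046 for the published analysis.'
    print(DOI_PATTERN.findall(wikitext))  # ['10.1371/journal.pone.0190046']
    ```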

  10. Data from: EventWiki: A knowledge base of major events

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Apr 29, 2016
    + more versions
    Cite
    Tao Ge; Lei Cui; Baobao Chang; Ming Zhou; Zhifang Sui (2016). EventWiki: A knowledge base of major events [Dataset]. http://doi.org/10.6084/m9.figshare.3171472.v12
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Apr 29, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tao Ge; Lei Cui; Baobao Chang; Ming Zhou; Zhifang Sui
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EventWiki is a knowledge base of major events throughout human history. It contains 21,275 events of 95 types. The details of event entries can be found in our paper submission and documentation file. Data in the knowledge base is mainly harvested from Wikipedia. As with Wikipedia, this resource can be distributed and shared under the CC BY 3.0 license.

  11. Wikipedia Data Science Articles Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2024
    Cite
    sita berete (2024). Wikipedia Data Science Articles Dataset [Dataset]. https://www.kaggle.com/datasets/sitaberete/wikipedia-data-science-articles-dataset
    Explore at:
    zip (34981109 bytes)
    Available download formats
    Dataset updated
    Apr 27, 2024
    Authors
    sita berete
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by sita berete

    Released under MIT

    Contents

  12. Structured knowledge bases for the inference of computational trust of...

    • figshare.com
    pdf
    Updated May 5, 2020
    Cite
    Lucas Rizzo; Luca Longo (2020). Structured knowledge bases for the inference of computational trust of Wikipedia editors [Dataset]. http://doi.org/10.6084/m9.figshare.12249770.v4
    Explore at:
    pdf
    Available download formats
    Dataset updated
    May 5, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Rizzo; Luca Longo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge bases structured around IF-THEN rules and defined for the inference of computational trust in the Wikipedia context.

  13. Data from: Evolution of Wikipedia Categories

    • ssh.datastations.nl
    java, pdf, txt, zip
    Updated Jul 11, 2012
    Cite
    A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki (2012). Evolution of Wikipedia Categories [Dataset]. http://doi.org/10.17026/DANS-XJP-ZFUW
    Explore at:
    txt, pdf, java, zip (several hundred individual files)
    Available download formats
    Dataset updated
    Jul 11, 2012
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Knowledge Space Lab: Design versus Emergence. Comparison between the structure and evolution of categories in Wikipedia and the Universal Decimal Classification, 2009-2011.

    Background: This research was conducted by the project "Knowledge Space Lab - mapping knowledge interactively" (OND 1337291). Funded by the Royal Netherlands Academy of Arts and Sciences - KNAW from September 2009 to March 2011 [Strategiefondsproject KNAW - Amsterdam - The Netherlands], the project contributed to the new research area of mapping and modelling of science. The project addressed the difference between representing scholarly knowledge in (external) classification systems (such as thesauri, ontologies, bibliographic systems) and 'internal' representations based on data- and user-tagging (such as network analysis, user annotations/tagging, folksonomies).

  14. Data Sheet 1_Development and evaluation of a Wikipedia based group...

    • frontiersin.figshare.com
    pdf
    Updated Aug 20, 2025
    Cite
    Katelyn Mroczek; Pru Mitchell; Brian Patrick McSharry; Alice Woods; Belinda Spry; Timothy Paustian; Thiru Vanniasinkam (2025). Data Sheet 1_Development and evaluation of a Wikipedia based group assessment to enhance science communication.pdf [Dataset]. http://doi.org/10.3389/feduc.2025.1620804.s001
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Katelyn Mroczek; Pru Mitchell; Brian Patrick McSharry; Alice Woods; Belinda Spry; Timothy Paustian; Thiru Vanniasinkam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project, conducted in collaboration with Wikimedia Australia, introduced an assessment that aimed to enhance science communication skills among third-year microbiology students. With assistance from Wikimedia Australia, suitable Wikipedia articles on immunology topics were selected. All concepts had been covered in course content. Students worked in groups to evaluate these Wikipedia articles, assessing their accuracy, organization, verifiability, depth, and suitability for a general audience. Each group also generated an AI-created article on the same topic and evaluated it using the same criteria. The final report compared the AI-generated content with the Wikipedia article, focusing on key measures of science communication: accuracy, clarity, relevance, and reliability. The evaluation highlighted strengths and areas for improvement in both types of content, providing recommendations for enhancing Wikipedia articles. Students also submitted a reflection on the importance of information literacy and science communication in the digital age. After submission, a survey on students’ perspectives of the assignment was completed by 64% of the class (N = 42). Most students found the assignment to be a novel experience compared to previous tasks. Notably, 60% found it useful, and half indicated that they learned from their peers through the collaborative process. Students rated the readability of both Wikipedia and AI articles and assessed the accuracy and their suitability for a general audience. Additionally, students noted differences in output when generating AI articles, developing their AI literacy skills. The readability of Wikipedia articles compared to other scientific literature (textbooks and journal articles) was also rated, with 45% of students assessing these Wikipedia articles on immunology topics as not pitched for a general audience. By completing this assignment students reported gaining essential graduate competencies such as critical thinking, analysis, communication, and teamwork, as well as a better understanding of Wikipedia and AI. Students also shared their perspectives on whether they would consider using Wikipedia and AI for future assignments.

  15. Data from: WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia

    • heidata.uni-heidelberg.de
    application/x-gzip +1
    Updated Apr 5, 2017
    + more versions
    Cite
    Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler (2017). WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia [Dataset]. http://doi.org/10.11588/DATA/10003
    Explore at:
    text/plain; charset=us-ascii (1858), application/x-gzip (887887912)
    Available download formats
    Dataset updated
    Apr 5, 2017
    Dataset provided by
    heiDATA
    Authors
    Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler
    License

    Custom license: https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/10003

    Description

    WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models. The corpus contains training, development and testing subsets randomly split on the query level. Relevance judgments for Cross-Language Information Retrieval (CLIR) are constructed from the inter-language links between German and English Wikipedia articles. A relevance level of (3) is assigned to the (English) cross-lingual mate, and level (2) to all other (English) articles that link to the mate AND are linked by the mate. Our intuition for this level (2) is that articles in a bidirectional link relation to the mate are likely to either define similar concepts or be instances of the concept defined by the mate. For a more detailed description of the corpus construction process, see the above publication.
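
    A minimal sketch of the relevance rule described above (assuming a mapping from each English article to the set of English articles it links to; this is not the released corpus code):

    ```python
    # Relevance levels from inter-language links, as described above:
    # 3 for the cross-lingual mate, 2 for articles in a bidirectional
    # link relation with the mate, 0 otherwise.
    def relevance(mate, candidate, links):
        if candidate == mate:
            return 3
        if mate in links.get(candidate, set()) and candidate in links.get(mate, set()):
            return 2
        return 0
    ```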

  16. Data from: Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Cite
    Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Explore at:
    txt, csv
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain Data is analyzed in the paper

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated-values-format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific (see the sketch following this variable list).

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language editions of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).

  17. LLM Science Exam Training Data Wiki Pages

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    Jude Hunt (2023). LLM Science Exam Training Data Wiki Pages [Dataset]. https://www.kaggle.com/datasets/judehunt23/llm-science-exam-training-data-wiki-pages
    Explore at:
    zip (2843758 bytes)
    Available download formats
    Dataset updated
    Jul 18, 2023
    Authors
    Jude Hunt
    Description

    Text extracts for each section of the Wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the Wikipedia category "Concepts in Physics".

    Each page is broken down by section titles, and should also include a "Summary" section.

  18. Data from: Wizard of Wikipedia: Knowledge-powered conversational agents

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    E. Dinan; S. Roller; K. Shuster; A. Fan; M. Auli; J. Weston (2024). Wizard of Wikipedia: Knowledge-powered conversational agents [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvd2l6YXJkLW9mLXdpa2lwZWRpYS0ta25vd2xlZGdlLXBvd2VyZWQtY29udmVyc2F0aW9uYWwtYWdlbnRz
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    E. Dinan; S. Roller; K. Shuster; A. Fan; M. Auli; J. Weston
    Description

    The dataset is used for knowledge-grounded dialogue generation, where the goal is to generate responses to context based on external knowledge.

  19. Data from: Robust clustering of languages across Wikipedia growth

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 19, 2017
    Cite
    Kristina Ban; Matjaž Perc; Zoran Levnajić (2017). Robust clustering of languages across Wikipedia growth [Dataset]. http://doi.org/10.5061/dryad.sk0q2
    Explore at:
    zip
    Available download formats
    Dataset updated
    Sep 19, 2017
    Dataset provided by
    University of Maribor
    Faculty of Information Studies, Ljubljanska cesta 31A, 8000 Novo Mesto, Slovenia
    Authors
    Kristina Ban; Matjaž Perc; Zoran Levnajić
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over 5 million articles, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of these data, consisting of 14 962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used for their determination, as we verify via four different clustering methods. Interestingly, the identified Wikipedia clusters have little correlation with language families and groups. Rather, the growth of Wikipedia across different languages is governed by different factors, ranging from similarities in culture to information literacy.

  20. WORLD DATA by country (2020)

    • kaggle.com
    zip
    Updated Sep 19, 2020
    Cite
    Daniboy370 (2020). WORLD DATA by country (2020) [Dataset]. https://www.kaggle.com/daniboy370/world-data-by-country-2020
    Explore at:
    zip (21249 bytes)
    Available download formats
    Dataset updated
    Sep 19, 2020
    Authors
    Daniboy370
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Area covered
    World
    Description

    Context

    The kernel aims to extract data from Wikipedia's lists of countries by category and visualize it. The database itself contains a HUGE amount of analyzed data across different categories, waiting anxiously for someone to present it elegantly ( 😏 ) and compare the trends between the different countries.

    (Animation: https://github.com/Daniboy370/Machine-Learning/blob/master/Misc/Animation/VID-out-Wiki.gif?raw=true)
    

    Content

    The list contains 143 analyses of countries, each with respect to a specific criterion. In practice, I will refer to several criteria that I found interesting; however, the reader is free to add as many as they please:

    Criterion          File
    GDP per capita     df_{GDP}
    Population growth  df_{Pop-Growth}
    Life expectancy    df_{Life-exp}
    Median age         df_{Med-age}
    Meat consumption   df_{Meat-cons}
    Sex-ratio          df_{GDP}
    Suicide rate       df_{Suicide}
    Urbanization       df_{Urban}
    Fertility rate     df_{Fertile}

    Well-processed data should be able to provide a visualization such as this (for example):

    (Example visualization: https://github.com/Daniboy370/Uploads/blob/master/Kaggle-Dataset-Wiki.gif?raw=true)
    

    Pipeline

    Choose criterion >> Extract data >> Examine & Clean >> Convert to dataframe >> Visualize (see the sketch below):

    (Animation: https://github.com/Daniboy370/Uploads/blob/master/VID-Globe.gif?raw=true)
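
    A minimal sketch of that pipeline for one criterion, assuming pandas is available and using an illustrative Wikipedia list page (the URL, table index, and column labels are assumptions that may need adjusting):

        import pandas as pd

        # 1) Choose criterion: life expectancy, via a Wikipedia list page.
        url = "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"

        # 2) Extract data: read_html returns every table found on the page.
        tables = pd.read_html(url)

        # 3) Examine & clean: pick a table and keep the first two columns.
        df = tables[0].iloc[:, :2]
        df.columns = ["Country", "Life expectancy"]
        df = df.dropna().reset_index(drop=True)

        # 4) The result is already a dataframe; 5) visualize or inspect it.
        print(df.head())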
    

    Enjoy!

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, receiving over 20 billion visits from half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English-language edition of Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
