100+ datasets found

English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Explore at:
zip(4293465577 bytes)Available download formats
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields: - name - title of the article. - identifier - ID of the article. - image - main image representing the article's subject. - description - one-sentence description of the article for quick reference. - abstract - lead section, summarizing what the article is about. - infoboxes - parsed information from the side panel (infobox) on the Wikipedia article. - sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes - Compressed: 2GB - Uncompressed: 11GB

Infoboxes + sections + short description - Size of compressed file: 4.12 GB - Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown: - total # of articles analyzed: 6,940,949 - # people found with QID: 1,778,226 - # people found with Category: 158,996 - people found with Biography Project: 76,150 - Total # of people articles found: 2,013,372 - Total # people articles with infoboxes: 1,559,985 End stats - Total number of people articles in this dataset: 1,559,985 - that have a short description: 1,416,701 - that have an infobox: 1,559,985 - that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Wikipedia Knowledge Graph dataset
zenodo.org
produccioncientifica.ugr.es
+2more
pdf, tsv
Updated Jul 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
Explore at:
tsv, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6346900
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).

The document Dataset_summary includes a detailed description of the dataset.

Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
h
Wikipedia-Knowledge-2M
huggingface.co
Updated Jul 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xinyu Chen (2024). Wikipedia-Knowledge-2M [Dataset]. https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 1, 2024
Authors
Xinyu Chen
Description
📃 Paper | 🤗 Hugging Face | ⭐ Github

Dataset Overview

In the table below, we provide a brief summary of the dataset statistics.

Category Size

Total Sample 2019163

Total Image 2019163

Average Answer Length 84

Maximum Answer Length 5851

JSON Overview

Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
d
Archival Data for Page Protection: Another Missing Dimension of Wikipedia...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/P1VECE
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/P1VECE
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Hill, Benjamin Mako; Shaw, Aaron
Description
This dataset contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2015) “Page Protection: Another Missing Dimension of Wikipedia Research.” In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846 This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/ Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be “protected” so that only administrators or logged-in editors with a history of good editing can edit, move, or create it. Protection might involve “full protection” where a page can only be edited by administrators (i.e., “sysops”) or “semi-protection” where a page can only be edited by accounts with a history of good edits (i.e., “autoconfirmed” users). Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the “Main Page” in English Wikipedia has been protected since February, 2006 and all Featured Articles are protected at the time they appear on the site’s main page. Millions of viewers may never edit Wikipedia because they never see an edit button. Despite it's widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work. Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by Wikimedia Foundation in as dumps is incomplete. As a result, taking protection into account involves looking at several different sources of data. Much more detail can be found in our paper Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate if you cite the paper.
Quality of Wikipedia articles by WikiRank
kaggle.com
zip
Updated Mar 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Włodzimierz Lewoniewski (2025). Quality of Wikipedia articles by WikiRank [Dataset]. https://www.kaggle.com/datasets/lewoniewski/quality-of-wikipedia-articles-by-wikirank
Explore at:
zip(771671698 bytes)Available download formats
Dataset updated
Mar 18, 2025
Authors
Włodzimierz Lewoniewski
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Datasets with quality score for 47 million Wikipedia articles across 55 language versions by Wikirank, as of 1 August 2024.

Potential Applications:

Academic research: scholars can incorporate WikiRank scores into studies on information accuracy, digital literacy, collective intelligence, and crowd dynamics. This data can also inform sociological research into biases, representation, and content disparities across different languages and cultures.

Educational tools and platforms: educational institutions and learning platforms can integrate WikiRank scores to recommend reliable and high-quality articles, significantly aiding learners in sourcing accurate information.

AI and machine learning development: developers and data scientists can use WikiRank scores to train sophisticated NLP and content-generation models to recognize and produce high-quality, structured, and well-referenced content.

Content moderation and policy development: Wikipedia community can use these metrics to enforce content quality policies more effectively.

Content strategy and editorial planning: media companies, publishers, and content strategists can employ these scores to identify high-performing content and detect topics needing deeper coverage or improvement.

More information about the quality score can be found in scientific papers:

Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics

Quality and Importance of Wikipedia Articles in Different Languages

Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles

Modelling the quality of attributes in Wikipedia infoboxes
Data for: Wikipedia as a gateway to biomedical research
zenodo.org
data.niaid.nih.gov
+1more
application/gzip, txt
Updated Sep 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joe Wass; Ryan Steinberg; Lauren Maggio; Joe Wass; Ryan Steinberg; Lauren Maggio (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. http://doi.org/10.5281/zenodo.831459
Explore at:
txt, application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.831459
Dataset updated
Sep 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Joe Wass; Ryan Steinberg; Lauren Maggio; Joe Wass; Ryan Steinberg; Lauren Maggio
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.

This data set includes for each day in August 2016 a listing of all DOI present in the English language version of Wikipedia and whether or not the DOI are biomedical in nature. Source Code for these data are available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

This dataset also includes a listing from Crossref DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data sets is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636

An article based on this data was published in PLOS One:

Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046
Wikipedia SQLITE Portable DB, Huge 5M+ Rows
kaggle.com
zip
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
Explore at:
zip(6064169983 bytes)Available download formats
Dataset updated
Jun 29, 2024
Authors
christernyc
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

I am not affiliated or partnered with the Kensho in any way, just really like the dataset for giving my agents to query easily.

Key Features:

Contains over 5 million rows of data from English Wikipedia and Wikidata Stored in a portable SQLite database format for easy integration and querying Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base Ideal for NLP tasks, machine learning, data analysis, and research projects

The database consists of four main tables:

items: Contains information about Wikipedia items, including labels and descriptions

properties: Stores details about Wikidata properties, such as labels and descriptions

pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts

link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license. Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

Usage with LIKE queries: ``` import aiosqlite import asyncio

class KenshoDatasetQuery: def init(self, db_file): self.db_file = db_file

async def _aenter_(self): self.conn = await aiosqlite.connect(self.db_file) return self async def _aexit_(self, exc_type, exc_val, exc_tb): await self.conn.close() async def search_pages_by_title(self, title): query = """ SELECT pages.page_id, pages.item_id, pages.title, pages.views, items.labels AS item_labels, items.description AS item_description, link_annotated_text.sections FROM pages JOIN items ON pages.item_id = items.id JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id WHERE pages.title LIKE ? """ async with self.conn.execute(query, (f"%{title}%",)) as cursor: return await cursor.fetchall() async def search_items_by_label_or_description(self, keyword): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? OR description LIKE ? """ async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor: return await cursor.fetchall() async def search_items_by_label(self, label): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? """ async with self.conn.execute(query, (f"%{label}%",)) as cursor: return await cursor.fetchall() async def search_properties_by_label_or_desc...
H
Replication Data for: Taboo and Collaborative Knowledge Production: Evidence...
dataverse.harvard.edu
Updated Oct 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaylea Champion; Benjamin Mako Hill (2024). Replication Data for: Taboo and Collaborative Knowledge Production: Evidence from Wikipedia [Dataset]. http://doi.org/10.7910/DVN/5OKEEO
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/5OKEEO
Dataset updated
Oct 15, 2024
Dataset provided by
Harvard Dataverse
Authors
Kaylea Champion; Benjamin Mako Hill
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
By definition, people are reticent or even unwilling to talk about taboo subjects. Because subjects like sexuality, health, and violence are taboo in most cultures, important information on each can be difficult to obtain. Are peer produced knowledge bases like Wikipedia a promising approach for providing people with information on taboo subjects? With its reliance on volunteers who might also be averse to taboo, can the peer production model be relied on to produce high-quality information on taboo subjects? In this paper, we seek to understand the role of taboo in volunteer-produced knowledge bases. We do so by developing a novel computational approach to identify taboo subjects and by using this method to identify a set of articles on taboo subjects in English Wikipedia. We find that articles on taboo subjects are more popular than non-taboo articles and that they are frequently subject to vandalism. Despite frequent attacks, we also find that taboo articles are higher quality. We hypothesize that societal attitudes will lead contributors to taboo subjects to seek to be less identifiable. Although our results are consistent with this proposal in several ways, we surprisingly find that contributors make themselves more identifiable in others.
E
A meta analysis of Wikipedia's coronavirus sources during the COVID-19...
live.european-language-grid.eu
data.niaid.nih.gov
txt
Updated Sep 8, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7806
Explore at:
txtAvailable download formats
Dataset updated
Sep 8, 2022
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for most traffic in a single day. Since the breakout of the Covid-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read - and in some cases also contribute - knowledge, information and data about the virus to an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia’s coronavirus content, and how was the scientific research on this field represented on Wikipedia. Using citation as readout we try to map how COVID-19 related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and what were the different sources that informed the Covid-19 content, is key to understanding the digital knowledge echosphere during the pandemic. To delimitate the corpus of Wikipedia articles containing Digital Object Identifier (DOI), we applied two different strategies. First we scraped every Wikipedia pages form the COVID-19 Wikipedia project (about 3000 pages) and we filtered them to keep only page containing DOI citations. For our second strategy, we made a search with EuroPMC on Covid-19, SARS-CoV2, SARS-nCoV19 (30’000 sci papers, reviews and preprints) and a selection on scientific papers form 2019 onwards that we compared to the Wikipedia extracted citations from the english Wikipedia dump of May 2020 (2’000’000 DOIs). This search led to 231 Wikipedia articles containing at least one citation of the EuroPMC search or part of the wikipedia COVID-19 project pages containing DOIs. Next, from our 231 Wikipedia articles corpus we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each wikipedia article and we retrive Atmetics, CrossRef and EuroPMC infromations for each DOI. Finally, our method allowed to produce tables of citations annotated and extracted infromations in each wikipadia articles such as books, websites, newspapers.Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive.See the WikiCitationHistoRy Github repository for the R codes, and other bash/python scripts utilities related to this project.
Data from: EventWiki: A knowledge base of major events
figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Apr 29, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tao Ge; Lei Cui; Baobao Chang; Ming Zhou; Zhifang Sui (2016). EventWiki: A knowledge base of major events [Dataset]. http://doi.org/10.6084/m9.figshare.3171472.v12
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3171472.v12
Dataset updated
Apr 29, 2016
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Tao Ge; Lei Cui; Baobao Chang; Ming Zhou; Zhifang Sui
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
EventWiki is a knowledge base of major events happening throughout mankind history. It contains 21,275 events of 95 types. The details of event entries can be found in our paper submission and documentation file. Data in the knowledge base is mainly harvested from Wikipedia.As Wikipedia, this resource can be distributed and shared under CC-BY 3.0 license.
Wikipedia Data Science Articles Dataset
kaggle.com
zip
Updated Apr 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sita berete (2024). Wikipedia Data Science Articles Dataset [Dataset]. https://www.kaggle.com/datasets/sitaberete/wikipedia-data-science-articles-dataset
Explore at:
zip(34981109 bytes)Available download formats
Dataset updated
Apr 27, 2024
Authors
sita berete
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by sita berete

Released under MIT

Contents
Structured knowledge bases for the inference of computational trust of...
figshare.com
pdf
Updated May 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucas Rizzo; luca longo (2020). Structured knowledge bases for the inference of computational trust of Wikipedia editors [Dataset]. http://doi.org/10.6084/m9.figshare.12249770.v4
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12249770.v4
Dataset updated
May 5, 2020
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Lucas Rizzo; luca longo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge bases structured around IF-THEN rules and defined for the inference of computational trust in the Wikipedia context.
D
Data from: Evolution of Wikipedia Categories
ssh.datastations.nl
java, pdf, txt, zip
Updated Jul 11, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki; A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki (2012). Evolution of Wikipedia Categories [Dataset]. http://doi.org/10.17026/DANS-XJP-ZFUW
Explore at:
txt(33032974), txt(35369259), txt(4117715), txt(35264986), txt(41292104), txt(24243125), txt(2265298), txt(58440553), txt(41931167), txt(88778120), txt(41126367), txt(28677451), txt(34385842), txt(258099563), txt(41498742), txt(218660748), txt(1983071), txt(46304635), txt(1265318861), txt(47640307), txt(914867480), txt(54012100), txt(3378324), pdf(150123), txt(40333305), txt(32119456), txt(34067472), txt(37437197), txt(40831947), txt(55345683), txt(50853424), txt(29333733), txt(78527665), txt(66243463), txt(6114855), txt(39943928), txt(29236412), txt(35762723), txt(54186791), txt(30011306), txt(29474344), txt(36009576), txt(16283936), txt(45000109), txt(41143476), txt(26394771), java(6859), txt(46686266), txt(15506101), txt(43105254), txt(42154291), txt(50548553), txt(17319810), txt(38849525), txt(25913876), txt(37961660), txt(30823490), txt(16550403), txt(952505436), txt(109322245), txt(3633102), txt(41934664), txt(44260226), txt(45317846), txt(39643128), txt(32305567), txt(6987648), txt(47024242), txt(1067127455), txt(48025211), txt(31897590), txt(37484419), txt(38164266), txt(47459458), txt(14351261), txt(24860364), txt(26155184), txt(42417668), txt(34226158), java(3416), txt(42509270), txt(36707564), txt(44658240), txt(25645673), txt(37351150), txt(585651958), txt(2102396), txt(28747585), txt(54497537), txt(48622454), txt(47764915), txt(688578566), txt(19773260), txt(33901065), txt(17217744), txt(42431391), txt(764725606), txt(51446908), txt(49632391), txt(36302697), txt(38362424), txt(42062770), txt(126171919), txt(17013094), txt(40997973), txt(40012567), txt(33487318), txt(31583165), txt(30768179), txt(39164518), txt(1334772000), txt(50183246), java(11265), txt(52345802), txt(52726218), txt(28736651), txt(39768202), pdf(110150), txt(37728066), txt(34687899), txt(55571566), java(1603), txt(190829819), txt(45160342), txt(43875230), txt(352434761), txt(35261890), txt(36483279), txt(38707347), txt(28777405), txt(42095088), txt(26250474), txt(23156510), txt(30536465), txt(4794697), java(2110), java(10322), txt(38646916), txt(62403066), txt(38013254), txt(163078842), txt(21840704), txt(37442885), txt(38331967), txt(35611692), txt(49461543), txt(48415938), txt(20982868), txt(48735292), txt(49322432), txt(39678411), txt(40626514), txt(41125978), txt(63448477), txt(193432), txt(17188912), txt(44224371), txt(1297141534), txt(42183743), txt(308373180), txt(42663602), txt(27690233), txt(37589440), java(995), txt(38125937), txt(35178223), txt(204975333), txt(52099568), txt(41961692), txt(3890505), txt(157682742), txt(47253870), txt(56393401), txt(21334448), txt(52872987), txt(31885808), txt(38799587), txt(227611586), txt(43151707), txt(41392165), txt(41712709), txt(47539352), txt(54609533), txt(43389901), txt(17507607), txt(37806111), txt(43165634), txt(1225920417), txt(133542000), txt(33057450), txt(35260616), txt(35529200), txt(44636352), txt(47745399), txt(843293431), txt(128143908), txt(60159529), txt(52836430), txt(52308031), txt(43801831), txt(48095886), java(915), txt(28577673), txt(29349243), txt(56908715), txt(20634451), txt(13386944), txt(6741043), txt(43117658), txt(39012496), txt(45518733), txt(46276385), txt(90686127), txt(41447460), txt(43065286), java(1225), txt(20500655), txt(13058479), txt(31492686), txt(43564117), txt(28738426), txt(21200173), txt(42685416), txt(22599531), txt(45038669), txt(487589), txt(2934435), txt(19617297), txt(19472419), txt(12655092), txt(20556357), txt(36148341), java(1720), txt(22852661), txt(34112098), txt(31529478), txt(35416785), txt(37043830), txt(45647747), txt(28525286), txt(27790866), txt(58691414), txt(37446435), txt(52116721), txt(35877304), txt(34806765), txt(4437820), txt(487611895), txt(27374226), txt(42966748), txt(37498875), txt(45498298), txt(10056941), txt(32374), txt(49797466), txt(26406214), txt(41930169), txt(34858609), txt(44311900), txt(50835469), txt(200372585), txt(27680489), txt(250739134), txt(0), txt(35901517), txt(48541100), txt(1107298287), txt(33088170), txt(31691166), txt(59205195), txt(45561353), txt(5146366), txt(209095045), txt(38473715), txt(30658842), txt(37559848), txt(24942395), txt(19267756), txt(24647493), txt(28830376), txt(3014365), txt(42226500), txt(26411939), txt(29799710), txt(18029016), txt(27622515), txt(1850806), txt(117766254), txt(36632339), txt(26829427), txt(49077336), txt(30856436), txt(44944646), txt(235019432), txt(47354231), txt(29158597), txt(49984811), txt(34786744), txt(53439646), txt(36838168), txt(38875584), txt(50789667), txt(35322512), txt(34500890), txt(51197289), txt(36716659), txt(42090809), txt(33928990), txt(39351870), txt(150102485), txt(17660), txt(5622224), txt(27733986), txt(47951010), txt(34861772), txt(25808336), txt(31218719), txt(31865291), txt(24134125), txt(31535565), txt(44634411), txt(26748105), txt(52645902), txt(1181597097), txt(43248565), txt(44479794), txt(35184256), txt(56494413), txt(23358688), txt(20764882), txt(379729932), txt(52413633), txt(34449193), txt(22055569), txt(37942346), txt(1638418), txt(35524491), txt(32935411), txt(26712291), txt(26179014), java(11109), txt(19681006), txt(131156), txt(14370026), txt(40930102), txt(62766421), java(6766), txt(4132293), txt(2600), java(27514), txt(57288263), txt(34453125), txt(31824578), txt(38279122), txt(30562620), txt(22961650), txt(930392), txt(41488268), txt(20771686), txt(49128081), txt(21003031), txt(992855547), txt(44936799), txt(56014528), txt(52739126), txt(31984512), txt(29635044), txt(31924859), txt(36887397), txt(55697972), txt(42692696), txt(30271698), txt(37150544), txt(41907961), txt(34255771), txt(41528059), txt(47421399), txt(42013597), txt(35718012), txt(20084396), txt(18528908), txt(50403821), txt(24969336), txt(33843189), txt(40977021), txt(49780431), txt(20752356), txt(38762074), txt(26165469), txt(42995205), txt(33293616), txt(47766398), txt(44505332), txt(56209651), txt(42245020), txt(48263139), txt(55949786), txt(53578305), txt(46170580), txt(5468515), txt(37829855), txt(21166157), txt(41879899), txt(36163443), txt(21813777), txt(43575337), txt(26603986), txt(33497213), txt(37850321), txt(226069707), txt(25352451), txt(289766), txt(50392475), txt(288389798), txt(64948679), txt(19009807), txt(34568834), txt(25388782), txt(807539529), txt(33395916), txt(39499018), txt(45586), txt(31188834), txt(554349827), txt(42281886), txt(43370191), txt(31650088), txt(37732629), txt(117480248), txt(159817), txt(44713554), txt(42359269), txt(41773968), txt(618887860), txt(24447870), txt(11402468), txt(34165517), txt(42074168), txt(38989592), txt(49058470), txt(44855395), txt(519832277), txt(47110651), txt(39515065), txt(21006455), txt(31427585), txt(50916261), txt(148840267), txt(90824), txt(19102644), txt(60536419), txt(35689608), txt(2730367), txt(38248862), txt(34874700), zip(389926), txt(259196479), txt(4765329), txt(49657686), txt(41273876), txt(666788), txt(610355), txt(44745075), txt(31912651), txt(43149852), txt(40152228), txt(26942186), txt(41761596), txt(20448660), txt(52581985), txt(34934355), txt(43029600), txt(38047705), txt(39446330), txt(38722262), txt(803661), txt(30328578), txt(7831881), txt(35413143), txt(20947653), java(7389), txt(37007327), txt(34864421), txt(42076004), txt(42358758), txt(48838759), txt(34005293), txt(48622252), txt(6376935), txt(39570683), txt(18757525), txt(42902032), txt(1144837492), txt(63302916), txt(2403400), txt(20704677), txt(191948035), java(998), txt(329), txt(29513440), txt(45537015), txt(38773295), txt(51797515), pdf(1412543), txt(56025377), txt(60698144), txt(24195481), txt(37194357), txt(42702582), txt(37093786), txt(58054397), txt(10165850), txt(724987147), txt(44637671), txt(16663480), txt(42350398), txt(42267233), txt(63946969), java(876), txt(16778735), txt(3117356), txt(64293115), txt(54449354), txt(45704465), txt(37055699), txt(565530), txt(29603585), txt(37838987), java(1566), txt(328932544), txt(102004837), txt(14569303), txt(31175185), txt(43999744), txt(21445906), txt(50054698), txt(33296946), txt(37025769), txt(29315056), txt(433971175), txt(183201610), txt(15534431), txt(728951), txt(32005112), txt(54611627), txt(43059562), txt(97900827), txt(260630709), txt(69209262), txt(52233516), txt(42520263), txt(46872202), txt(1367185652), txt(39300385), txt(64416491), txt(57032343), txt(395), txt(32981733), txt(28953660), txt(408830234), txt(106105556), txt(36151155), txt(319764), txt(38441027), txt(39583326), txt(42143282), txt(69051079), txt(37795073), txt(39943552), txt(19809172), txt(54784534), txt(46541150), txt(42459587), txt(35783072), txt(24081416), txt(12369123), txt(2236), txt(9718570), txt(39720922), txt(53703), txt(37463520), txt(1400326542), txt(42203356), txt(14106816), txt(42473450), txt(20323076), txt(2565313), txt(43369855), txt(20865102), txt(461098775), txt(19313037), txt(31733002), txt(15380303), txt(18086399), txt(53240080), txt(28277861), txt(1150099), txt(653475), txt(48107812), txt(35581734), txt(1406557990), txt(28443813), txt(51828705), txt(8629394), txt(1015309), txt(511301), txt(32859345), txt(138696033), txt(84105359), txt(47962501), txt(374992), txt(37354546), txt(52733513), txt(12481669), txt(35050326), txt(31000246), txt(66678), txt(45232667), txt(58913538), txt(32487571), txt(51117994), txt(46829257), txt(35244965), txt(58863923), txt(38275341), txt(35672748), txt(33101029), txt(34958860), txt(16536287), txt(17375188), txt(45986764), txt(33701119), txt(44804568), txt(53223), txt(44556188), txt(30817797), txt(95059419), txt(22813148), txt(49458317), txt(22620085), txt(49170652), txt(41246703), txt(4321410), txt(55725069), txt(24355172), txt(51064889), txt(61403), txt(53648443), txt(25013384), txt(243499980), txt(65407906), txt(45559636), txt(45869328), txt(40449868), txt(30018932), txt(34788477), txt(20232100), txt(10539772), txt(45197710), txt(43882885), txt(38721027), txt(57888631), txt(177711836), txt(54040994), txt(54201949), txt(28434345), txt(234169), txt(73964876), txt(45369816), txt(43802776), txt(64012573), txt(56171930), txt(24526225), txt(878816729), txt(41522123), txt(174839777), txt(652735660), txt(32299186), txt(36962322), txt(48441500), txt(47098864), txt(1029659570), txt(244676), txt(41133236), txt(45714502), txt(21760067), txt(14483460), txt(82664660), txt(36649777), txt(24919738), txt(43209985), txt(42961348), txt(55200796), txt(75640658), txt(19708925), txt(29243602), txt(37620213), txt(10741609), txt(38010146), txt(35987506), txt(47936014), txt(44186905), java(23208), txt(26883080), txt(38016105), txt(141225562), txt(35904654), txt(26496388), txt(28372347), txt(31118124), txt(243004439), txt(36098280), txt(271523843), txt(29775169), txt(35551990), txt(44924001), txt(34989987), txt(48465369), txt(56508060), txt(26921), txt(33717744), txt(42364919), txt(165958033), txt(28397856), txt(8007), txt(42997272), txt(45220759), txt(52861274), txt(41690198), txt(145207327)Available download formats
Unique identifier
https://doi.org/10.17026/DANS-XJP-ZFUW
Dataset updated
Jul 11, 2012
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki; A. Scharnhorst; C. Gao; A. Akdag Salah; K. Suchecki
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Knowledge Space Lab: Design versus Emergence. Comparison between the structure and evolution of categories in the Wikipedia and the Universal Decimal Classification. 2009-2011.Background:This research has been conducted by the project "Knowledge Space Lab - mapping knowledge interactively" (OND 1337291}. Funded by the Royal Netherlands Academy of Arts and Sciences - KNAW, from September 2009 - March 2011 [Strategiefondsproject KNAW - Amsterdam - The Netherlands] the project contributed to the new research area of mapping and modelling of science. The project addressed the difference between representing scholarly knowledge in (external) classification systems (such as thesauri, ontologies, bibliographic systems) and 'internal' representations based on data- and user-tagging (such as network analysis, user annotations/tagging, folksonomies).
f
Data Sheet 1_Development and evaluation of a Wikipedia based group...
frontiersin.figshare.com
pdf
Updated Aug 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katelyn Mroczek; Pru Mitchell; Brian Patrick McSharry; Alice Woods; Belinda Spry; Timothy Paustian; Thiru Vanniasinkam (2025). Data Sheet 1_Development and evaluation of a Wikipedia based group assessment to enhance science communication.pdf [Dataset]. http://doi.org/10.3389/feduc.2025.1620804.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2025.1620804.s001
Dataset updated
Aug 20, 2025
Dataset provided by
Frontiers
Authors
Katelyn Mroczek; Pru Mitchell; Brian Patrick McSharry; Alice Woods; Belinda Spry; Timothy Paustian; Thiru Vanniasinkam
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This project, conducted in collaboration with Wikimedia Australia, introduced an assessment that aimed to enhance science communication skills among third-year microbiology students. With assistance from Wikimedia Australia, suitable Wikipedia articles on immunology topics were selected. All concepts had been covered in course content. Students worked in groups to evaluate these Wikipedia articles, assessing their accuracy, organization, verifiability, depth, and suitability for a general audience. Each group also generated an AI-created article on the same topic and evaluated it using the same criteria. The final report compared the AI-generated content with the Wikipedia article, focusing on key measures of science communication: accuracy, clarity, relevance, and reliability. The evaluation highlighted strengths and areas for improvement in both types of content, providing recommendations for enhancing Wikipedia articles. Students also submitted a reflection on the importance of information literacy and science communication in the digital age. After submission, a survey on students’ perspectives of the assignment was completed by 64% of the class (N = 42). Most students found the assignment to be a novel experience compared to previous tasks. Notably, 60% found it useful, and half indicated that they learned from their peers through the collaborative process. Students rated the readability of both Wikipedia and AI articles and assessed the accuracy and their suitability for a general audience. Additionally, students noted differences in output when generating AI articles, developing their AI literacy skills. The readability of Wikipedia articles compared to other scientific literature (textbooks and journal articles) was also rated, with 45% of students assessing these Wikipedia articles on immunology topics as not pitched for a general audience. By completing this assignment students reported gaining essential graduate competencies such as critical thinking, analysis, communication, and teamwork, as well as a better understanding of Wikipedia and AI. Students also shared their perspectives on whether they would consider using Wikipedia and AI for future assignments.
h
Data from: WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia
heidata.uni-heidelberg.de
application/x-gzip +1
Updated Apr 5, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler; Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler (2017). WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia [Dataset]. http://doi.org/10.11588/DATA/10003
Explore at:
text/plain; charset=us-ascii(1858), application/x-gzip(887887912)Available download formats
Unique identifier
https://doi.org/10.11588/DATA/10003
Dataset updated
Apr 5, 2017
Dataset provided by
heiDATA
Authors
Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler; Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler
License
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/10003https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/10003
Description
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models. The corpus contains training, development and testing subsets randomly split on the query level. Relevance judgments for Cross-Language Information Retrieval (CLIR) are constructed from the inter-language links between German and English Wikipedia articles. A relevance level of (3) is assigned to the (English) cross-lingual mate, and level (2) to all other (English) articles that link to the mate, AND are linked by the mate. Our intuition for this level (2) is that arti cles in a bidirectional link relation to the mate are likely to either define similar concepts or are instances of the concept defined by the mate. For a more detailed description of the corpus construction process, see the above publication.
Data from: Wikipedia Category Granularity (WikiGrain) data
zenodo.org
csv, txt
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
Explore at:
txt, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1005175
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jürgen Lerner; Jürgen Lerner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

The WikiGrain Data is analyzed in the paper

Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

===============================================================
Individual files (tables in comma-separated-values-format):

---------------------------------------------------------------
* article_info.csv contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.

- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.

---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
The file contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

- "title.of.tlc"
(string) Title of the TLC in which the article is contained.

---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
The file contains the following variables:

- "id"
Article id.

- "is.FA"
Boolean indicator for whether the article is featured.

- "log1p.length"
Length measured by the number of bytes.

- "age"
Age measured by the time since the first edit.

- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.

- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.

- "log1p.number.of.contributors"
Number of unique contributors to the article.

- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').

- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').

- "number.of.level.1.sections"
Number of first level sections in the article.

- "number.of.level.2.sections"
Number of second level sections in the article.

- "number.of.categories"
Number of categories the article is in.

- "log1p.average.size.of.categories"
Average size of the categories the article is in.

- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.

- "log1p.number.of.external.references"
Number of external references given in the article.

- "log1p.number.of.images"
Number of images in the article.

- "log1p.number.of.templates"
Number of templates that the article uses.

- "log1p.number.of.inter.language.links"
Number of links to articles in different language edition of Wikipedia.

- "granularity"
As in article_info.csv (but normalized to standard deviation one).
LLM Science Exam Training Data Wiki Pages
kaggle.com
zip
Updated Jul 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jude Hunt (2023). LLM Science Exam Training Data Wiki Pages [Dataset]. https://www.kaggle.com/datasets/judehunt23/llm-science-exam-training-data-wiki-pages
Explore at:
zip(2843758 bytes)Available download formats
Dataset updated
Jul 18, 2023
Authors
Jude Hunt
Description
Text extracts for each section of the wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the wikipedia category "Concepts in Physics".

Each page is broken down by section titles, and should also include a "Summary" section
r
Data from: Wizard of Wikipedia: Knowledge-powered conversational agents
resodate.org
service.tib.eu
Updated Dec 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
E. Dinan; S. Roller; K. Shuster; A. Fan; M. Auli; J. Weston (2024). Wizard of Wikipedia: Knowledge-powered conversational agents [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvd2l6YXJkLW9mLXdpa2lwZWRpYS0ta25vd2xlZGdlLXBvd2VyZWQtY29udmVyc2F0aW9uYWwtYWdlbnRz
Explore at:
Dataset updated
Dec 16, 2024
Dataset provided by
Leibniz Data Manager
Authors
E. Dinan; S. Roller; K. Shuster; A. Fan; M. Auli; J. Weston
Description
The dataset is used for knowledge-grounded dialogue generation, where the goal is to generate responses to context based on external knowledge.
n
Data from: Robust clustering of languages across Wikipedia growth
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 19, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kristina Ban; Matjaž Perc; Zoran Levnajić (2017). Robust clustering of languages across Wikipedia growth [Dataset]. http://doi.org/10.5061/dryad.sk0q2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.sk0q2
Dataset updated
Sep 19, 2017
Dataset provided by
University of Maribor
Faculty of Information Studies, Ljubljanska cesta 31A, 8000 Novo Mesto, Slovenia
Authors
Kristina Ban; Matjaž Perc; Zoran Levnajić
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over 5 million articles, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of these data, consisting of 14 962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used for their determination, as we verify via four different clustering methods. Interestingly, the identified Wikipedia clusters have little correlation with language families and groups. Rather, the growth of Wikipedia across different languages is governed by different factors, ranging from similarities in culture to information literacy.

WORLD DATA by country (2020)

kaggle.com

zip

Updated Sep 19, 2020

Facebook

Twitter

Click to copy link

Link copied

Cite

Daniboy370 (2020). WORLD DATA by country (2020) [Dataset]. https://www.kaggle.com/daniboy370/world-data-by-country-2020

Explore at:

zip(21249 bytes)Available download formats

Dataset updated

Sep 19, 2020

Authors

Daniboy370

License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Area covered

World

Description

Context

The kernel aims to extract data from Wikipedia's list of countries by category, and visualize it. The database itself, contains a HUGE amount of analyzed data at different categories, waiting anxiously for someone to present them elegantly ( 😏 ), and compare the trends between the different countries.

               <img src="https://github.com/Daniboy370/Machine-Learning/blob/master/Misc/Animation/VID-out-Wiki.gif?raw=true" width="550">

Content

The list contains 143 analyses of countries with respect to a specific criterion. Practically, I will refer to several criteria that I found interesting, however the reader is free to add as much as he pleases :

Criterion	File
GDP per capita	df_{GDP}
Population growth	df_{Pop-Growth}
Life expectancy	df_{Life-exp}
Median age	df_{Med-age}
Meat consumption	df_{Meat-cons}
Sex-ratio	df_{GDP}
Suicide rate	df_{Suicide}
Urbanization	df_{Urban}
Fertility rate	df_{Fertile}

The well processed data should be able to provide such a visualization ( for example ) :

                      <img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-Dataset-Wiki.gif?raw=true" width="600">

Pipeline

Choose criterion >> Extract data >> Examine & Clean >> Convert to dataframe >> Visualize :

                      <img src="https://github.com/Daniboy370/Uploads/blob/master/VID-Globe.gif?raw=true" width="400">

\[ \text{Enjoy !}\]

Facebook

Twitter

Click to copy link

Link copied

Cite

Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia

Explore at:

zip(4293465577 bytes)Available download formats

Dataset updated

Jul 31, 2025

Dataset provided by

Wikimedia Foundationhttp://www.wikimedia.org/

Authors

Wikimedia

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz
Size of compressed file: 4.12 GB
Size of uncompressed file: 21.28 GB

Noteworthy Included Fields: - name - title of the article. - identifier - ID of the article. - image - main image representing the article's subject. - description - one-sentence description of the article for quick reference. - abstract - lead section, summarizing what the article is about. - infoboxes - parsed information from the side panel (infobox) on the Wikipedia article. - sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes - Compressed: 2GB - Uncompressed: 11GB

Infoboxes + sections + short description - Size of compressed file: 4.12 GB - Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown: - total # of articles analyzed: 6,940,949 - # people found with QID: 1,778,226 - # people found with Category: 158,996 - people found with Biography Project: 76,150 - Total # of people articles found: 2,013,372 - Total # people articles with infoboxes: 1,559,985 End stats - Total number of people articles in this dataset: 1,559,985 - that have a short description: 1,416,701 - that have an infobox: 1,559,985 - that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

Clear search

Close search

Google apps

Main menu

English Wikipedia People Dataset

Summary

Data Structure

Stats

Maintenance and Support

Initial Data Collection and Normalization

Who are the source language producers?

Attribution

Wikipedia Knowledge Graph dataset

Wikipedia-Knowledge-2M

Archival Data for Page Protection: Another Missing Dimension of Wikipedia...

Quality of Wikipedia articles by WikiRank

Potential Applications:

Data for: Wikipedia as a gateway to biomedical research

Wikipedia SQLITE Portable DB, Huge 5M+ Rows

Replication Data for: Taboo and Collaborative Knowledge Production: Evidence...

A meta analysis of Wikipedia's coronavirus sources during the COVID-19...

Data from: EventWiki: A knowledge base of major events

Wikipedia Data Science Articles Dataset

Dataset

Contents

Structured knowledge bases for the inference of computational trust of...

Data from: Evolution of Wikipedia Categories

Data Sheet 1_Development and evaluation of a Wikipedia based group...

Data from: WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia

Data from: Wikipedia Category Granularity (WikiGrain) data

LLM Science Exam Training Data Wiki Pages

Data from: Wizard of Wikipedia: Knowledge-powered conversational agents

Data from: Robust clustering of languages across Wikipedia growth

WORLD DATA by country (2020)

Context

Content

Pipeline

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia

Summary

Data Structure

Stats

Maintenance and Support

Initial Data Collection and Normalization

Who are the source language producers?

Attribution