As of December 2023, the English subdomain of Wikipedia had around 6.91 million published articles, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with around 2.96 million and 2.65 million entries respectively. Cebuano, the only Asian language among the top 10, was the language with the second-most articles on the portal, amassing around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was analyzed and produced during the study described in the paper "Relating Wikipedia Article Quality to Edit Behavior and Link Structure" (under review; DOI and link will follow, see references). Its creation process and use cases are described in the dedicated paper.
For directions and code to process and evaluate this data, please see the corresponding GitHub repository: https://github.com/ruptho/editlinkquality-wikipedia.
We provide three files for 4,941 Wikipedia articles (in .pkl format):
"article_revisions_labeled.pkl" provides the final, semantically labeled revisions for each analyzed article per quality category.
"article_revision_features.zip" contains processed per-article features, divided into folders for the specific quality categories they belong to.
"article_revision_features_raw.zip" provides the raw features as retrieved via the RevScoring API (https://pythonhosted.org/revscoring/).
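For orientation, the files can be opened with standard Python tooling; a minimal sketch, assuming the .pkl files contain pickled pandas objects and using the file names listed above:

```python
import zipfile

import pandas as pd

# Load the semantically labeled revisions per article and quality category.
revisions = pd.read_pickle("article_revisions_labeled.pkl")
print(revisions.head())

# Unpack the processed per-article features into their per-category folders;
# the individual feature files inside can then be loaded the same way.
with zipfile.ZipFile("article_revision_features.zip") as zf:
    zf.extractall("article_revision_features/")
```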
This dataset was created by java
This dataset contains the data and code necessary to replicate work in the following paper:

Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. "The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users." In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These data are the data used in the field experiment described in Study 2.

Description of Files

This dataset contains the following files beyond this README:

twa.RData: an RData file that includes all variables used in Study 2.
twa_analysis.R: a GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

The RData file contains one variable (d), an R dataframe (i.e., table) that includes the following columns:

userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
sample.date (date string): The day the user was recruited to the study, formatted as "YYYY-MM-DD". For invitees, it is the date their invitation was sent. For users in the control group, it is the date on which they would have been invited to the study.
edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page, and subpages are ignored.
edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page, and subpages are ignored.
treat (logical): TRUE if the user was invited; FALSE if the user was in the control group.
play (logical): TRUE if the user played the game; FALSE if the user did not. All users in control are listed as FALSE, because any user who had not been invited to the game but played was removed.
twa.level (integer): Takes a value of 0 if the user has not played the game; ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revisions per word) is a measure of edit quality developed by Halfaker et al. that tracks how long the words in an edit persist after subsequent revisions are made to the wiki page. For more information on how word persistence is calculated, see the following paper:

Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. "A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia." In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

How we created twa.RData

The file twa.RData combines datasets drawn from three places:

A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within the 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.

Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk, and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software, online at:
http://projects.mako.cc/source/?p=mediawiki_dump_tools
https://github.com/mediawiki-utilities/python-mediawiki-utilities

We o...

Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
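To work with these data outside of GNU R, one option is the pyreadr package; a minimal sketch, assuming pyreadr can parse this particular .RData file:

```python
import pyreadr

# read_r returns a dict-like mapping of R object names to pandas DataFrames.
result = pyreadr.read_r("twa.RData")
d = result["d"]  # the dataframe described above

# Example: compare mean article-namespace edits between invitees and control.
print(d.groupby("treat")["edits.ns0"].mean())
```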
Human communities have self-organizing properties in which specific Dunbar Numbers may be invoked to explain group attachments. By analyzing Wikipedia editing histories across a wide range of subject pages, we show that there is an emergent coherence in the size of transient groups formed to edit the content of subject texts, with two peaks averaging at around $N=8$ for the size corresponding to maximal contention, and at around $N=4$ as a regular team. These values are consistent with the observed sizes of conversational groups, as well as the hierarchical structuring of Dunbar graphs. We use the Promise Theory model of bipartite trust to derive a scaling law that fits the data and may apply to all group size distributions, when based on attraction to a seeded group process. In addition to providing further evidence that even spontaneous communities of strangers are self-organizing, the results have important implications for the governance of the Wikipedia commons and for the se...

Data sets are collected by direct scanning of Wikipedia's open platform data. The data have been processed by code described at https://github.com/markburgess/Trustability and documented in detail at http://markburgess.org/trustproject.html.

# Causal evidence for social group sizes from Wikipedia editing data
https://doi.org/10.5061/dryad.fn2z34v36
This is part of a project to formulate a practical Promise Theory model of trust for our Internet and machine-enabled age. It is not related to blockchain or so-called trustless technologies, and is not specifically based on cryptographic techniques. Rather, it addresses trustworthiness as an assessment of reliability in keeping specific promises, and trust as a tendency to monitor or oversee these processes.
The files contain measurement data obtained by parsing the history logs of many Wikipedia pages. While looking for evidence of signatures of trust, we discovered evidence of ad hoc group formation among users editing pages, consistent with the Dunbar number hypothesis.
We provide the cache of data used in our paper here, in accordance with procedure, but we encourage anyone to collect data themselves using the code referred to below or their o...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Date: 2020-06-03
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CaLiGraph is a large-scale semantic knowledge graph with a rich ontology, compiled from the DBpedia ontology and Wikipedia categories and list pages. For more information, visit http://caligraph.org
Information about uploaded files:
(all files are bzip2-compressed and in N-Triples format)
caligraph-metadata.nt.bz2
Metadata about the dataset, described using the VoID vocabulary.
caligraph-ontology.nt.bz2
Class definitions, property definitions, restrictions, and labels of the CaLiGraph ontology.
caligraph-ontology_dbpedia-mapping.nt.bz2
Mapping of classes and properties to the DBpedia ontology.
caligraph-ontology_provenance.nt.bz2
Provenance information about classes (i.e. which Wikipedia category or list page has been used to create this class).
caligraph-instances_types.nt.bz2
Definition of instances and (non-transitive) types.
caligraph-instances_transitive-types.nt.bz2
Transitive types for instances (can also be induced by a reasoner).
caligraph-instances_labels.nt.bz2
Labels for instances.
caligraph-instances_relations.nt.bz2
Relations between instances derived from the class restrictions of the ontology (can also be induced by a reasoner).
caligraph-instances_dbpedia-mapping.nt.bz2
Mapping of instances to respective DBpedia instances.
caligraph-instances_provenance.nt.bz2
Provenance information about instances (e.g. if the instance has been extracted from a Wikipedia list page).
dbpedia_caligraph-instances.nt.bz2
Additional instances of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
dbpedia_caligraph-types.nt.bz2
Additional types of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
dbpedia_caligraph-relations.nt.bz2
Additional relations of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
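The N-Triples files can be inspected with standard RDF tooling; a minimal sketch using rdflib, reasonable for the smaller files (the instance-level files are likely too large to load into memory this way):

```python
import bz2

from rdflib import Graph

# Decompress one of the bzip2-compressed N-Triples files and parse it.
with bz2.open("caligraph-ontology.nt.bz2", "rt", encoding="utf-8") as f:
    g = Graph()
    g.parse(data=f.read(), format="nt")

print(len(g), "triples loaded")
```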
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was created by Vesselin
Released under CC BY-SA 3.0
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Please cite 10.5281/zenodo.789289 for all versions of this dataset; this DOI will always resolve to the latest version.
-----------------
This dataset contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history.
This data would be exceedingly hard for an average potential user to create, as (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task.
Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and beyond, to be generated on complete-Wikipedia scale, ensuring quality and allowing researchers to forego the expensive text-comparison computation that has so far hindered scalable usage.
This dataset, its creation process and use cases are described in a dedicated dataset paper of the same name, published at the ICWSM 2017 conference. In this paper, we show how this data enables, on token level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts, re-additions and other metrics.
Tokenization used: https://gist.github.com/faflo/3f5f30b1224c38b1836d63fa05d1ac94
Toy example for how the token metadata is generated:
https://gist.github.com/faflo/8bd212e81e594676f8d002b175b79de8
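Conceptually, each token instance carries its origin revision plus its deletion and re-addition history; a minimal sketch of such a record (field names are illustrative, not the dataset's actual schema — see the gists above for the real format):

```python
# One token instance with its full edit history (illustrative field names).
token = {
    "str": "heliocentric",   # the token string
    "origin_rev_id": 1234,   # revision that first introduced the token
    "out": [2345, 4567],     # revisions in which the token was deleted
    "in": [3456],            # revisions in which it was re-added
}

# The token is present in the current article text if it was re-added
# at least as often as it was deleted.
is_present = len(token["in"]) >= len(token["out"])
print(is_present)  # False here: deleted twice, re-added once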
Be sure to read the ReadMe.txt or, for even more detail, the supporting paper referenced under "related identifiers".
https://creativecommons.org/publicdomain/zero/1.0/
Creation of a Parallel Terminology Bank from Wikipedia: The titles are taken from https://dumps.wikimedia.org/enwiki/latest/. By crawling "https://en.wikipedia.org/wiki/" + "title_text", I extracted each title's translations into 22 major Indian languages into a single CSV file. The total number of words for the English wiki is 58,210,906.
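Instead of crawling the rendered HTML pages, the same interlanguage titles can be fetched through the MediaWiki API; a minimal sketch (the language codes are chosen for illustration):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def langlinks(title, langs=("hi", "ta", "te", "bn")):
    """Return interlanguage titles for one English Wikipedia page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": "max",
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])
            if ll["lang"] in langs}

print(langlinks("Water"))  # dict of language code -> translated title
```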
https://choosealicense.com/licenses/undefined/
Dataset Card for [Dataset Name]
See the full description on the dataset page: https://huggingface.co/datasets/israfelsr/img-wikipedia-simple.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by aceofspades914
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M named entities.
By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies, (a) domain-dependent and (b) domain-independent, and produce two additional versions by post-processing the raw collections. As a result, we introduce three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (varying between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location", and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.
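As a sketch of this reduction step (the type names below are illustrative, not the actual Freebase-derived inventory):

```python
# Map fine-grained entity types to the four coarse NER labels; fine-grained
# types without a sufficiently similar coarse label are dropped, which is
# why the coarse-grained datasets cover fewer domains and sentences.
COARSE_MAP = {
    "politician": "person",
    "author": "person",
    "football_team": "organization",
    "government_agency": "organization",
    "river": "location",
    "city": "location",
    "award": "misc",
}

def to_coarse(fine_type: str):
    return COARSE_MAP.get(fine_type)  # None means the annotation is dropped

print(to_coarse("politician"))  # person
```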
All processes are explained in our published white paper for Turkish; the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Galician Wikipedia Contribution Tryads Dataset is a comprehensive collection of 200k article pages and history data of 3M revisions by 76k editors from Galipedia, the Galician edition of Wikipedia. This dataset is derived from the November 2023 data dumps and provides insights into the editing patterns, contributor dynamics, and page content within the Galician Wikipedia.
articles/{article_id}.txt
pages.csv
contains basic information for each article page (id, title, creation date, last edit date).
revisions.csv
contains information on all revisions (date, contributor id, revision comment, minor edit mark).
page_links.csv
contains individual, unique internal links between Wikipedia pages (source article id, linked article id).
contributors.csv
contains data of unique contributors (contributor id, contributor username).
Researchers are encouraged to use this dataset responsibly and adhere to ethical guidelines. Please provide proper attribution to this dataset and the Galician Wikipedia in any publications, and exercise caution and rigor in the interpretation of any data analysis or results.
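A minimal sketch for joining the tables with pandas (column names are inferred from the field lists above and may need adjusting to the actual CSV headers):

```python
import pandas as pd

revisions = pd.read_csv("revisions.csv")
contributors = pd.read_csv("contributors.csv")

# Rank contributors by number of revisions, joined to their usernames.
top = (revisions.merge(contributors, on="contributor_id")
                .groupby("contributor_username")
                .size()
                .sort_values(ascending=False))
print(top.head(10))
```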
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract (our paper)
This paper investigates page views and interlanguage links on Wikipedia for Japanese comic analysis. It is based on a preliminary investigation and obtained three results, but the analysis is as yet insufficient for immediate use in market research. I am looking for research collaborators in order to conduct a more detailed analysis.
Data
Publication
This data set was created for our study. If you make use of this data set, please cite: Mitsuo Yoshida. Preliminary Investigation for Japanese Comic Analysis using Wikipedia. Proceedings of the Fifth Asian Conference on Information Systems (ACIS 2016). pp.229-230, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
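For illustration, the anonymization step amounts to the following (the exact input formatting of the raw IDs is an assumption here):

```python
import hashlib

raw_post_id = "t3_exampleid"  # hypothetical raw Reddit post ID
hashed_id = hashlib.sha256(raw_post_id.encode("utf-8")).hexdigest()
print(hashed_id)  # 64-character hex digest stored in place of the raw ID
```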
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
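Assuming the SQL database is shipped as a SQLite file (file name hypothetical), the tables can be queried directly; a minimal sketch joining posts to their extracted Wikipedia links:

```python
import sqlite3

con = sqlite3.connect("wikireddit.db")  # hypothetical file name
rows = con.execute(
    """
    SELECT p.post_id, p.score, l.final_url
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    LIMIT 10
    """
).fetchall()
for post_id, score, url in rows:
    print(post_id, score, url)
con.close()
```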
https://www.sci-tech-today.com/privacy-policy
YouTube Creator Statistics: YouTube is the world's largest video-sharing platform. It was launched in February 2005 and purchased by Google in 2006. As of 2024, it is the world's second most-viewed website after Google Search. YouTube has managed to create an unprecedented social impact on the world and has been instrumental in changing the overall dynamics of social media presence.
It has been a dominant force in shaping internet trends and creating millionaire celebrities. Likewise, it would be interesting to highlight YouTube creator statistics to gain valuable information on how this video-sharing platform has profoundly impacted the internet world.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).
WikiMed
Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.
WikiMed contains:
Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.
PubMedDS
Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.
PubMedDS contains:
Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.
Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.
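The mention-recognition step described above can be approximated with scispaCy; a minimal sketch, assuming the small en_core_sci_sm model is installed (the exact model used for PubMedDS is not specified here):

```python
import spacy

# Requires: pip install scispacy, plus the en_core_sci_sm model package.
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Treatment of type 2 diabetes mellitus with metformin.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)
```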
Data format
Both datasets use JSON format with one document per line. Each document has the following structure:
{
"_id": "A unique identifier of each document",
"text": "Contains text over which mentions are ",
"title": "Title of Wikipedia/PubMed Article",
"split": "[Not in PubMedDS] Dataset split: