As of December 2023, the English subdomain of Wikipedia had around 6.91 million published articles, making it the largest subdomain of the website by number of entries and registered active users. German and French ranked third and fourth, with around 2.96 million and 2.65 million entries respectively. Cebuano, the only Asian language among the top 10, was the language with the second-most articles on the portal, amassing around 6.11 million entries. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was analyzed and produced during the study described in the paper "Relating Wikipedia Article Quality to Edit Behavior and Link Structure" (under review; DOI and link will follow, see references). Its creation process and use cases are described in the dedicated paper.
For directions and code to process and evaluate this data, please see the corresponding GitHub repository: https://github.com/ruptho/editlinkquality-wikipedia.
We provide three files for 4,941 Wikipedia articles (in .pkl format):
"article_revisions_labeled.pkl" provides the final, semantically labeled revisions for each analyzed article per quality category.
"article_revision_features.zip" contains processed per-article features, divided into folders for the specific quality categories they belong to.
"article_revision_features_raw.zip" provides the raw features as retrieved via the RevScoring API (https://pythonhosted.org/revscoring/).
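For orientation, the files can be opened with standard Python tooling; a minimal sketch, assuming the .pkl files contain pickled pandas objects and using the file names listed above:

```python
import zipfile

import pandas as pd

# Load the semantically labeled revisions per article and quality category.
revisions = pd.read_pickle("article_revisions_labeled.pkl")
print(revisions.head())

# Unpack the processed per-article features into their per-category folders;
# the individual feature files inside can then be loaded the same way.
with zipfile.ZipFile("article_revision_features.zip") as zf:
    zf.extractall("article_revision_features/")
```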
This dataset was created by java
This dataset contains the data and code necessary to replicate work in the following paper:

Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. "The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users." In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These data are the data used in the field experiment described in Study 2.

Description of Files

This dataset contains the following files beyond this README:

twa.RData: an RData file that includes all variables used in Study 2.
twa_analysis.R: a GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

The RData file contains one variable (d), an R dataframe (i.e., table) that includes the following columns:

userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
sample.date (date string): The day the user was recruited to the study, formatted as "YYYY-MM-DD". For invitees, it is the date their invitation was sent. For users in the control group, it is the date on which they would have been invited to the study.
edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page, and subpages are ignored.
edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to the user's user page, user talk page, and subpages are ignored.
treat (logical): TRUE if the user was invited; FALSE if the user was in the control group.
play (logical): TRUE if the user played the game; FALSE if the user did not. All users in control are listed as FALSE, because any user who had not been invited to the game but played was removed.
twa.level (integer): Takes a value of 0 if the user has not played the game; ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revisions per word) is a measure of edit quality developed by Halfaker et al. that tracks how long the words in an edit persist after subsequent revisions are made to the wiki page. For more information on how word persistence is calculated, see the following paper:

Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. "A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia." In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

How we created twa.RData

The file twa.RData combines datasets drawn from three places:

A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within the 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.

Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk, and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software, online at:
http://projects.mako.cc/source/?p=mediawiki_dump_tools
https://github.com/mediawiki-utilities/python-mediawiki-utilities

We o...

Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
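To work with these data outside of GNU R, one option is the pyreadr package; a minimal sketch, assuming pyreadr can parse this particular .RData file:

```python
import pyreadr

# read_r returns a dict-like mapping of R object names to pandas DataFrames.
result = pyreadr.read_r("twa.RData")
d = result["d"]  # the dataframe described above

# Example: compare mean article-namespace edits between invitees and control.
print(d.groupby("treat")["edits.ns0"].mean())
```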
Human communities have self-organizing properties in which specific Dunbar Numbers may be invoked to explain group attachments. By analyzing Wikipedia editing histories across a wide range of subject pages, we show that there is an emergent coherence in the size of transient groups formed to edit the content of subject texts, with two peaks averaging at around $N=8$ for the size corresponding to maximal contention, and at around $N=4$ as a regular team. These values are consistent with the observed sizes of conversational groups, as well as the hierarchical structuring of Dunbar graphs. We use the Promise Theory model of bipartite trust to derive a scaling law that fits the data and may apply to all group size distributions, when based on attraction to a seeded group process. In addition to providing further evidence that even spontaneous communities of strangers are self-organizing, the results have important implications for the governance of the Wikipedia commons and for the se...

Data sets are collected by direct scanning of Wikipedia's open platform data. The data have been processed by code described at https://github.com/markburgess/Trustability and documented in detail at http://markburgess.org/trustproject.html.

# Causal evidence for social group sizes from Wikipedia editing data
https://doi.org/10.5061/dryad.fn2z34v36
This is part of a project to formulate a practical Promise Theory model of trust for our Internet and machine-enabled age. It is not related to blockchain or so-called trustless technologies, and is not specifically based on cryptographic techniques. Rather, it addresses trustworthiness as an assessment of reliability in keeping specific promises, and trust as a tendency to monitor or oversee these processes.
The files contain measurement data obtained by parsing the history logs of many Wikipedia pages. While looking for evidence of signatures of trust, we discovered evidence of ad hoc group formation among users editing pages, consistent with the Dunbar number hypothesis.
We provide the cache of data used in our paper here, in accordance with procedure, but we encourage anyone to collect data themselves using the code referred to below or their o...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Date: 2020-06-03
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CaLiGraph is a large-scale semantic knowledge graph with a rich ontology, compiled from the DBpedia ontology and Wikipedia categories and list pages. For more information, visit http://caligraph.org
Information about uploaded files:
(all files are bzip2-compressed and in N-Triples format)
caligraph-metadata.nt.bz2
Metadata about the dataset, described using the VoID vocabulary.
caligraph-ontology.nt.bz2
Class definitions, property definitions, restrictions, and labels of the CaLiGraph ontology.
caligraph-ontology_dbpedia-mapping.nt.bz2
Mapping of classes and properties to the DBpedia ontology.
caligraph-ontology_provenance.nt.bz2
Provenance information about classes (i.e. which Wikipedia category or list page has been used to create this class).
caligraph-instances_types.nt.bz2
Definition of instances and (non-transitive) types.
caligraph-instances_transitive-types.nt.bz2
Transitive types for instances (can also be induced by a reasoner).
caligraph-instances_labels.nt.bz2
Labels for instances.
caligraph-instances_relations.nt.bz2
Relations between instances derived from the class restrictions of the ontology (can also be induced by a reasoner).
caligraph-instances_dbpedia-mapping.nt.bz2
Mapping of instances to respective DBpedia instances.
caligraph-instances_provenance.nt.bz2
Provenance information about instances (e.g. if the instance has been extracted from a Wikipedia list page).
dbpedia_caligraph-instances.nt.bz2
Additional instances of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
dbpedia_caligraph-types.nt.bz2
Additional types of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
dbpedia_caligraph-relations.nt.bz2
Additional relations of CaLiGraph that are not in DBpedia.
! This file is not part of CaLiGraph but should rather be used as an extension of DBpedia. The contained triples are formulated in the DBpedia namespace, so they can be used together with DBpedia itself. !
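The N-Triples files can be inspected with standard RDF tooling; a minimal sketch using rdflib, reasonable for the smaller files (the instance-level files are likely too large to load into memory this way):

```python
import bz2

from rdflib import Graph

# Decompress one of the bzip2-compressed N-Triples files and parse it.
with bz2.open("caligraph-ontology.nt.bz2", "rt", encoding="utf-8") as f:
    g = Graph()
    g.parse(data=f.read(), format="nt")

print(len(g), "triples loaded")
```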
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset was created by Vesselin
Released under CC BY-SA 3.0
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Please cite 10.5281/zenodo.789289 for all versions of this dataset; this DOI will always resolve to the latest version.
-----------------
This dataset contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history.
This data would be exceedingly hard for an average potential user to create, as (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task.
Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and beyond, to be generated on complete-Wikipedia scale, ensuring quality and allowing researchers to forego the expensive text-comparison computation that has so far hindered scalable usage.
This dataset, its creation process and use cases are described in a dedicated dataset paper of the same name, published at the ICWSM 2017 conference. In this paper, we show how this data enables, on token level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts, re-additions and other metrics.
Tokenization used: https://gist.github.com/faflo/3f5f30b1224c38b1836d63fa05d1ac94
Toy example for how the token metadata is generated:
https://gist.github.com/faflo/8bd212e81e594676f8d002b175b79de8
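Conceptually, each token instance carries its origin revision plus its deletion and re-addition history; a minimal sketch of such a record (field names are illustrative, not the dataset's actual schema — see the gists above for the real format):

```python
# One token instance with its full edit history (illustrative field names).
token = {
    "str": "heliocentric",   # the token string
    "origin_rev_id": 1234,   # revision that first introduced the token
    "out": [2345, 4567],     # revisions in which the token was deleted
    "in": [3456],            # revisions in which it was re-added
}

# The token is present in the current article text if it was re-added
# at least as often as it was deleted.
is_present = len(token["in"]) >= len(token["out"])
print(is_present)  # False here: deleted twice, re-added once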
Be sure to read the ReadMe.txt or, for even more detail, the supporting paper referenced under "related identifiers".
https://creativecommons.org/publicdomain/zero/1.0/
Creation of a Parallel Terminology Bank from Wikipedia: The titles are taken from https://dumps.wikimedia.org/enwiki/latest/. By crawling "https://en.wikipedia.org/wiki/" + "title_text", I extracted each title's translations into 22 major Indian languages into a single CSV file. The total number of words for the English wiki is 58,210,906.
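Instead of crawling the rendered HTML pages, the same interlanguage titles can be fetched through the MediaWiki API; a minimal sketch (the language codes are chosen for illustration):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def langlinks(title, langs=("hi", "ta", "te", "bn")):
    """Return interlanguage titles for one English Wikipedia page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": "max",
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])
            if ll["lang"] in langs}

print(langlinks("Water"))  # dict of language code -> translated title
```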
https://choosealicense.com/licenses/undefined/
Dataset Card for [Dataset Name]
See the full description on the dataset page: https://huggingface.co/datasets/israfelsr/img-wikipedia-simple.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by aceofspades914
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M named entities.
By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies, (a) domain-dependent and (b) domain-independent, and produce two additional versions by post-processing the raw collections. As a result, we introduce three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (varying between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location", and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.
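As a sketch of this reduction step (the type names below are illustrative, not the actual Freebase-derived inventory):

```python
# Map fine-grained entity types to the four coarse NER labels; fine-grained
# types without a sufficiently similar coarse label are dropped, which is
# why the coarse-grained datasets cover fewer domains and sentences.
COARSE_MAP = {
    "politician": "person",
    "author": "person",
    "football_team": "organization",
    "government_agency": "organization",
    "river": "location",
    "city": "location",
    "award": "misc",
}

def to_coarse(fine_type: str):
    return COARSE_MAP.get(fine_type)  # None means the annotation is dropped

print(to_coarse("politician"))  # person
```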
All processes are explained in our published white paper for Turkish; the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Galician Wikipedia Contribution Tryads Dataset is a comprehensive collection of 200k article pages and history data of 3M revisions by 76k editors from Galipedia, the Galician edition of Wikipedia. This dataset is derived from the November 2023 data dumps and provides insights into the editing patterns, contributor dynamics, and page content within the Galician Wikipedia.
articles/{article_id}.txt
pages.csv
contains basic information for each article page (id, title, creation date, last edit date).
revisions.csv
contains information on all revisions (date, contributor id, revision comment, minor edit mark).
page_links.csv
contains individual, unique internal links between Wikipedia pages (source article id, linked article id).
contributors.csv
contains data of unique contributors (contributor id, contributor username).
Researchers are encouraged to use this dataset responsibly and adhere to ethical guidelines. Please provide proper attribution to this dataset and the Galician Wikipedia in any publications, and exercise caution and rigor in the interpretation of any data analysis or results.
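A minimal sketch for joining the tables with pandas (column names are inferred from the field lists above and may need adjusting to the actual CSV headers):

```python
import pandas as pd

revisions = pd.read_csv("revisions.csv")
contributors = pd.read_csv("contributors.csv")

# Rank contributors by number of revisions, joined to their usernames.
top = (revisions.merge(contributors, on="contributor_id")
                .groupby("contributor_username")
                .size()
                .sort_values(ascending=False))
print(top.head(10))
```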
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract (our paper)
This paper investigates page views and interlanguage links on Wikipedia for Japanese comic analysis. It is based on a preliminary investigation and obtained three results, but the analysis is as yet insufficient for immediate use in market research. I am looking for research collaborators in order to conduct a more detailed analysis.
Data
Publication
This data set was created for our study. If you make use of this data set, please cite: Mitsuo Yoshida. Preliminary Investigation for Japanese Comic Analysis using Wikipedia. Proceedings of the Fifth Asian Conference on Information Systems (ACIS 2016). pp.229-230, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
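For illustration, the anonymization step amounts to the following (the exact input formatting of the raw IDs is an assumption here):

```python
import hashlib

raw_post_id = "t3_exampleid"  # hypothetical raw Reddit post ID
hashed_id = hashlib.sha256(raw_post_id.encode("utf-8")).hexdigest()
print(hashed_id)  # 64-character hex digest stored in place of the raw ID
```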
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
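Assuming the SQL database is shipped as a SQLite file (file name hypothetical), the tables can be queried directly; a minimal sketch joining posts to their extracted Wikipedia links:

```python
import sqlite3

con = sqlite3.connect("wikireddit.db")  # hypothetical file name
rows = con.execute(
    """
    SELECT p.post_id, p.score, l.final_url
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    LIMIT 10
    """
).fetchall()
for post_id, score, url in rows:
    print(post_id, score, url)
con.close()
```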
https://www.sci-tech-today.com/privacy-policy
YouTube Creator Statistics: YouTube is the world's largest video-sharing platform. It was launched in February 2005 and purchased by Google in 2006. As of 2024, it is the world's second most-viewed website after Google Search. YouTube has managed to create an unprecedented social impact on the world and has been instrumental in changing the overall dynamics of social media presence.
It has been a dominant force in shaping internet trends and creating millionaire celebrities. Likewise, it would be interesting to highlight YouTube creator statistics to gain valuable information on how this video-sharing platform has profoundly impacted the internet world.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).
WikiMed
Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.
WikiMed contains:
Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.
PubMedDS
Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.
PubMedDS contains:
Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.
Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.
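The mention-recognition step described above can be approximated with scispaCy; a minimal sketch, assuming the small en_core_sci_sm model is installed (the exact model used for PubMedDS is not specified here):

```python
import spacy

# Requires: pip install scispacy, plus the en_core_sci_sm model package.
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Treatment of type 2 diabetes mellitus with metformin.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)
```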
Data format
Both datasets use JSON format with one document per line. Each document has the following structure:
{
"_id": "A unique identifier of each document",
"text": "Contains text over which mentions are ",
"title": "Title of Wikipedia/PubMed Article",
"split": "[Not in PubMedDS] Dataset split: