As of December 2023, the English subdomain of Wikipedia had around 6.91 million published articles, making it the largest subdomain of the website by both number of entries and registered active users. Cebuano, the only Asian language among the top 10, had the second-most articles on the portal, amassing around 6.11 million entries, while German and French ranked third and fourth with over 2.9 million and 2.6 million entries respectively. However, while most Wikipedia articles in English and other European languages are written by humans, entries in Cebuano are reportedly mostly generated by bots.
In March 2024, close to 4.4 billion unique global visitors visited Wikipedia.org, slightly down from 4.6 billion visitors in January of the same year. Wikipedia is a free online encyclopedia with articles written by volunteers worldwide. The platform is hosted by the Wikimedia Foundation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke its record for most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read - and in some cases also contribute to - an ever-growing pool of articles with knowledge, information and data about the virus. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how was the scientific research in this field represented on Wikipedia. Using citations as a readout, we map how COVID-19-related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge ecosphere during the pandemic.
To delimit the corpus of Wikipedia articles containing a Digital Object Identifier (DOI), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we ran a EuroPMC search on COVID-19, SARS-CoV2 and SARS-nCoV19 (30'000 scientific papers, reviews and preprints), selected the scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2'000'000 DOIs). This search led to 231 Wikipedia articles that either contained at least one citation from the EuroPMC search or were COVID-19 project pages containing DOIs. Next, from this corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions.
Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method produced tables of annotated citations and information extracted from each Wikipedia article, such as books, websites and newspapers. Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code, as well as the other bash/python utility scripts related to this project.
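The identifier-extraction step described above can be sketched with regular expressions. This is a minimal, hedged illustration: the actual patterns used by the pipeline are in the WikiCitationHistoRy repository, and the expressions below are simplified examples, not the project's exact code.

```python
import re

# Illustrative patterns (assumptions, not the project's exact regexes):
# a DOI is "10." + a 4-9 digit registrant code + "/" + a suffix;
# a PMID typically appears as a numeric "pmid=" field in a {{cite}} template.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s|}\]<>"]+')
PMID_RE = re.compile(r'pmid\s*=\s*(\d{1,8})', re.IGNORECASE)

def extract_identifiers(wikitext):
    """Return the DOIs and PMIDs found in a chunk of Wikipedia wikitext."""
    return {
        "dois": DOI_RE.findall(wikitext),
        "pmids": PMID_RE.findall(wikitext),
    }

sample = '{{cite journal |doi=10.1038/s41586-020-2012-7 |pmid=32015507 }}'
ids = extract_identifiers(sample)
```

In practice DOI suffixes can contain almost any printable character, so real-world extraction needs careful handling of trailing punctuation and template delimiters.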
In the six months ending March 2024, the United States accounted for 25.66 percent of traffic to Wikipedia.org. Japan was ranked second, accounting for over five percent of web visits to the website, closely followed by the United Kingdom and Germany.
The most viewed English-language article on Wikipedia in 2024 was "Deaths in 2024", with a total of 44.4 million views. Political topics also dominated the list, with articles related to the 2024 U.S. presidential election and key political figures like Kamala Harris and Donald Trump ranking among the top ten most viewed pages.
Wikipedia's language diversity
As of December 2024, the English Wikipedia subdomain contained approximately 6.91 million articles, making it the largest in terms of content and registered active users. Interestingly, the Cebuano language ranked second with around 6.11 million entries, although many of these articles are reportedly generated by bots. German and French followed as the next most active European language subdomains, each with over 18,000 active users. Compared to the rest of the internet, as of January 2024, English was the primary language for over 52 percent of websites worldwide, far outpacing Spanish at 5.5 percent and German at 4.8 percent.
Global traffic to Wikipedia.org
Hosted by the Wikimedia Foundation, Wikipedia.org saw around 4.4 billion unique global visits in March 2024, a slight decrease from 4.6 billion visitors in January. In addition, as of January 2024, Wikipedia ranked amongst the top ten websites with the most referring subnets worldwide.
The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges CSV files contain the edges; nodes are indexed from 0. The features JSON files contain the features of articles: each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target CSV contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes and edges along with some other descriptive statistics.
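Loading the edge lists described above is straightforward with the standard library. A minimal sketch, assuming the usual "id1,id2" header of these edge CSVs (check the files in the archive you downloaded); the demo runs on a tiny synthetic edge list in the same format.

```python
import csv
import io
from collections import defaultdict

def load_edges(csv_file):
    """Read the mutual-link edge list; nodes are 0-indexed integers."""
    reader = csv.reader(csv_file)
    next(reader)                      # skip the header row
    return [(int(a), int(b)) for a, b in reader]

def degree_counts(edges):
    """Undirected degree of every node appearing in the edge list."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)

# Tiny synthetic example in the same format as the edges CSV files:
sample = io.StringIO("id1,id2\n0,1\n1,2\n0,2\n")
edges = load_edges(sample)
deg = degree_counts(edges)
```

The same edge list can be handed directly to a graph library (e.g. networkx's `Graph.add_edges_from`) for richer analysis against the per-page traffic targets.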
📃 Paper | 🤗 Hugging Face | ⭐ Github
Dataset Overview
In the table below, we provide a brief summary of the dataset statistics.
| Category | Size |
| --- | --- |
| Total Sample | 2019163 |
| Total Image | 2019163 |
| Average Answer Length | 84 |
| Maximum Answer Length | 5851 |
JSON Overview
Each dictionary in the JSON file contains three keys: 'id', 'image', and 'conversations'. The 'id' is the unique identifier for the current data in the entire dataset. The 'image' stores… See the full description on the dataset page: https://huggingface.co/datasets/Ghaser/Wikipedia-Knowledge-2M.
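A record with the three keys described above can be consumed as follows. This is a hedged sketch: the inner layout of 'conversations' (a list of role/value turns) is an assumption modeled on similar visual-instruction datasets, and the field values below are invented for illustration; consult the dataset page for the real schema.

```python
import json

# Hypothetical sample record in the described shape: 'id', 'image', 'conversations'.
sample_record = json.loads("""
{
  "id": "sample-0",
  "image": "images/sample-0.jpg",
  "conversations": [
    {"from": "human", "value": "What does this article describe?"},
    {"from": "gpt", "value": "It describes ..."}
  ]
}
""")

def answer_length(record):
    """Character length of the model-side turns, a rough proxy for the
    'Average Answer Length' statistic in the table above."""
    return sum(len(t["value"]) for t in record["conversations"]
               if t["from"] == "gpt")
```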
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia articles use Wikidata to list the links to the same article in other language versions. Therefore, each Wikipedia language edition stores the Wikidata Q-id for each article.
This dataset constitutes a Wikipedia link graph where all the article identifiers are normalized to Wikidata Q-ids. It contains the normalized links from all Wikipedia language versions. Detailed link count statistics are attached. Note that articles that have no incoming nor outgoing links are not part of this graph.
The format is as follows:
1. Q-id of linking page (outgoing)
2. Q-id of linked page (incoming)
3. language version - dump date (20241101)
This dataset was used to compute Wikidata PageRank. More information can be found on the danker repository, where the source code of the link extraction as well as the PageRank computation is hosted.
Example entries:
$ bzcat 2024-11-06.allwiki.links.bz2 | head
1 107 ckbwiki-20241101
1 107 lawiki-20241101
1 107 ltwiki-20241101
1 107 tewiki-20241101
1 107 wuuwiki-20241101
1 111 hywwiki-20241101
1 11379 bat_smgwiki-20241101
1 11471 cdowiki-20241101
1 150 ckbwiki-20241101
1 150 lowiki-20241101
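Parsing the dump follows directly from the three-field format described above (for the real file, open it with `bz2.open(path, "rt")` instead of a list of strings). A minimal sketch; the aggregation into incoming-link counts is an illustrative step, not part of the published danker pipeline.

```python
from collections import Counter

def parse_links(lines):
    """Yield (source_qid, target_qid, wiki_tag) triples from dump lines."""
    for line in lines:
        src, dst, tag = line.split()
        yield int(src), int(dst), tag

def indegree(triples):
    """Count incoming links per target Q-id across all language versions."""
    return Counter(dst for _src, dst, _tag in triples)

# Entries in the same shape as the example output above:
sample = [
    "1 107 ckbwiki-20241101",
    "1 107 lawiki-20241101",
    "1 111 hywwiki-20241101",
]
counts = indegree(parse_links(sample))
```

Note that the same (source, target) pair appears once per language version, so a PageRank-style computation would typically either keep the language dimension or deduplicate pairs first.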
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using Wikipedia data to study AI ethics.
This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the page view statistics for all the WikiMedia projects in the year 2014, ordered by (project, page, timestamp). It has been generated starting from WikiMedia's pagecounts-raw[1] dataset.
The CSV uses spaces as delimiter, without any form of escaping because it is not needed. It has 5 columns:
* project: the project name
* page: the page requested, url-escaped
* timestamp: the timestamp of the hour (format: "%Y%m%d-%H%M%S")
* count: the number of times the page has been requested (in that hour)
* bytes: the number of bytes transferred (in that hour)
You can download the full dataset via torrent[2]. Further information about this dataset is available at: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/
[1] https://dumps.wikimedia.org/other/pagecounts-raw/
[2] http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/#download
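A single line of this space-delimited CSV can be decoded as follows. This is a minimal sketch based on the five-column layout described above; the sample line is invented for illustration.

```python
from urllib.parse import unquote

def parse_pagecount_line(line):
    """Decode one row: project, url-escaped page, hour timestamp, count, bytes."""
    project, page, timestamp, count, nbytes = line.split(" ")
    return {
        "project": project,
        "page": unquote(page),      # undo the URL-escaping of the title
        "timestamp": timestamp,     # format "%Y%m%d-%H%M%S"
        "count": int(count),
        "bytes": int(nbytes),
    }

# Hypothetical sample row:
row = parse_pagecount_line("en Albert_Einstein 20140101-000000 42 1234567")
```

Because page titles are URL-escaped and fields are unescaped-space-delimited, the five-way split is safe; only the page field needs decoding afterwards.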
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
simple wikipedia
the 'simple' split of Wikipedia, from Sept 1 2023. The train split contains about 65M tokens. Pulled via:
dataset = load_dataset(
    "wikipedia",
    language="simple",
    date="20230901",
    beam_runner="DirectRunner",
)
stats
train split
general info
0 id… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/simple_wikipedia.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set has been derived from the Simple English Wikipedia data publicly available and post-processed in the WikiWarMonitor project. The data set this is derived from is available from: http://wwm.phy.bme.hu/light.html
The data set comprises a collection of 15 CSV files with summary statistics of the contributors to Wikipedia (Simple English only) in the period of 18/05/2001 to 17/10/2012. The files cover:
Statistics of registered users, anonymous users and bots.
History of revert activity
History of edit wars
Activity statistics broken down into weekly snapshots
Each CSV file has a descriptive header that is generally self-explanatory, so the data is not further described here. However, note that in the activity_snapshots_aggregated.csv file, the edit war conditions are as follows:
Condition 1: ongoing (started before the snapshot and continues)
Condition 2: started and finished within the snapshot
Condition 3: started within the snapshot, but did not finish yet
Condition 4: started before the snapshot, but finished within the snapshot
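The four condition codes above can be captured as a simple lookup when post-processing activity_snapshots_aggregated.csv. A minimal sketch; how the codes are stored in the CSV columns should be checked against the file's own header.

```python
# Lookup for the edit-war condition codes documented above, each relative
# to a given weekly snapshot window.
EDIT_WAR_CONDITIONS = {
    1: "ongoing (started before the snapshot and continues)",
    2: "started and finished within the snapshot",
    3: "started within the snapshot, but did not finish yet",
    4: "started before the snapshot, but finished within the snapshot",
}

def describe_condition(code):
    """Human-readable description of an edit-war condition code."""
    return EDIT_WAR_CONDITIONS.get(code, "unknown condition code")
```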
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
List of 399684 digital object identifiers (185790 unique) linked from the pages of Wikipedia in all languages (the 140 most visited subdomains according to stats.wikimedia.org, one file for each) and available on DOAI.io as redirects to a URL other than dx.doi.org.
Those DOIs represent a cross section of research publications which are significant to the larger community of citizens and are available to them thanks to the green Open Access repositories.
The script used to produce the dataset is also attached.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
This data was produced by Wikimedia Germany's Technical Wishes team, and focuses on usage statistics for reference footnotes made using the Cite extension, across Main-namespace pages (articles) on nearly all Wikimedia sites. It was produced by processing the Wikimedia Enterprise HTML dumps.
Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research. Our specific goal was to understand the potential for improving the ways in which references can be reused within a page. Reference tags are frequently used in conjunction with wikitext templates, which makes them challenging to analyse in wikitext form. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext. We didn't look at reuse across pages for this analysis.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/ The source code is distributed under BSD-3-Clause.
Source code and pluggable framework
The dumps were processed by the HTML dump scraper v0.3.1, written in the Elixir language. The job was run on the Wikimedia Analytics Cluster to take advantage of its high-speed access to HTML dumps. The production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs". Our team plans to continue development of the scraper to support future projects as well. Suggestions for new or improved analysis units are welcomed.
Data format
Files are provided at several levels of granularity, from per-page and per-wiki analysis through all-wikis comparisons. Files are either ND-JSON (newline-delimited JSON), plain JSON or CSV.
Column definitions
Columns are documented in metrics.md.
Page summaries
Fine-grained results in which each line represents the summarization of a single wiki page. Example file name: enwiki-20240501-page-summary.ndjson.gz
Example metrics found in these files:
* How many reference tags are created from templates vs. directly in the article.
* How many references contain a template transclusion to produce their content.
* How many references are unnamed, automatically named, or manually named.
* How often references are reused via their name.
* Copy-pasted references that share the same or almost the same content, on the same page.
* Whether an article has more than one references list.
Wiki summaries
Page analyses are rolled up to the wiki level, in a separate file for each wiki. Example file name: enwiki-20240501-summary.json
Top-level comparison
Summarized statistics for each wiki are collected into a single file. Non-scalar fields are discarded for now and various aggregations are used, as can be seen from the aggregated column name suffixes. File name: all-wikis-20240501-summary.csv
Error count comparison
We're also collecting a total count of different Cite errors for each wiki. File name: all-wikis-20240501-cite-error-summary.csv
Environmental costs
There were several rounds of experimentation and mistakes, so the costs below should be multiplied by 3-4. The computation took 4.5 days at 24x vCPU sharing 2 GB of memory at a data center in Virginia, US. Estimating the environmental impact through https://www.green-algorithms.org/ we get an upper bound of 12.6 kg CO2e, or 40.8 kWh, or 72 km driven in a passenger car. Disk usage was significant as well, with 827 GB read and 4 GB written. At the high estimate of 7 kWh/GB, this could have used as much as 5.8 MWh of energy, but likely much less since streaming was contained within one data center.
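Consuming the per-page ND-JSON summaries amounts to streaming one JSON object per line. A hedged sketch: the metric field name "ref_count" below is an invented example, since the real column names are defined in metrics.md.

```python
import gzip
import json

def iter_page_summaries(path):
    """Stream one JSON object per line from a *-page-summary.ndjson.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

def total_metric(records, field):
    """Sum an integer per-page metric over all records (0 when absent)."""
    return sum(r.get(field, 0) for r in records)

# Hypothetical records in the per-line shape ("ref_count" is an assumed field):
records = [{"page": "A", "ref_count": 3}, {"page": "B", "ref_count": 5}]
total = total_metric(records, "ref_count")
```

Streaming line by line keeps memory flat even for the largest wikis, which matters given the size of the enwiki summary file.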
As of March 2024, Reddit.com accounted for 34.52 percent of social media referral traffic to Wikipedia.org. The website's second-largest social media traffic driver was YouTube.com, which generated 29.6 percent of social media traffic to the platform.
As of December 2024, the English subdomain of Wikipedia was by far the largest in terms of participation, with more than 122 thousand active registered users, and also the largest by number of entries, with around 6.91 million published articles. The French and German subdomains followed, each with over 18 thousand active users.
https://www.sci-tech-today.com/privacy-policy
Notable Ransomware Statistics: Even in 2024, ransomware ranked among the most disruptive and expensive types of cybercrime. Ransomware is malicious software that blocks people from accessing their devices or data until a ransom is paid, and it keeps evolving over time, targeting both individuals and companies.
Data as of 2024 indicated an upward trend in the prevalence and economic losses caused by ransomware attacks throughout the world. Below are some notable ransomware statistics to consider for the year 2024.
In 2024, the English-language Wikipedia recorded over 31.2 million edits, significantly outpacing other domains. The German and French domains followed with over five million edits, while the Spanish version registered around 4.78 million. Throughout the year, Wikipedia registered more than 81.9 million edits across its platform.
As of the third quarter of 2024, Wikipedia's mobile app generated 929,410 downloads worldwide. This figure represents a modest increase from previous quarters but falls short of the app's peak popularity in 2015, when quarterly downloads exceeded four million.