License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset of the most viewed Wikipedia pages was extracted from a Flourish visualisation.
Traffic analytics, rankings, and competitive metrics for wikipedia.org as of October 2025.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Phase 1 snapshot of English Wikipedia articles collected in November 2025. Each row includes the human-maintained WikiProject quality and importance labels (e.g. Stub→FA, Low→Top), along with structural features gathered from the MediaWiki API. The corpus is designed for training quality estimators, monitoring coverage, and prioritising editorial workflows.
Files:
- full.csv: complete dataset (~1.5M rows)
- full_part01.csv – full_part15.csv: 100k-row chunks (the final file contains the remainder)
- sample_10k.csv, sample_100k.csv: stratified samples for quick experimentation
- prepare_kaggle_release.ipynb: reproducible sampling and chunking workflow
- lightgbm_quality_importance.ipynb: baseline models predicting quality/importance
Columns:
- title: article title
- page_id: Wikipedia page identifier
- size: byte length of the current revision
- touched: last-touched timestamp (UTC, ISO 8601)
- internal_links_count, external_links_count, langlinks_count, images_count, redirects_count: MediaWiki API structural metrics
- protection_level: current protection status (e.g. unprotected, semi-protected)
- official_quality: human label (Stub, Start, C, B, GA, A, FA, etc.)
- official_quality_score: numeric mapping of official_quality (Stub=1, Start=2, C=3, B=4, GA=5, A=6, FA=7, 8–10 for rare higher tiers)
- official_importance: human label (Low, Mid, High, Top, etc.)
- official_importance_score: numeric mapping of the importance label (Low=1, Mid=3, High=5, Top=8, 10 for special tiers)
- categories, templates: pipe-delimited lists of categories/templates (UTF-8 sanitised)
Tip: stream the full dataset with a chunked reader (see the pandas sketch below).
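A minimal chunked-streaming sketch, assuming pandas is available and full.csv sits in the working directory (file and column names are taken from the listing above):

```python
from collections import Counter
import pandas as pd

quality_counts = Counter()
for chunk in pd.read_csv("full.csv", chunksize=100_000):
    # Tally the human-assigned quality labels chunk by chunk so the ~1.5M-row
    # file never has to fit in memory at once.
    quality_counts.update(chunk["official_quality"].value_counts().to_dict())

print(quality_counts)
```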
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.
There are a total of 9 files, all in TSV format, organised under a relational structure. The core of the dataset is the page file; around it are 4 files with entities related to Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as intermediate tables, connecting pages both to those entities and to other pages (the page_category, page_url, page_pub and page_link files).
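As a rough illustration of this relational layout, a minimal pandas sketch joining pages to their categories through the intermediate table (the key column names page_id and category_id are assumptions; the actual names are given in Dataset_summary):

```python
import pandas as pd

page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

# page_category acts as the intermediate table linking pages to categories.
pages_with_categories = (
    page.merge(page_category, on="page_id")
        .merge(category, on="category_id")
)
print(pages_with_categories.head())
```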
The document Dataset_summary includes a detailed description of the dataset.
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke its record for the most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read, and in some cases also contribute, knowledge, information and data about the virus in an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how was the scientific research in this field represented on Wikipedia. Using citations as a readout, we try to map how COVID-19 related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge ecosphere during the pandemic.

To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. Second, we ran a EuroPMC search on COVID-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search yielded 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs.

Next, from this corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. We then computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method produced tables of annotated citations and information extracted from each Wikipedia article, such as books, websites and newspapers.

Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code and the other bash/Python utility scripts related to this project.
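As a simplified illustration of the identifier-extraction step (not the authors' actual patterns, which live in the WikiCitationHistoRy repository), a short Python sketch:

```python
import re

# Simplified patterns for illustration only; real-world DOI/PMID matching
# needs more care around trailing punctuation and template syntax.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|\]}>]+", re.IGNORECASE)
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d{4,9})", re.IGNORECASE)

def extract_identifiers(wikitext: str) -> dict:
    """Return the DOIs and PMIDs found in a Wikipedia article's wikitext."""
    return {
        "dois": sorted(set(DOI_RE.findall(wikitext))),
        "pmids": sorted(set(PMID_RE.findall(wikitext))),
    }

# Hypothetical citation snippet used only to demonstrate the patterns.
print(extract_identifiers("{{cite journal |doi=10.1000/xyz123 |pmid=12345678}}"))
```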
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Apache Hadoop is the central software project, alongside Apache Solr and Apache Lucene (SW, software). Companies that offer Hadoop distributions and Hadoop-based solutions are the central companies in the scope of the study (HV, hardware vendors). Other companies that started very early with Hadoop-related projects are early adopters (EA). Global players (GP) are affected by this emerging market, its opportunities and the new competitors (NC). Some new but highly relevant companies, such as Talend or LucidWorks, were selected because of their clear commitment to open source ideas. Widely adopted technologies related to the selected research topic are represented by the group TEC.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Key Columns and Metrics:
- Country: The name of the country.
- Total in km2: Total area of the country.
- Land in km2: Land area excluding water bodies.
- Water in km2: Area covered by water bodies.
- Water %: Percentage of the total area covered by water.
- HDI: Human Development Index, a measure of a country's overall achievement in its social and economic dimensions.
- %HDI Growth: Percentage growth in HDI.
- IMF Forecast GDP(Nominal): International Monetary Fund's forecast for Gross Domestic Product in nominal terms.
- World Bank Forecast GDP(Nominal): World Bank's forecast for Gross Domestic Product in nominal terms.
- UN Forecast GDP(Nominal): United Nations' forecast for Gross Domestic Product in nominal terms.
- IMF Forecast GDP(PPP): IMF's forecast for Gross Domestic Product in purchasing power parity terms.
- World Bank Forecast GDP(PPP): World Bank's forecast for Gross Domestic Product in purchasing power parity terms.
- CIA Forecast GDP(PPP): Central Intelligence Agency's forecast for Gross Domestic Product in purchasing power parity terms.
- Internet Users: Number of internet users in the country.
- UN Continental Region: Continental region classification by the United Nations.
- UN Statistical Subregion: Statistical subregion classification by the United Nations.
- Population 2022: Population of the country in the year 2022.
- Population 2023: Population of the country in the year 2023.
- Population %Change: Percentage change in population from 2022 to 2023.
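A brief pandas sketch of working with the columns listed above, recomputing two of the derived percentages as a consistency check (the file name countries.csv and the exact header spellings are assumptions):

```python
import pandas as pd

# Hypothetical export of the table described above.
df = pd.read_csv("countries.csv")

# Recompute two derived columns from their raw inputs as a sanity check.
df["Water % (check)"] = 100 * df["Water in km2"] / df["Total in km2"]
df["Population %Change (check)"] = (
    100 * (df["Population 2023"] - df["Population 2022"]) / df["Population 2022"]
)

print(df[["Country", "Water % (check)", "Population %Change (check)"]].head())
```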
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset aggregates the 100 most popular Wikipedia articles by pageviews, enabling the tracking of trending topics on Wikipedia.
The data begins in 2016, and the textual data is presented as it appears on the Wikipedia website.
- rank: Rank of the article (out of 100).
- article: Title of the article.
- views: Number of pageviews (across all platforms).
- date: Date of the pageviews.
This dataset is updated daily with new data sourced from the Wikimedia API.
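A minimal sketch of pulling one day of top-article data from the Wikimedia pageviews API that the listing names as its source (the date and User-Agent string are placeholders):

```python
import requests

# Top most-viewed English Wikipedia articles for one (placeholder) day.
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "en.wikipedia/all-access/2023/01/01"
)
resp = requests.get(url, headers={"User-Agent": "example-top-articles/0.1"})
resp.raise_for_status()

for entry in resp.json()["items"][0]["articles"][:10]:
    print(entry["rank"], entry["article"], entry["views"])
```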
Terms of service: https://semrush.ebundletools.com/company/legal/terms-of-service/
wikipedia.de is ranked #1885 in DE with 1.21M traffic.
Privacy policy: https://www.archivemarketresearch.com/privacy-policy
The Wiki Software market was valued at USD 985 million in 2024 and is projected to reach USD XXX million by 2033, at an expected CAGR of XX% over the forecast period.
Privacy policy: https://www.archivemarketresearch.com/privacy-policy
The Hosted Wiki Platform market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, at an expected CAGR of XX% over the forecast period.
Terms of service: https://semrush.ebundletools.com/company/legal/terms-of-service/
bg-wiki.com is ranked #12424 in US with 1M traffic. Categories: Computer and Video Games.
Terms of service: https://sr01.toolswala.net/_www/company/legal/terms-of-service/
wavu.wiki is ranked #5158 in JP with 866.03K traffic. Categories: Online Services.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
hololive.wiki is ranked #68146 in US with 195.53K traffic. Categories: Retail.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
emojis.wiki is ranked #92688 in US with 513.36K traffic. Categories: Online Services.
This dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression, published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality that is amenable to statistical analysis and well calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository, please let me know via an email to nathante@uw.edu.

Replicating the Analysis from the OpenSym Paper

This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset; they can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages, and the prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages, making it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz; extracting this with tar xzvf code.tar.gz is the first step.

Getting Set Up

You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make, installed as they should be on any Unix system. To install brms you need a working C++ compiler; if you run into trouble, see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring, which was done on Ubuntu 18.04.5, and building, which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then install the following:

Python Packages

Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install.

R Packages

Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the instructions for installing Rstan.

Drawing a Sample of Labeled Articles

I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive; those only interested in replicating the models and analyses should skip this section.

Extracting Metadata from Wikipedia Dumps

Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak; for transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

Obtaining Quality Labels for Articles

We obtain up-to-date labels for each article using the articlequality Python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the XML dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

Taking a Sample of Quality Labels

I used Apache Spark to merge the metadata from wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
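The Spark job itself is part of the archived code; purely as a hedged sketch of the balanced-sampling idea (the input path and column name below are hypothetical), a PySpark snippet along these lines draws a roughly equal number of articles per quality class:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-label-sample").getOrCreate()

# Hypothetical joined table of wikiq metadata and quality labels.
labeled = spark.read.parquet("labeled_articles.parquet")

target_per_class = 5000  # hypothetical per-class sample size

# Compute per-class sampling fractions so each quality class is roughly
# equally represented in the output.
class_counts = {row["quality_label"]: row["count"]
                for row in labeled.groupBy("quality_label").count().collect()}
fractions = {label: min(1.0, target_per_class / n)
             for label, n in class_counts.items()}

sample = labeled.stat.sampleBy("quality_label", fractions, seed=42)
sample.write.mode("overwrite").parquet("balanced_sample.parquet")
```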
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
vrpirates.wiki is ranked #111416 in US with 233.33K traffic.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Errata

Please note that this dataset includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting, and this metadata will eventually be unpublished as well. A short list of issues discovered:
- Many dumps were truncated (T345176).
- Pages appeared multiple times, with different revision numbers.
- Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
- Reference similarity was overcounted when more than two refs shared content.
In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.

Overview

This data was produced by Wikimedia Germany's Technical Wishes team and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully-parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context. Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. Gathering the map data was to understand the actual impact of improvements made to how external data can be integrated in maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how reference and map tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

License

All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/. The source code is distributed under BSD-3-Clause.

Source code and execution

The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs". Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may potentially be repeated in the per-page summary files, and these pages' statistics duplicated in the aggregates. Whatever the cause, we've found many of these duplicate errors, and counts are given in the "duplicates.txt" file. The program is pluggable and configurable; it can be extended by writing new analysis modules. Our team plans to continue development and to run it again in the near future to track the evolution of the collected metrics over time.

Format

All fields are documented in metrics.md as part of the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.

Per-page summary files

The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia), and each line of the file is a JSON object corresponding to a page. Example file name: enwiki-20230601-page-summary.ndjson.gz. Example metrics:
- How many reference tags are created from templates vs. directly in the article.
- How many references contain a template transclusion to produce their content.
- How many references are unnamed, automatically named, or manually named.
- How often references are reused via their name.
- Copy-pasted references that share the same or almost the same content, on the same page.
- Whether an article has more than one reference list.

Mapdata files

Example file name: enwiki-20230601-mapdata.ndjson.gz. These files give the count of different types of map "external data" on each page. A line will either be empty ("{}") or it will include the revid and the number of external data references for maps on that page. External data is tallied in 9 different buckets: "page", meaning that the source is .map data from the Wikimedia Commons server, or geoline / geoshape / geomask / geopoint combined with the data source, either an "ids" (Wikidata Q-ID) or "query" (SPARQL query) source.

Mapdata summary files

Each wiki has a summary of map external data counts, which contains a sum for each type count. Example file name: enwiki-20230601-mapdata-summary.json

Wiki summary files

Per-page statistics are rolled up to the wiki level, and results are stored in a separate file for each wiki. Some statistics are summed, some are averaged; check the suffix on the column name for a hint. Example file name: enwiki-20230601-summary.json

Top-level summary file

There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting as a CSV for ease of use: all-wikis-20230601-summary.csv
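A hedged Python sketch of tallying the map external-data buckets from one per-wiki mapdata file (the exact field names are documented in metrics.md; treating every non-revid field as a bucket count is an assumption):

```python
import gzip
import json
from collections import Counter

totals = Counter()
with gzip.open("enwiki-20230601-mapdata.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if not record:          # empty "{}" lines mean no map data on the page
            continue
        for key, value in record.items():
            if key != "revid":  # assume the remaining fields are bucket counts
                totals[key] += value

print(totals.most_common())
```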
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
rutracker.wiki is ranked #23616 in RU with 145.14K traffic. Categories: Computer Software and Development, Information Technology, Online Services.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
wiki.cs.money is ranked #448 in RU with 2.32M traffic.