License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset of the most viewed Wikipedia pages was extracted from a Flourish visualisation.
Traffic analytics, rankings, and competitive metrics for wikipedia.org as of October 2025.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Phase 1 snapshot of English Wikipedia articles collected in November 2025. Each row includes the human-maintained WikiProject quality and importance labels (e.g. Stub→FA, Low→Top), along with structural features gathered from the MediaWiki API. The corpus is designed for training quality estimators, monitoring coverage, and prioritising editorial workflows.
Files:
- full.csv: complete dataset (~1.5M rows)
- full_part01.csv – full_part15.csv: 100k-row chunks (the final file contains the remainder)
- sample_10k.csv, sample_100k.csv: stratified samples for quick experimentation
- prepare_kaggle_release.ipynb: reproducible sampling and chunking workflow
- lightgbm_quality_importance.ipynb: baseline models predicting quality/importance
Columns:
- title: article title
- page_id: Wikipedia page identifier
- size: byte length of the current revision
- touched: last-touched timestamp (UTC, ISO 8601)
- internal_links_count, external_links_count, langlinks_count, images_count, redirects_count: MediaWiki API structural metrics
- protection_level: current protection status (e.g. unprotected, semi-protected)
- official_quality: human label (Stub, Start, C, B, GA, A, FA, etc.)
- official_quality_score: numeric mapping of official_quality (Stub=1, Start=2, C=3, B=4, GA=5, A=6, FA=7, 8–10 for rare higher tiers)
- official_importance: human label (Low, Mid, High, Top, etc.)
- official_importance_score: numeric mapping of the importance label (Low=1, Mid=3, High=5, Top=8, 10 for special tiers)
- categories, templates: pipe-delimited lists of categories/templates (UTF-8 sanitised)
Tip: stream the full dataset with a chunked reader (see the pandas sketch below).
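A minimal chunked-streaming sketch, assuming pandas is available and full.csv sits in the working directory (file and column names are taken from the listing above):

```python
from collections import Counter
import pandas as pd

quality_counts = Counter()
for chunk in pd.read_csv("full.csv", chunksize=100_000):
    # Tally the human-assigned quality labels chunk by chunk so the ~1.5M-row
    # file never has to fit in memory at once.
    quality_counts.update(chunk["official_quality"].value_counts().to_dict())

print(quality_counts)
```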
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.
There are a total of 9 files, all in TSV format, organised under a relational structure. The core of the dataset is the page file; around it are 4 files with entities related to Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as intermediate tables, connecting pages both to those entities and to other pages (the page_category, page_url, page_pub and page_link files).
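As a rough illustration of this relational layout, a minimal pandas sketch joining pages to their categories through the intermediate table (the key column names page_id and category_id are assumptions; the actual names are given in Dataset_summary):

```python
import pandas as pd

page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

# page_category acts as the intermediate table linking pages to categories.
pages_with_categories = (
    page.merge(page_category, on="page_id")
        .merge(category, on="category_id")
)
print(pages_with_categories.head())
```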
The document Dataset_summary includes a detailed description of the dataset.
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke its record for the most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read, and in some cases also contribute, knowledge, information and data about the virus in an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how was the scientific research in this field represented on Wikipedia. Using citations as a readout, we try to map how COVID-19 related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge ecosphere during the pandemic.

To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. Second, we ran a EuroPMC search on COVID-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search yielded 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs.

Next, from this corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. We then computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method produced tables of annotated citations and information extracted from each Wikipedia article, such as books, websites and newspapers.

Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code and the other bash/Python utility scripts related to this project.
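As a simplified illustration of the identifier-extraction step (not the authors' actual patterns, which live in the WikiCitationHistoRy repository), a short Python sketch:

```python
import re

# Simplified patterns for illustration only; real-world DOI/PMID matching
# needs more care around trailing punctuation and template syntax.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|\]}>]+", re.IGNORECASE)
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d{4,9})", re.IGNORECASE)

def extract_identifiers(wikitext: str) -> dict:
    """Return the DOIs and PMIDs found in a Wikipedia article's wikitext."""
    return {
        "dois": sorted(set(DOI_RE.findall(wikitext))),
        "pmids": sorted(set(PMID_RE.findall(wikitext))),
    }

# Hypothetical citation snippet used only to demonstrate the patterns.
print(extract_identifiers("{{cite journal |doi=10.1000/xyz123 |pmid=12345678}}"))
```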
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Apache Hadoop is the central software project, alongside Apache Solr and Apache Lucene (SW, software). Companies that offer Hadoop distributions and Hadoop-based solutions are the central companies in the scope of the study (HV, hardware vendors). Other companies that started very early with Hadoop-related projects are early adopters (EA). Global players (GP) are affected by this emerging market, its opportunities and the new competitors (NC). Some new but highly relevant companies, such as Talend or LucidWorks, were selected because of their clear commitment to open source ideas. Widely adopted technologies related to the selected research topic are represented by the group TEC.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Key Columns and Metrics:
- Country: The name of the country.
- Total in km2: Total area of the country.
- Land in km2: Land area excluding water bodies.
- Water in km2: Area covered by water bodies.
- Water %: Percentage of the total area covered by water.
- HDI: Human Development Index, a measure of a country's overall achievement in its social and economic dimensions.
- %HDI Growth: Percentage growth in HDI.
- IMF Forecast GDP(Nominal): International Monetary Fund's forecast for Gross Domestic Product in nominal terms.
- World Bank Forecast GDP(Nominal): World Bank's forecast for Gross Domestic Product in nominal terms.
- UN Forecast GDP(Nominal): United Nations' forecast for Gross Domestic Product in nominal terms.
- IMF Forecast GDP(PPP): IMF's forecast for Gross Domestic Product in purchasing power parity terms.
- World Bank Forecast GDP(PPP): World Bank's forecast for Gross Domestic Product in purchasing power parity terms.
- CIA Forecast GDP(PPP): Central Intelligence Agency's forecast for Gross Domestic Product in purchasing power parity terms.
- Internet Users: Number of internet users in the country.
- UN Continental Region: Continental region classification by the United Nations.
- UN Statistical Subregion: Statistical subregion classification by the United Nations.
- Population 2022: Population of the country in the year 2022.
- Population 2023: Population of the country in the year 2023.
- Population %Change: Percentage change in population from 2022 to 2023.
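A brief pandas sketch of working with the columns listed above, recomputing two of the derived percentages as a consistency check (the file name countries.csv and the exact header spellings are assumptions):

```python
import pandas as pd

# Hypothetical export of the table described above.
df = pd.read_csv("countries.csv")

# Recompute two derived columns from their raw inputs as a sanity check.
df["Water % (check)"] = 100 * df["Water in km2"] / df["Total in km2"]
df["Population %Change (check)"] = (
    100 * (df["Population 2023"] - df["Population 2022"]) / df["Population 2022"]
)

print(df[["Country", "Water % (check)", "Population %Change (check)"]].head())
```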
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset aggregates the 100 most popular Wikipedia articles by pageviews, enabling the tracking of trending topics on Wikipedia.
The data begins in 2016, and the textual data is presented as it appears on the Wikipedia website.
- rank: Rank of the article (out of 100).
- article: Title of the article.
- views: Number of pageviews (across all platforms).
- date: Date of the pageviews.
This dataset is updated daily with new data sourced from the Wikimedia API.
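A minimal sketch of pulling one day of top-article data from the Wikimedia pageviews API that the listing names as its source (the date and User-Agent string are placeholders):

```python
import requests

# Top most-viewed English Wikipedia articles for one (placeholder) day.
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "en.wikipedia/all-access/2023/01/01"
)
resp = requests.get(url, headers={"User-Agent": "example-top-articles/0.1"})
resp.raise_for_status()

for entry in resp.json()["items"][0]["articles"][:10]:
    print(entry["rank"], entry["article"], entry["views"])
```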
Terms of service: https://semrush.ebundletools.com/company/legal/terms-of-service/
wikipedia.de is ranked #1885 in DE with 1.21M traffic.
Privacy policy: https://www.archivemarketresearch.com/privacy-policy
The Wiki Software market was valued at USD 985 million in 2024 and is projected to reach USD XXX million by 2033, at an expected CAGR of XX% over the forecast period.
Privacy policy: https://www.archivemarketresearch.com/privacy-policy
The Hosted Wiki Platform market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, at an expected CAGR of XX% over the forecast period.
Terms of service: https://semrush.ebundletools.com/company/legal/terms-of-service/
bg-wiki.com is ranked #12424 in US with 1M traffic. Categories: Computer and Video Games.
Terms of service: https://sr01.toolswala.net/_www/company/legal/terms-of-service/
wavu.wiki is ranked #5158 in JP with 866.03K traffic. Categories: Online Services.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
hololive.wiki is ranked #68146 in US with 195.53K traffic. Categories: Retail.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
emojis.wiki is ranked #92688 in US with 513.36K traffic. Categories: Online Services.
This dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression, published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality that is amenable to statistical analysis and well calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository, please let me know via an email to nathante@uw.edu.

Replicating the Analysis from the OpenSym Paper

This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset; they can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages, and the prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages, making it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz; extracting this with tar xzvf code.tar.gz is the first step.

Getting Set Up

You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make, installed as they should be on any Unix system. To install brms you need a working C++ compiler; if you run into trouble, see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring, which was done on Ubuntu 18.04.5, and building, which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then install the following:

Python Packages

Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install.

R Packages

Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the instructions for installing Rstan.

Drawing a Sample of Labeled Articles

I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive; those only interested in replicating the models and analyses should skip this section.

Extracting Metadata from Wikipedia Dumps

Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak; for transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

Obtaining Quality Labels for Articles

We obtain up-to-date labels for each article using the articlequality Python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the XML dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

Taking a Sample of Quality Labels

I used Apache Spark to merge the metadata from wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
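The Spark job itself is part of the archived code; purely as a hedged sketch of the balanced-sampling idea (the input path and column name below are hypothetical), a PySpark snippet along these lines draws a roughly equal number of articles per quality class:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-label-sample").getOrCreate()

# Hypothetical joined table of wikiq metadata and quality labels.
labeled = spark.read.parquet("labeled_articles.parquet")

target_per_class = 5000  # hypothetical per-class sample size

# Compute per-class sampling fractions so each quality class is roughly
# equally represented in the output.
class_counts = {row["quality_label"]: row["count"]
                for row in labeled.groupBy("quality_label").count().collect()}
fractions = {label: min(1.0, target_per_class / n)
             for label, n in class_counts.items()}

sample = labeled.stat.sampleBy("quality_label", fractions, seed=42)
sample.write.mode("overwrite").parquet("balanced_sample.parquet")
```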
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
vrpirates.wiki is ranked #111416 in US with 233.33K traffic.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Errata

Please note that this dataset includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting, and this metadata will eventually be unpublished as well. A short list of issues discovered:
- Many dumps were truncated (T345176).
- Pages appeared multiple times, with different revision numbers.
- Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
- Reference similarity was overcounted when more than two refs shared content.
In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.

Overview

This data was produced by Wikimedia Germany's Technical Wishes team and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully-parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context. Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. Gathering the map data was to understand the actual impact of improvements made to how external data can be integrated in maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how reference and map tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

License

All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/. The source code is distributed under BSD-3-Clause.

Source code and execution

The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs". Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may potentially be repeated in the per-page summary files, and these pages' statistics duplicated in the aggregates. Whatever the cause, we've found many of these duplicate errors, and counts are given in the "duplicates.txt" file. The program is pluggable and configurable; it can be extended by writing new analysis modules. Our team plans to continue development and to run it again in the near future to track the evolution of the collected metrics over time.

Format

All fields are documented in metrics.md as part of the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.

Per-page summary files

The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia), and each line of the file is a JSON object corresponding to a page. Example file name: enwiki-20230601-page-summary.ndjson.gz. Example metrics:
- How many reference tags are created from templates vs. directly in the article.
- How many references contain a template transclusion to produce their content.
- How many references are unnamed, automatically named, or manually named.
- How often references are reused via their name.
- Copy-pasted references that share the same or almost the same content, on the same page.
- Whether an article has more than one reference list.

Mapdata files

Example file name: enwiki-20230601-mapdata.ndjson.gz. These files give the count of different types of map "external data" on each page. A line will either be empty ("{}") or it will include the revid and the number of external data references for maps on that page. External data is tallied in 9 different buckets: "page", meaning that the source is .map data from the Wikimedia Commons server, or geoline / geoshape / geomask / geopoint combined with the data source, either an "ids" (Wikidata Q-ID) or "query" (SPARQL query) source.

Mapdata summary files

Each wiki has a summary of map external data counts, which contains a sum for each type count. Example file name: enwiki-20230601-mapdata-summary.json

Wiki summary files

Per-page statistics are rolled up to the wiki level, and results are stored in a separate file for each wiki. Some statistics are summed, some are averaged; check the suffix on the column name for a hint. Example file name: enwiki-20230601-summary.json

Top-level summary file

There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting as a CSV for ease of use: all-wikis-20230601-summary.csv
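A hedged Python sketch of tallying the map external-data buckets from one per-wiki mapdata file (the exact field names are documented in metrics.md; treating every non-revid field as a bucket count is an assumption):

```python
import gzip
import json
from collections import Counter

totals = Counter()
with gzip.open("enwiki-20230601-mapdata.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if not record:          # empty "{}" lines mean no map data on the page
            continue
        for key, value in record.items():
            if key != "revid":  # assume the remaining fields are bucket counts
                totals[key] += value

print(totals.most_common())
```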
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
rutracker.wiki is ranked #23616 in RU with 145.14K traffic. Categories: Computer Software and Development, Information Technology, Online Services.
Terms of service: https://sem3.heaventechit.com/company/legal/terms-of-service/
wiki.cs.money is ranked #448 in RU with 2.32M traffic.