Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Climate data obtained from Wikipedia climate boxes. The scraping code is on GitHub.
The data consists of cities with a population over 10,000. Not all cities have climate data. All values are normalized to metric units. Available variables vary by city, but common ones include temperatures (mean, low, high), humidity, and precipitation. Values are reported per month, with an additional yearly aggregate whose meaning depends on context (average, standard deviation, sum, etc.). The data is not historical: values are aggregates, and the aggregation periods vary widely between cities.
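A hypothetical sketch of recomputing the yearly aggregate from the monthly rows, as described above. The file name and column names ("city", "temp_mean_c", "precip_mm") are placeholders, since the actual schema is not documented here.

```python
# Placeholder file/column names; adjust to the real schema of the dataset.
import pandas as pd

df = pd.read_csv("wikipedia_climate.csv")   # assumed file name

# Monthly rows -> one yearly aggregate per city; mean temperature averages,
# while precipitation sums, mirroring the context-dependent aggregation above.
yearly = (
    df.groupby("city")
      .agg(temp_mean_c=("temp_mean_c", "mean"),
           precip_mm=("precip_mm", "sum"))
      .reset_index()
)
print(yearly.head())
```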
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets include lists of over 43 million Wikipedia articles in 55 languages with quality scores by WikiRank (https://wikirank.net). Additionally, the datasets contain the quality measures (metrics) which directly affect these scores. The quality measures were extracted from Wikipedia dumps from April 2022.
License: All files included in these datasets are released under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
Format:
page_id -- the identifier of the Wikipedia article (int), e.g. 840191
page_name -- the title of the Wikipedia article (UTF-8), e.g. Sagittarius A*
wikirank_quality -- quality score for the Wikipedia article on a scale of 0-100 (as of April 1, 2022); a synthetic measure calculated from the metrics below (also included in the datasets)
norm_len -- normalized "page length"
norm_refs -- normalized "number of references"
norm_img -- normalized "number of images"
norm_sec -- normalized "number of sections"
norm_reflen -- normalized "references per length ratio"
norm_authors -- normalized "number of authors" (excluding bots and anonymous users)
flawtemps -- flaw templates
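A minimal sketch of loading one language's file and ranking articles by the synthetic quality score. The file name is an assumption; the column names follow the field list above.

```python
import pandas as pd

# Assumed file name for a single-language export of the dataset.
df = pd.read_csv("enwiki_wikirank_2022-04.csv")

# Highest-quality articles according to the 0-100 WikiRank score.
top = df.sort_values("wikirank_quality", ascending=False).head(10)
print(top[["page_name", "wikirank_quality", "norm_refs", "norm_len"]])
```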
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Datasets with quality scores for 47 million Wikipedia articles across 55 language versions by WikiRank, as of 1 August 2024.
More information about the quality score can be found in scientific papers:
Traffic analytics, rankings, and competitive metrics for wikipedia.org as of October 2025
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a comprehensive snapshot of global country statistics for the year 2023. It was scraped from various Wikipedia pages using BeautifulSoup, consolidating key indicators and metrics for 142 countries. The dataset covers diverse aspects such as land area, water area, Human Development Index (HDI), GDP forecasts, internet usage, and population changes.
The dataset is sourced from various Wikipedia pages using BeautifulSoup, providing a consolidated and accessible resource for individuals interested in global country statistics. It spans a wide range of topics, making it a valuable asset for exploratory data analysis and research in fields such as economics, demographics, and international relations.
Feel free to explore and analyze this dataset to gain insights into the socio-economic dynamics of countries worldwide.
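An illustrative sketch of the kind of BeautifulSoup scraping described above. The target page and table structure are assumptions for demonstration, not the authors' actual scraping code.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example page; the dataset combined several such Wikipedia pages.
url = "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index"
html = requests.get(url, headers={"User-Agent": "country-stats-demo/0.1"}).text
soup = BeautifulSoup(html, "html.parser")

# Grab the first data table and flatten its rows into lists of cell strings.
table = soup.find("table", class_="wikitable")
rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)
print(rows[:3])
```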
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Phase 1 snapshot of English Wikipedia articles collected in November 2025. Each row includes the human-maintained WikiProject quality and importance labels (e.g. Stub→FA, Low→Top), along with structural features gathered from the MediaWiki API. The corpus is designed for training quality estimators, monitoring coverage, and prioritising editorial workflows.
full.csv: complete dataset (~1.5M rows)
full_part01.csv – full_part15.csv: 100k-row chunks (the final file contains the remainder)
sample_10k.csv, sample_100k.csv: stratified samples for quick experimentation
prepare_kaggle_release.ipynb: reproducible sampling and chunking workflow
lightgbm_quality_importance.ipynb: baseline models predicting quality/importance
title: article title
page_id: Wikipedia page identifier
size: byte length of the current revision
touched: last touched timestamp (UTC, ISO 8601)
internal_links_count, external_links_count, langlinks_count, images_count, redirects_count: MediaWiki API structural metrics
protection_level: current protection status (e.g. unprotected, semi-protected)
official_quality: human label (Stub, Start, C, B, GA, A, FA, etc.)
official_quality_score: numeric mapping of official_quality (Stub=1, Start=2, C=3, B=4, GA=5, A=6, FA=7, 8–10 for rare higher tiers)
official_importance: human label (Low, Mid, High, Top, etc.)
official_importance_score: numeric mapping of the importance label (Low=1, Mid=3, High=5, Top=8, 10 for special tiers)
categories, templates: pipe-delimited lists of categories/templates (UTF-8 sanitised)
Use chunksize when streaming the full dataset.
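A minimal sketch of streaming full.csv with pandas chunksize, as the note above suggests. The 100k chunk size and the quality-label tally are illustrative choices; the column name official_quality comes from the field list.

```python
import pandas as pd
from collections import Counter

quality_counts = Counter()
# Stream the ~1.5M-row file in 100k-row chunks to keep memory bounded.
for chunk in pd.read_csv("full.csv", chunksize=100_000):
    quality_counts.update(chunk["official_quality"].dropna())

print(quality_counts.most_common())  # e.g. counts of Stub/Start/C/B/GA/FA labels
```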
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Fixes in version 1.1 (= Zenodo's "version 2")
* In 20161101-revisions-part1-12-1728.csv, the missing first data line was added.
* In the Current_content and Deleted_content files, some token values ('str' column) that contain regular quotes ('"') were fixed.
* In the Current_content and Deleted_content files, some incorrect revision ID values in the 'origin_rev_id', 'in' and 'out' columns were fixed.
This dataset contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history.
This data would be exceedingly hard for an average user to create, as (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and going beyond, to be generated at complete-Wikipedia scale, ensuring quality and allowing researchers to forego the expensive text-comparison computation that has so far hindered scalable usage.
This dataset, its creation process and use cases are described in a dedicated dataset paper of the same name, published at the ICWSM 2017 conference. In this paper, we show how this data enables, on token level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts, re-additions and other metrics.
Tokenization used: https://gist.github.com/faflo/3f5f30b1224c38b1836d63fa05d1ac94
Toy example for how the token metadata is generated: https://gist.github.com/faflo/8bd212e81e594676f8d002b175b79de8
Be sure to read the ReadMe.txt or, for even more detail, the supporting paper referenced under "related identifiers".
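A hedged sketch of a simple token-level provenance analysis of the kind mentioned above. The file name is assumed (the real per-part naming is described in the ReadMe); it relies only on the 'origin_rev_id' column noted in the fixes list.

```python
import pandas as pd

# Assumed file name for one part of the Current_content files.
content = pd.read_csv("current_content_part1.csv")

# Token-level provenance: how much currently surviving content each revision
# originally introduced, i.e. count tokens by the revision they were created in.
provenance = content.groupby("origin_rev_id").size().sort_values(ascending=False)
print(provenance.head(10))
```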
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Errata
Please note that this data set includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting and this metadata will eventually be unpublished as well. A short list of issues discovered:
Many dumps were truncated (T345176).
Pages appeared multiple times, with different revision numbers.
Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
Reference similarity was overcounted when more than two refs shared content.
In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.
Overview
This data was produced by Wikimedia Germany's Technical Wishes team, and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully-parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context.
Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. Gathering the map data was to understand the actual impact of improvements made to how external data can be integrated in maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how the reference and map tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/ The source code is distributed under BSD-3-Clause.
Source code and execution
The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs".
Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may be repeated in the per-page summary files, and these pages' statistics duplicated in the aggregates. Whatever the cause, we've found many of these duplicate errors, and counts are given in the "duplicates.txt" file.
The program is pluggable and configurable; it can be extended by writing new analysis modules. Our team plans to continue development and to run it again in the near future to track the evolution of the collected metrics over time.
Format
All fields are documented in metrics.md as part of the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.
Per-page summary files
The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia) and each line of the file is a JSON object corresponding to a page.
Example file name: enwiki-20230601-page-summary.ndjson.gz
Example metrics:
How many reference tags are created from templates vs. directly in the article.
How many references contain a template transclusion to produce their content.
How many references are unnamed, automatically named, or manually named.
How often references are reused via their name.
Copy-pasted references that share the same or almost the same content on the same page.
Whether an article has more than one reference list.
Mapdata files
Example file name: enwiki-20230601-mapdata.ndjson.gz
These files give the count of different types of map "external data" on each page. A line will either be empty "{}" or it will include the revid and the number of external data references for maps on that page.
External data is tallied in 9 different buckets, starting with "page", meaning that the source is .map data from the Wikimedia Commons server, or geoline / geoshape / geomask / geopoint combined with the data source, either an "ids" (Wikidata Q-ID) or "query" (SPARQL query) source.
Mapdata summary files
Each wiki has a summary of map external data counts, which contains a sum for each type count.
Example file name: enwiki-20230601-mapdata-summary.json
Wiki summary files
Per-page statistics are rolled up to the wiki level, and the results are stored in a separate file for each wiki. Some statistics are summed, some are averaged; check the suffix on the column name for a hint.
Example file name: enwiki-20230601-summary.json
Top-level summary file
There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting as a CSV for ease of use: all-wikis-20230601-summary.csv
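A sketch of streaming one per-page summary file. The file name follows the example above, but the JSON field name used in the tally ("ref_count") is a hypothetical placeholder; the authoritative field list is in metrics.md in the code repository.

```python
import gzip
import json

path = "enwiki-20230601-page-summary.ndjson.gz"
pages_with_refs = 0
total_pages = 0

with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)            # one JSON object per page
        total_pages += 1
        if record.get("ref_count", 0) > 0:   # hypothetical field name
            pages_with_refs += 1

print(f"{pages_with_refs}/{total_pages} pages contain at least one reference")
```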
Data Explorer enables easy tracking of metrics about computer and Internet use over time. Simply choose a metric of interest from the drop-down menu. The default Map mode depicts percentages by state, while Chart mode allows metrics to be broken down by demographics and viewed as either percentages of the population or estimated numbers of people or households.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Wikipedia web pages in different languages are rarely linked except for the cross-lingual link between web pages about the same subject. Collected in June 2010, this data collection consists of 10GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Related Information). Data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval, to enhance cross-lingual link discovery (a way of automatically finding potential links between documents written in different languages). Through cross-lingual link discovery, users are able to discover documents in languages which they are either familiar with, or which have a richer set of documents than in their language of choice.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
15 January 2020 - A Map of Science, v. 1.0
Description: A network which shows the similarities among different branches of science. It is based on the Wikipedia pages in the outlines of natural, formal, social and applied sciences, plus Data Science, which is not yet included in those outlines (18 Jan. 2020). All pages called "Outline of X" were ignored. Pages are preprocessed with regular expressions to extract the main content, followed by stop-word removal and lemmatization with WordNetLemmatizer in NLTK. Edges represent cosine similarity and are filtered by z-score, keeping only edges with a z-score > 1.959964. Isolated nodes were removed.
Materials: R, Python, igraph, NLTK, d3, JavaScript, HTML, Wikipedia
Contact: Alberto Calderone - sinnefa@gmail.com
Preview: http://www.sinnefa.com/wikipediasciencemap/
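A minimal sketch of the edge-filtering idea described above: compute pairwise cosine similarity between page texts, z-score the similarities, and keep only edges above 1.959964. The three toy documents and the TF-IDF vectorization are stand-ins for the authors' actual NLTK-based preprocessing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "Physics": "energy matter motion force field quantum relativity",
    "Chemistry": "matter reaction element compound molecule bond energy",
    "Sociology": "society group behaviour institution culture interaction",
}

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs.values())
sim = cosine_similarity(tfidf)

# z-score the off-diagonal similarities and keep only strong edges.
iu = np.triu_indices_from(sim, k=1)
vals = sim[iu]
z = (vals - vals.mean()) / vals.std()

names = list(docs)
edges = [
    (names[i], names[j], round(float(s), 3))
    for i, j, s, zz in zip(iu[0], iu[1], vals, z)
    if zz > 1.959964
]
print(edges)  # with only three toy documents this will usually be empty
```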
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by scraping Wikipedia to compile a list of the top 100 companies in the USA. The data includes key information such as company names, industry sectors, revenue figures, and headquarters locations. The dataset captures the most recent rankings of these companies based on metrics like annual revenue, market capitalization, or employee size, as listed on Wikipedia. The dataset serves as a valuable resource for analyzing trends in the U.S. corporate landscape, including industry dominance and geographic distribution of major corporations.
The US Geological Survey (USGS) resource assessment (Williams et al., 2009) outlined a mean 30 GWe of undiscovered hydrothermal resource in the western US. One goal of the Geothermal Technologies Office (GTO) is to accelerate the development of this undiscovered resource. The Geothermal Technologies Program (GTP) Blue Ribbon Panel (GTO, 2011) recommended that DOE focus efforts on helping industry identify hidden geothermal resources to increase geothermal capacity in the near term. Increased exploration activity will produce more prospects, more discoveries, and more readily developable resources. Detailed exploration case studies akin to those found in oil and gas (e.g. Beaumont et al., 1990) will give operators a single point of information to gather clean, unbiased information on which to build geothermal drilling prospects. To support this effort, the National Renewable Energy Laboratory (NREL) has been working with the Department of Energy (DOE) to develop a template for geothermal case studies on the Geothermal Gateway on OpenEI. In fiscal year 2013, the template was developed and tested with two case studies: Raft River Geothermal Area (http://en.openei.org/wiki/Raft_River_Geothermal_Area) and Coso Geothermal Area (http://en.openei.org/wiki/Coso_Geothermal_Area). In fiscal year 2014, ten additional case studies were completed, and additional features were added to the template to allow for more data and direct citations of data. The template allows for:
Data - a variety of data can be collected for each area, including power production information, well field information, geologic information, reservoir information, and geochemistry information.
Narratives - general (e.g. area overview, history and infrastructure), technical (e.g. exploration history, well field description, R&D activities) and geologic narratives (e.g. area geology, hydrothermal system, heat source, geochemistry).
Exploration Activity Catalog - a catalog of exploration activities conducted in the area (with dates and references).
NEPA Analysis - a query of NEPA analyses conducted in the area (that have been catalogued in the OpenEI NEPA database).
In fiscal year 2015, NREL is working with universities to populate additional case studies on OpenEI. The goal is to provide a large enough dataset to start conducting analyses of exploration programs to identify correlations between successful exploration plans for areas with similar geologic occurrence models.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
OpenStreetMap contains roughly 754.8 thousand km of roads in this region. Based on AI-mapped estimates, this is approximately 73% of the total road length in the dataset region. The average age of data for the region is 3 years (last edited 9 days ago) and 8% of roads were added or updated in the last 6 months. Read about what this summary means: indicators, metrics.
This theme includes all OpenStreetMap features in this area matching the following filter (learn what tags mean here):
tags['highway'] IS NOT NULL
Features may have these attributes:
This dataset is one of many OpenStreetMap exports on HDX (https://data.humdata.org/organization/hot). See the Humanitarian OpenStreetMap Team website for more information.
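A hedged sketch of applying the theme filter above (tags['highway'] IS NOT NULL) to an export. HDX OSM exports are commonly distributed as GeoJSON or shapefiles; the file name here is an assumption.

```python
import geopandas as gpd

# Assumed export file name; adjust to the actual HDX download.
roads = gpd.read_file("hotosm_region_roads.geojson")

# Keep only features that actually carry a highway tag.
roads = roads[roads["highway"].notna()]

# Rough total length in km; Web Mercator distorts lengths, so a projection
# suited to the region would give a more accurate figure.
total_km = roads.to_crs(epsg=3857).geometry.length.sum() / 1000
print(f"{total_km:,.0f} km of highway-tagged features")
```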
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To overcome the frequently debated crisis of confidence, replicating studies is becoming increasingly common. Multiple frequentist and Bayesian measures have been proposed to evaluate whether a replication is successful, but little is known about which method best captures replication success. This study is one of the first attempts to compare a number of quantitative measures of replication success with respect to their ability to draw the correct inference when the underlying truth is known, while taking publication bias into account. Our results show that Bayesian metrics seem to slightly outperform frequentist metrics across the board. Generally, meta-analytic approaches seem to slightly outperform metrics that evaluate single studies, except in the scenario of extreme publication bias, where this pattern reverses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data generated from an old, private fork of the CODES simulation toolkit: https://github.com/codes-org/codes
Data dictionary: https://github.com/kevinabrown/codes/wiki/Dragonfly-Dally-DEBUG-Metrics
The National Renewable Energy Laboratory (NREL) was tasked with developing a metric in 2012 to measure the impacts of RD&D funding on the cost and time required for geothermal exploration activities. The development of this cost and time metric included collecting cost and time data for exploration techniques, creating a baseline suite of exploration techniques to which future exploration cost and time improvements can be compared, and developing an online tool for graphically showing potential project impacts (all available at http://en.openei.org/wiki/Gateway: Geothermal). This paper describes the methodology used to define the baseline exploration suite of techniques (baseline), as well as the approach that was used to create the cost and time data set that populates the baseline. The resulting product, an online tool for measuring impact, and the aggregated cost and time data are available on the Open Energy Information website (OpenEI, http://en.openei.org) for public access.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzed 400 m running performance metric data from the fastest athlete with bilateral leg amputations
License: https://choosealicense.com/licenses/other/
Drive Stats
Drive Stats is a public data set of daily metrics on the hard drives in Backblaze's cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset, effectively logging daily statistics that, once written, are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
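A sketch of loading the dataset from the Hugging Face Hub page linked above. The repository id comes from that URL; the split name and the streaming option are assumptions chosen to avoid downloading all 388M+ records.

```python
from datasets import load_dataset

# "train" split is an assumption; check the dataset page for the actual splits.
ds = load_dataset("backblaze/Drive_Stats", split="train", streaming=True)

# Peek at a few daily drive records without materializing the full dataset.
for i, row in enumerate(ds):
    print(row)
    if i == 2:
        break
```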
This API returns broadband summary data for the entire United States. It is designed to retrieve broadband summary data and census metrics (population or households) combined as search criteria. The data includes wireline and wireless providers, different technologies and broadband speeds reported in the particular area being searched for, on a scale of 0 to 1.