Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).
The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.
The WikiGrain Data is analyzed in the paper
Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.
===============================================================
Individual files (tables in comma-separated-values-format):
---------------------------------------------------------------
* article_info.csv contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.
- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.
- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.
- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.
- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
The file contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.
- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
The file contains the following variables:
- "id"
Article id.
- "is.FA"
Boolean indicator for whether the article is featured.
- "log1p.length"
Length measured by the number of bytes.
- "age"
Age measured by the time since the first edit.
- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.
- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.
- "log1p.number.of.contributors"
Number of unique contributors to the article.
- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').
- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').
- "number.of.level.1.sections"
Number of first level sections in the article.
- "number.of.level.2.sections"
Number of second level sections in the article.
- "number.of.categories"
Number of categories the article is in.
- "log1p.average.size.of.categories"
Average size of the categories the article is in.
- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.
- "log1p.number.of.external.references"
Number of external references given in the article.
- "log1p.number.of.images"
Number of images in the article.
- "log1p.number.of.templates"
Number of templates that the article uses.
- "log1p.number.of.inter.language.links"
Number of links to articles in different language edition of Wikipedia.
- "granularity"
As in article_info.csv (but normalized to standard deviation one).
Facebook
TwitterDig into granular industry data: what it is, how to use it and why you should pay attention to it.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Blockchain data query: Portal Users Final Data Set Granularity
Facebook
TwitterExplore Granular import export trade data. Find top buyers, suppliers, HS codes, ports, & market trends to make smarter, data-driven trade decisions.
Facebook
TwitterThe FR 2510 will collect granular exposure data on the assets, liabilities, and off-balance sheet holdings of U.S. G-SIBs (Global Systemically Important Banks), providing breakdowns by instrument, currency, maturity, and sector. The FR 2510 will also collect data covering detailed positions for each U.S. G-SIB’s top 35 countries of exposure, on an immediate-counterparty basis, as reported in the consolidated Country Exposure Report (FFIEC 009; OMB No. 7100-0035), broken out by instrument and counterparty sector, with limited further breakouts by remaining maturity, subject to a $2 billion minimum threshold for country exposure. Further, the FR 2510 will collect information on financial derivatives by instrument type and foreign exchange derivatives by currency. The FR 2510 will allow the Federal Reserve to conduct a more complete balance sheet analysis of U.S. G-SIBs. Additionally, the FR 2510 will provide the Federal Reserve with valuable systemic information through the collection of more granular data regarding common or correlated exposures and funding dependencies than is currently collected by existing reports by providing more information about U.S. G-SIBs’ consolidated exposures and funding positions to different countries according to instrument, counterparty sector, currency and remaining maturity.
Facebook
TwitterGranular SKU-level transaction data from Measurable AI's proprietary email receipt panel across e-commerce companies in the emerging markets.
Our data is attained with consumer consent from our two consumer apps. We then aggregate and anonymize all the metrics across our panel to produce consumer insights for our end users. Our datasets are available on a granular and aggregate level.
Key clients range from the e-commerce companies themselves, buyside firms, financial institutions, consultancies, market research agencies and academia.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes
Anonymisation
Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy
Dataset
Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours
ID
Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data. <
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it useful to share results as averages or individual?
We decided to share as individual results as the lowest level of granularity
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
<!--·
Water Supply Zone (WSZ) - Limits interoperability
with other datasets
<!--·
Postcode – Some postcodes contain very few
households and may not offer necessary anonymisation
<!--·
Postal Sector – Deemed not granular enough in
highly populated areas
<!--·
Rounded Co-ordinates – Not a recognised standard
and may cause overlapping areas
<!--·
MSOA – Deemed not granular enough
<!--·
LSOA – Agreed as a recognised standard appropriate
for England and Wales
<!--·
Data Zones – Agreed as a recognised standard
appropriate for Scotland
Data Specifications
Each dataset will cover a calendar year of samples
This dataset will be published annually
Historical datasets will be published as far back as 2016 from the introduction of of The Water Supply (Water Quality) Regulations 2016
The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.
Some samples are tested on site and others are sent to scientific laboratories.
Data Publish Frequency
Annually
Data Triage Review Frequency
Annually unless otherwise requested
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
<!--1.
Drinking Water
Inspectorate Standards and Regulations:
<!--2.
https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
<!--3.
LSOA (England
and Wales) and Data Zone (Scotland):
<!--5.
Description
for LSOA boundaries by the ONS: Census
2021 geographies - Office for National Statistics (ons.gov.uk)
<!--[6.
Postcode to
LSOA lookup tables: Postcode
to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer
Super Output Area to Local Authority District (August 2023) Lookup in the UK
(statistics.gov.uk)
<!--7.
Legislation history: Legislation -
Drinking Water Inspectorate (dwi.gov.uk)
Facebook
TwitterThis dataset consists of example model outputs for PFAS removal by GAC. Explicit descriptions of the data can be found in the associated manuscript. This dataset is associated with the following publication: Burkhardt, J., N. Burns, D. Mobley, J. Pressman, M. Magnuson, and T. Speth. Modeling PFAS Removal Using Granular Activated Carbon for Full-Scale System Design. JOURNAL OF ENVIRONMENTAL ENGINEERING. American Society of Civil Engineers (ASCE), Reston, VA, USA, 148(3): 04021086-1, (2022).
Facebook
Twitterhttp://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0
The exchange of carbon between the soil and the atmosphere is an important factor in climate change. Soil organic carbon (SOC) storage is sensitive to land management, soil properties, and climatic conditions, and these data serve as key inputs to computer models projecting SOC change. Farmland has been identified as a sink for atmospheric carbon, and we have previously estimated the potential for SOC sequestration in agricultural soils in Vermont, USA using the Rothamsted Carbon Model. However, fine spatial-scale (high granularity) input data are not always available, which can limit the skill of SOC projections. For example, climate projections are often only available at scales of 10s to 100s of km2. To overcome this, we use a climate projection dataset downscaled to <1 km2 (~18,000 cells). We compare SOC from runs forced by high granularity input data to runs forced by aggregated data averaged over the 11,690 km2 study region. We spin up and run the model individually for each cell in the fine-scale runs and for the region in the aggregated runs factorially over three agricultural land uses and four Global Climate Models.
In this repository are the downscaled climate input data that drive the RothC model, as well as the model outputs for each GCM.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data was reported at 10,822.925 RMB mn in Mar 2025. This records an increase from the previous number of 9,197.844 RMB mn for Feb 2025. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data is updated monthly, averaging 7,378.321 RMB mn from Jan 2015 (Median) to Mar 2025, with 123 observations. The data reached an all-time high of 14,690.047 RMB mn in May 2021 and a record low of 1,856.803 RMB mn in Feb 2016. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data remains active status in CEIC and is reported by General Administration of Customs. The data is categorized under China Premium Database’s International Trade – Table CN.JKF: RMB: HS26: Ores, Slag and Ash.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
520 Global import shipment records of Malic Acid Granular with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This project investigates retraction indexing agreement among data sources: BCI, BIOABS, CCC, Compendex, Crossref, GEOBASE, MEDLINE, PubMed, Retraction Watch, Scopus, and Web of Science Core. Post-retraction citation may be partly due to authors’ and publishers' challenges in systematically identifying retracted publications. To investigate retraction indexing quality, we investigate the agreement in indexing retracted publications between 11 database sources, restricting to their coverage, resulting in a union list of 85,392 unique items. This dataset highlights items that went through a DOI augmentation process to have PubMed added as a source and that have duplicated PMIDs, indicating data quality issues.
Facebook
TwitterGapMaps uses known population data combined with billions of mobile device location points to provide highly accurate and globally consistent GIS data at 150m grid levels across Asia and MENA. Understand who lives in a catchment, where they work and their spending potential.
Facebook
TwitterObserver data span more than two decades and multiple database development interations. To facilitate status of stocks authors and other users of these records, a data set separate from the OLAP was created which reformats and repackages vetted (debriefed) observer data from NORPAC into a common structure, format, and syntax. The granularity of current production data is compromised, however...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1264 Global import shipment records of Granular Mixture with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Facebook
TwitterThe granular formulation of Bayluscide [Bayluscide 3.2% Granular Sea Lamprey Larvicide, granular Bayluscide (gB)] is applied in lentic and lotic systems to survey (assessment) and kill (treatment) larval sea lampreys (Petromyzon marinus) in the Great Lakes basin. Granules are spread on the water surface, settle to the sediment surface, and dissolve. The potential risk of niclosamide exposure [5 Chloro-N-(2-chloro-4-nitrophenyl)-2-hydroxybenzamide], the active ingredient of gB, to non-target organisms located downstream of survey plots, is a concern of partner agencies (State-level Natural Resource Departments, U.S. Fish and Wildlife Service’s Ecological Service, Fisheries and Oceans Canada Species at Risk Branch). Temporal and spatial distribution of niclosamide in the water column and sediment was evaluated in and downstream of five larval survey plots in two rivers following the application of gB. Water samples were collected at 0.25, 2, 4, 6, 8, and 24 h from 3 depths in the water column (10 cm above the sediment, ½ water column depth, water surface) at three locations inside each survey plot, and 1 meter upstream from three sediment sample grids positioned 10, 30 and 100 m downstream. Sediment samples were collected from inside the grids at 0.25, 2, 4, 6, 8, and 24 h, and from inside the survey plots, 8 and 24 h after gB application. Niclosamide was detected in the sediment and water at all sample locations. From 2 to 24 h after application, average water concentrations 1) varied between study sites, 2) decreased from the survey plots to 100 m downstream, 3) varied by depth in the water column, and 4) decreased over time. Average sediment concentrations varied with distance downstream and time post application, but not by study site or river. Data suggests there would be negligible risk to non-target organisms downstream of a gB survey plot based on low niclosamide concentrations measured in the water and sediment. The depletion rate of niclosamide was also evaluated in St. Clair River sediment dosed at the field application rate. Niclosamide concentration decreased at a rate of 2.28% per hour over the 24 hours measured, equating to a half-life of 1.27 days. This indicates the length of time an organism in the sediment in a survey plot might be exposed.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI). Key Definitions AggregationProcess involving summarising or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes Anonymisation Anonymised data is a type of information sanitisation in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy Dataset Structured and organised collection of related elements, often stored digitally, used for analysis and interpretation in various fields. Determinand A constituent or property of drinking water which can be determined or estimated. DWI Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.” DWI Determinands Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included. Granularity Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours ID Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance. LSOA Lower-Level Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households allowing for granular data collection useful for analysis, planning and policy- making while ensuring privacy. ONS Office for National Statistics Open Data Triage The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data. Sample A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards. Schema Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute. Units Standard measurements used to quantify and compare different physical quantities. Water Quality The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature. Data History Data Origin These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database. Data Triage Considerations Granularity Is it useful to share results as averages or individual? We decided to share as individual results as the lowest level of granularity Anonymisation It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed: • Water Supply Zone (WSZ) - Limits interoperability with other datasets • Postcode – Some postcodes contain very few households and may not offer necessary anonymisation • Postal Sector – Deemed not granular enough in highly populated areas • Rounded Co-ordinates – Not a recognised standard and may cause overlapping areas • MSOA – Deemed not granular enough • LSOA – Agreed as a recognised standard appropriate for England and Wales • Data Zones – Agreed as a recognised standard appropriate for Scotland Data Triage Review Frequency Annually unless otherwise requested Publish Frequency Annually Data Specifications • Each dataset will cover a year of samples in calendar year • This dataset will be published annually • Historical datasets will be published as far back as 2016 from the introduction of The Water Supply (Water Quality) Regulations 2016 • The determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate. • A small proportion of samples could not be allocated to an LSOA – these represented less than 0.1% of samples and were removed from the dataset in 2023. • See supplementary information for the lookup table applied to each calendar year of data. Context Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs which means the results may differ to this dataset. Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories. Supplementary information Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset. 1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/ 2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf 3. Description for LSOA boundaries by the ONS: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies4. Postcode to LSOA lookup tables (2022 calendar year data): https://geoportal.statistics.gov.uk/datasets/3770c5e8b0c24f1dbe6d2fc6b46a0b18/about5. Postcode to LSOA lookup tables (2023 calendar year data): https://geoportal.statistics.gov.uk/datasets/b8451168e985446eb8269328615dec62/about6. Postcode to LSOA lookup tables (2024 calendar year data): https://geoportal.statistics.gov.uk/datasets/068ee476727d47a3a7a0d976d4343c59/about7. Legislation history: https://www.dwi.gov.uk/water-companies/legislation/
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Echocardiographic data (n 102).
Facebook
TwitterDiscover published data which is local in nature. A local search will return results which include the statewide dataset, which can then be searched and/or filtered to view a specific locality. For numerous statewide datasets, it provides quick access to local information across a broad range of categories from health to transportation, from recreation to economic development; Find local farmer’s markets, child care regulated facilities, solar installations, food service establishment inspections, and much more. Datasets may be searched on one or more local attributes (e.g., county, city), depending upon the granularity of the data. See the overview document http://on.ny.gov/1SB66oL in the “About” section of the source dataset for ways to search specific localities within Statewide datasets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of the population according to TAPSE/TRV ratio values.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).
The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.
The WikiGrain Data is analyzed in the paper
Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.
===============================================================
Individual files (tables in comma-separated-values-format):
---------------------------------------------------------------
* article_info.csv contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.
- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.
- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.
- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.
- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
The file contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.
- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
The file contains the following variables:
- "id"
Article id.
- "is.FA"
Boolean indicator for whether the article is featured.
- "log1p.length"
Length measured by the number of bytes.
- "age"
Age measured by the time since the first edit.
- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.
- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.
- "log1p.number.of.contributors"
Number of unique contributors to the article.
- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').
- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').
- "number.of.level.1.sections"
Number of first level sections in the article.
- "number.of.level.2.sections"
Number of second level sections in the article.
- "number.of.categories"
Number of categories the article is in.
- "log1p.average.size.of.categories"
Average size of the categories the article is in.
- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.
- "log1p.number.of.external.references"
Number of external references given in the article.
- "log1p.number.of.images"
Number of images in the article.
- "log1p.number.of.templates"
Number of templates that the article uses.
- "log1p.number.of.inter.language.links"
Number of links to articles in different language edition of Wikipedia.
- "granularity"
As in article_info.csv (but normalized to standard deviation one).