100+ datasets found
  1. Data from: Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Cite
    Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Explore at:
    Available download formats: txt, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain data is analyzed in the paper:

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated values (CSV) format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined as the mean granularity of A's categories, where the granularity of a category C is the shortest-path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are narrower and more specific. (A toy sketch of this computation follows this variable list.)

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.
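
    The granularity definition above reduces to a breadth-first search over the subcategory network. A minimal sketch, assuming you have already extracted parent-child subcategory links and article-category assignments from a dump; the toy inputs below are hypothetical:

    from collections import deque

    # Hypothetical toy inputs; WikiGrain itself ships only the derived values.
    subcategory_links = {                    # parent category -> child categories
        "Articles": ["Science", "People"],
        "Science": ["Physics"],
        "Physics": ["Quantum mechanics"],
        "People": ["Physicists"],
    }
    article_categories = {"Niels Bohr": ["Physicists", "Quantum mechanics"]}

    def category_granularity(root="Articles"):
        # granularity(C) = shortest-path distance from the root category to C
        dist = {root: 0}
        queue = deque([root])
        while queue:
            cat = queue.popleft()
            for child in subcategory_links.get(cat, []):
                if child not in dist:        # first visit = shortest distance
                    dist[child] = dist[cat] + 1
                    queue.append(child)
        return dist

    def article_granularity(article, dist):
        # mean granularity over the article's categories
        cats = [c for c in article_categories[article] if c in dist]
        return sum(dist[c] for c in cats) / len(cats)

    dist = category_granularity()
    print(article_granularity("Niels Bohr", dist))   # (2 + 3) / 2 = 2.5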


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be a member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables except "id" and "is.FA" are normalized to a standard deviation of one. Variables whose names have the prefix "log1p." have been transformed by the mapping x --> log(1+x) to make right-skewed distributions 'more normal'. (A sketch of this normalization follows the variable list below.)
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language editions of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).
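
    A minimal sketch of the normalization described above, using pandas; the column values below are illustrative, not taken from the dataset:

    import numpy as np
    import pandas as pd

    raw = pd.DataFrame({
        "id": [1, 2, 3],
        "is.FA": [0, 1, 0],
        "length": [1200, 85000, 400],        # article length in bytes
        "number.of.edits": [3, 2100, 1],
    })

    norm = pd.DataFrame({"id": raw["id"], "is.FA": raw["is.FA"]})
    # Right-skewed counts: x -> log(1 + x), then scale to unit standard deviation.
    for col in ["length", "number.of.edits"]:
        x = np.log1p(raw[col])
        norm["log1p." + col] = x / x.std()
    print(norm)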

  2. Portal Users Final Data Set Granularity

    • dune.com
    Updated Nov 6, 2024
    Cite
    serotonin_data (2024). Portal Users Final Data Set Granularity [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22perpetual.trades%22
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    serotonin_data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: Portal Users Final Data Set Granularity

  3. GIS Data | Asia & MENA | 150m x 150m Grids| Accurate and Granular...

    • datastore.gapmaps.com
    Cite
    GapMaps, GIS Data | Asia & MENA | 150m x 150m Grids| Accurate and Granular Demographics & Point of Interest (POI) Data | Map Data | Demographic Data [Dataset]. https://datastore.gapmaps.com/products/gapmaps-global-gis-data-asia-mena-150m-x-150m-grids-cu-gapmaps
    Explore at:
    Dataset authored and provided by
    GapMaps
    Area covered
    Saudi Arabia, Indonesia, Singapore, Philippines, India, Malaysia
    Description

    GapMaps uses known population data combined with billions of mobile device location points to provide highly accurate and globally consistent GIS data at 150m grid levels across Asia and MENA. Understand who lives in a catchment, where they work and their spending potential.

  4. CMS High Granularity Calorimeter Trigger Cell Simulated Dataset (Part 1)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 5, 2023
    Cite
    Rohan Shenoy; Javier Duarte; Christian Herwig; James Hirschauer; Daniel Noonan; Maurizio Pierini; Nhan Tran; Cristina Mantilla Suarez (2023). CMS High Granularity Calorimeter Trigger Cell Simulated Dataset (Part 1) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8338607
    Explore at:
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    UC San Diego
    CERN
    Fermilab
    Authors
    Rohan Shenoy; Javier Duarte; Christian Herwig; James Hirschauer; Daniel Noonan; Maurizio Pierini; Nhan Tran; Cristina Mantilla Suarez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of simulated events of electron-positron pairs (e+e−) with a flat transverse momentum distribution pT ∈ [1, 200] GeV, produced under Phase 2 conditions with 200 pileup, the V11 geometry, and the HLT TDR Summer20 campaign. The original dataset is CMS-internal.

    This derived dataset in ROOT format contains generator-level particle and simulated detector information. More information about how the dataset is derived is available at this TWiki (CMS-internal).

    A description of each variable is below.

    Variable | Description | Type
    run | Run number | int
    event | Event number | int
    lumi | Luminosity section | int
    gen_n | Number of primary generated particles | int
    gen_PUNumInt | Number of pileup interactions | int
    gen_TrueNumInt | Number of true interactions | float
    vtx_x | Simulated primary vertex x position in cm | float
    vtx_y | Simulated primary vertex y position in cm | float
    vtx_z | Simulated primary vertex z position in cm | float
    gen_eta | Primary generated particle pseudorapidity η | vector
    gen_phi | Primary generated particle azimuthal angle ϕ | vector
    gen_pt | Primary generated particle transverse momentum pT in GeV | vector
    gen_energy | Primary generated particle energy in GeV | vector
    gen_charge | Initial generated particle charge | vector
    gen_pdgid | Primary generated particle PDG ID | vector
    gen_status | Primary generated particle generator status | vector
    gen_daughters | Primary generated particle daughters (empty) | vector of vectors
    genpart_eta | Primary and secondary generated particle pseudorapidity η | vector
    genpart_phi | Primary and secondary generated particle azimuthal angle ϕ | vector
    genpart_pt | Primary and secondary generated particle transverse momentum pT in GeV | vector
    genpart_energy | Primary and secondary generated particle energy in GeV | vector
    genpart_dvx | Primary and secondary generated particle decay vertex x position in cm | vector
    genpart_dvy | Primary and secondary generated particle decay vertex y position in cm | vector
    genpart_dvz | Primary and secondary generated particle decay vertex z position in cm | vector
    genpart_ovy | Primary and secondary generated particle original vertex y position in cm | vector
    genpart_ovz | Primary and secondary generated particle original vertex z position in cm | vector
    genpart_mother | Primary and secondary generated particle parent particle index (-1 indicates no parent) | vector
    genpart_exphi | Primary and secondary generated particle azimuthal angle ϕ extrapolated to the corresponding HGCAL coordinate | vector
    genpart_exeta | Primary and secondary generated particle pseudorapidity η extrapolated to the corresponding HGCAL coordinate | vector
    genpart_exx | Primary and secondary generated particle decay vertex x extrapolated to the corresponding HGCAL coordinate | vector
    genpart_exy | Primary and secondary generated particle decay vertex y extrapolated to the corresponding HGCAL coordinate | vector
    genpart_fbrem | Primary and secondary generated particle decay vertex z extrapolated to the corresponding HGCAL coordinate | vector
    genpart_pid | Primary and secondary generated particle PDG ID | vector
    genpart_gen | Index of associated primary generated particle | vector
    genpart_reachedEE | Primary and secondary generated particle flag: 2 indicates that the particle reached the HGCAL, 1 indicates the particle reached the barrel calorimeter, and 0 indicates other cases | vector
    genpart_fromBeamPipe | Deprecated variable, always true | vector
    genpart_posx | Primary and secondary generated particle position x coordinate in cm | vector of vectors
    genpart_posy | Primary and secondary generated particle position y coordinate in cm | vector of vectors
    genpart_posz | Primary and secondary generated particle position z coordinate in cm | vector of vectors
    ts_n | Number of trigger sums | int
    ts_id | Trigger sum ID | vector
    ts_subdet | Trigger sum subdetector | vector
    ts_zside | Trigger sum endcap (plus or minus endcap) | vector
    ts_layer | Trigger sum layer ID | vector
    ts_wafer | Trigger sum wafer ID | vector
    ts_wafertype | Trigger sum wafer type: 0 indicates fine divisions of wafer with 120 μm thick silicon, 1 indicates coarse divisions of wafer with 200 μm thick silicon, and 2 indicates coarse divisions of wafer with 300 μm thick silicon | vector
    ts_data | Trigger sum ADC value | vector
    ts_pt | Trigger sum transverse momentum in GeV | vector
    ts_mipPt | Trigger sum energy in units of transverse MIP | vector
    ts_energy | Trigger sum energy in GeV | vector
    ts_eta | Trigger sum pseudorapidity η | vector
    ts_phi | Trigger sum azimuthal angle ϕ | vector
    ts_x | Trigger sum x position in cm | vector
    ts_y | Trigger sum y position in cm | vector
    ts_z | Trigger sum z position in cm | vector
    tc_n | Number of trigger cells | int
    tc_id | Trigger cell unique ID | vector
    tc_subdet | Trigger cell subdetector ID (EE, EH silicon, or EH scintillator) | vector
    tc_zside | Trigger cell endcap (plus or minus endcap) | vector
    tc_layer | Trigger cell layer number | vector
    tc_waferu | Trigger cell wafer u coordinate; u-axis points along −x-axis | vector
    tc_waferv | Trigger cell wafer v coordinate; v-axis points at 60 degrees with respect to x-axis | vector
    tc_wafertype | Trigger cell wafer type: 0 indicates fine divisions of wafer with 120 μm thick silicon, 1 indicates coarse divisions of wafer with 200 μm thick silicon, and 2 indicates coarse divisions of wafer with 300 μm thick silicon | vector
    tc_cellu | Trigger cell u coordinate within wafer; u-axis points along −x-axis | vector
    tc_cellv | Trigger cell v coordinate within wafer; v-axis points at 60 degrees with respect to x-axis | vector
    tc_data | Trigger cell ADC data at 21-bit precision after decoding from 7-bit encoding | vector
    tc_uncompressedCharge | Trigger cell ADC data at full precision before compression | vector
    tc_compressedCharge | Trigger cell ADC data compressed into 7-bit encoding | vector
    tc_pt | Trigger cell transverse momentum pT in GeV | vector
    tc_mipPt | Trigger cell energy in units of transverse MIPs | vector
    tc_energy | Trigger cell energy in GeV | vector
    tc_simenergy | Trigger cell energy from simulated particles in GeV | vector
    tc_eta | Trigger cell pseudorapidity η | vector
    tc_phi | Trigger cell azimuthal angle ϕ | vector
    tc_x | Trigger cell x position in cm | vector
    tc_y | Trigger cell y position in cm | vector
    tc_z | Trigger cell z position in cm | vector
    tc_cluster_id | ID of the 2D cluster in which the trigger cell is clustered | vector
    tc_multicluster_id | ID of the 3D cluster in which the trigger cell is clustered | vector
    tc_multicluster_pt | Transverse momentum pT in GeV of the 3D cluster in which the trigger cell is clustered | vector
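
    A minimal sketch for inspecting these ROOT files with uproot. The file name and tree path below are assumptions (they are not given in this listing); list the opened file's keys first to locate the actual event tree:

    import uproot  # pip install uproot

    with uproot.open("hgcal_tc_part1.root") as f:    # hypothetical file name
        print(f.keys())                              # locate the actual tree path
        tree = f["ntuple/tree"]                      # hypothetical tree path
        # Read a few of the branches documented above into awkward arrays.
        arrays = tree.arrays(["gen_pt", "gen_eta", "tc_energy", "tc_layer"])
        print(arrays["tc_energy"][0])                # trigger-cell energies, event 0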
    
  5. Supporting data for "Granularity of model input data impacts estimates of...

    • data.niaid.nih.gov
    • repository.soilwise-he.eu
    • +2 more
    Updated Jun 11, 2024
    Cite
    Wiltshire, Serge; Clemins, Patrick J; Beckage, Brian (2024). Supporting data for "Granularity of model input data impacts estimates of carbon storage in soils" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11261490
    Explore at:
    Dataset updated
    Jun 11, 2024
    Authors
    Wiltshire, Serge; Clemins, Patrick J; Beckage, Brian
    License

    Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0

    Description

    The exchange of carbon between the soil and the atmosphere is an important factor in climate change. Soil organic carbon (SOC) storage is sensitive to land management, soil properties, and climatic conditions, and these data serve as key inputs to computer models projecting SOC change. Farmland has been identified as a sink for atmospheric carbon, and we have previously estimated the potential for SOC sequestration in agricultural soils in Vermont, USA using the Rothamsted Carbon Model. However, fine spatial-scale (high granularity) input data are not always available, which can limit the skill of SOC projections. For example, climate projections are often only available at scales of 10s to 100s of km2. To overcome this, we use a climate projection dataset downscaled to <1 km2 (~18,000 cells). We compare SOC from runs forced by high granularity input data to runs forced by aggregated data averaged over the 11,690 km2 study region. We spin up and run the model individually for each cell in the fine-scale runs and for the region in the aggregated runs factorially over three agricultural land uses and four Global Climate Models.

    In this repository are the downscaled climate input data that drive the RothC model, as well as the model outputs for each GCM.

  6. Portsmouth Water Drinking Water Quality Data 2022 2023 2024

    • hub.arcgis.com
    • streamwaterdata.co.uk
    • +1 more
    Updated Oct 1, 2025
    Cite
    AHughes_Portsmouth (2025). Portsmouth Water Drinking Water Quality Data 2022 2023 2024 [Dataset]. https://hub.arcgis.com/datasets/d3165fd17d624b22a9900d47677dfa45
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    AHughes_Portsmouth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).

    Key Definitions

    Aggregation

    Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes.

    Anonymisation

    Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy

    Dataset

    Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.

    Determinand

    A constituent or property of drinking water which can be determined or estimated.

    DWI

    Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”

    DWI Determinands

    Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.

    Granularity

    Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours.

    ID

    Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.

    LSOA

    Lower Layer Super Output Area is made up of small geographic areas used for statistical and administrative purposes by the Office for National Statistics. It is designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.

    ONS

    Office for National Statistics

    Open Data Triage

    The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.

    Sample

    A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.

    Schema

    Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.

    Units

    Standard measurements used to quantify and compare different physical quantities.

    Water Quality

    The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.

    Data History

    Data Origin

    These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.

    Data Triage Considerations

    Granularity

    Is it more useful to share results as averages or as individual results?

    We decided to share individual results, the lowest level of granularity.

    Anonymisation

    It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

    • Water Supply Zone (WSZ) - Limits interoperability with other datasets
    • Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
    • Postal Sector - Deemed not granular enough in highly populated areas
    • Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
    • MSOA - Deemed not granular enough
    • LSOA - Agreed as a recognised standard appropriate for England and Wales
    • Data Zones - Agreed as a recognised standard appropriate for Scotland

    Data Specifications

    Each dataset will cover a calendar year of samples

    This dataset will be published annually

    Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016.

    The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.

    Context

    Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.

    Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.

    Some samples are tested on site and others are sent to scientific laboratories.

    Data Publish Frequency

    Annually

    Data Triage Review Frequency

    Annually unless otherwise requested

    Supplementary information

    Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.

    1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
    2. LSOA (England and Wales) and Data Zone (Scotland): https://www.nrscotland.gov.uk/files/geography/2011-census/geography-bckground-info-comparison-of-thresholds.pdf
    3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
    4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
    5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)

  7. Data from: PDB Dataset

    • kaggle.com
    zip
    Updated Dec 12, 2024
    Cite
    xie shan (2024). PDB Dataset [Dataset]. https://www.kaggle.com/datasets/xieshan/pdb-dataset
    Explore at:
    Available download formats: zip (668460 bytes)
    Dataset updated
    Dec 12, 2024
    Authors
    xie shan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The data set covers information such as temperature and load. The time span is from March 1, 2003 to December 31, 2014. There are 103,776 records in total, and the data sampling granularity is 1 hour. The long historical time span makes the data set suitable for verifying a model's prediction performance in the context of large-scale historical data.

  8. Data from: Dimension-agnostic and granularity-based spatially variable gene...

    • springernature.figshare.com
    bin
    Updated Nov 15, 2023
    Cite
    Juexin Wang; Jinpu Li; Skyler T. Kramer; Li Su; Yuzhou chang; Chunhui Xu; Michael T. Eadon; Krzsztof Kiryluk; Qin Ma; Dong Xu (2023). Dimension-agnostic and granularity-based spatially variable gene identification using BSP [Dataset]. http://doi.org/10.6084/m9.figshare.24187923.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Juexin Wang; Jinpu Li; Skyler T. Kramer; Li Su; Yuzhou chang; Chunhui Xu; Michael T. Eadon; Krzsztof Kiryluk; Qin Ma; Dong Xu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data used in the paper 'Dimension-agnostic and granularity-based spatially variable gene identification using BSP'

  9. tecnalia/humanet

    • impactcybertrust.org
    Updated Jun 8, 2012
    Cite
    External Data Source (2012). tecnalia/humanet [Dataset]. http://doi.org/10.23721/100/1478897
    Explore at:
    Dataset updated
    Jun 8, 2012
    Authors
    External Data Source
    Description

    Our study analyzes the limitations of Bluetooth-based trace acquisition initiatives carried out until now in terms of granularity and reliability. We then go on to propose an optimal configuration for the acquisition of proximity traces and movement information using a fine-tuned Bluetooth system based on custom HW. With this system and based on such a configuration, we have carried out an intensive human trace acquisition experiment resulting in a proximity and mobility database of more than 5 million traces with a minimum granularity of 5 s. Contact: josemari.cabero@tecnalia.com

  10. Electricity Maps Data Portal: Granular historical electricity data

    • earth.org.uk
    Updated Apr 26, 2025
    Cite
    Electricity Maps (2025). Electricity Maps Data Portal: Granular historical electricity data [Dataset]. https://www.earth.org.uk/bibliography/EMdata.html
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Electricity Maps
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Our data portal allows you to download historical location-based electricity data with hourly granularity for free. Data includes consumption-based emissions factors from both direct operations and life cycle analysis (LCA) for the years 2021-2023. Electricity Maps wants to accelerate decarbonization by making carbon accounting easier and more accurate. The data portal empowers companies to do more accurate and granular carbon accounting by replacing yearly values with monthly, daily, or hourly ones.

  11. Pollution PM2.5 data London 2019 Jan to Apr

    • kaggle.com
    zip
    Updated Jun 11, 2020
    Cite
    Siddharth Nobell (2020). Pollution PM2.5 data London 2019 Jan to Apr [Dataset]. https://www.kaggle.com/siddharthnobell/pollution-pm25-data-london-2019-jan-to-apr
    Explore at:
    Available download formats: zip (126972 bytes)
    Dataset updated
    Jun 11, 2020
    Authors
    Siddharth Nobell
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    London
    Description

    Dataset

    This dataset was created by Siddharth Nobell

    Released under GPL 2


  12. HPC-ODA Dataset Collection

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). HPC-ODA Dataset Collection [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3701440?locale=pt
    Explore at:
    Available download formats: unknown (1483441742)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of several real-world use cases in the field of Operational Data Analytics (ODA) for the improvement of reliability and energy efficiency. The datasets are composed of monitoring sensor data, acquired from the components of different HPC systems depending on the specific use case. Two tools, whose overhead is proven to be very light, were used to acquire data in HPC-ODA: the DCDB and LDMS monitoring frameworks.

    The aim of HPC-ODA is to provide several vertical slices (here named segments) of the monitoring data available in a large-scale HPC installation. The segments all have different granularities, in terms of data sources and time scale, and provide several use cases on which models and approaches to data processing can be evaluated. While having a production dataset from a whole HPC system - from the infrastructure down to the CPU core level - at a fine time granularity would be ideal, this is often not feasible due to the confidentiality of the data, as well as the sheer amount of storage space required. HPC-ODA includes 5 different segments:

    • Power Consumption Prediction: a fine-granularity dataset collected from a single compute node in a HPC system. It contains both node-level data and per-CPU core metrics, and can be used for regression tasks such as power consumption prediction.

    • Fault Detection: a medium-granularity dataset collected from a single compute node while it was subjected to fault injection. It contains only node-level data, plus labels for the applications and faults being executed on the HPC node over time. It can be used for fault classification.

    • Application Classification: a medium-granularity dataset collected from 16 compute nodes in a HPC system while running different parallel MPI applications. Data is at the compute-node level, separated for each node, and is paired with the labels of the applications being executed. It can be used for tasks such as application classification.

    • Infrastructure Management: a coarse-granularity dataset containing cluster-wide data from a HPC system, about its warm-water cooling system as well as power consumption. The data is at the rack level and can be used for regression tasks such as outlet water temperature or removed-heat prediction.

    • Cross-architecture: a medium-granularity variant of the Application Classification dataset that shares the same ODA use case; here, single-node configurations of the applications were executed on three compute node types with different CPU architectures. It can be used for cross-architecture application classification or performance comparison studies.

    The HPC-ODA dataset collection includes a readme document containing all necessary usage information, as well as a lightweight Python framework to carry out the ODA tasks described for each dataset.
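
    As an illustration of the first segment's use case, a minimal regression sketch; the file and column names below are hypothetical, and the actual sensor names are documented in the collection's readme:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("power_segment.csv")            # hypothetical file name
    X = df.drop(columns=["node_power_watts"])        # hypothetical sensor columns
    y = df["node_power_watts"]                       # hypothetical regression target
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100).fit(X_tr, y_tr)
    print("held-out R^2:", model.score(X_te, y_te))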

  13. Data for Appendix 7 PMID Duplication in the Union List of "Analyzing the...

    • databank.illinois.edu
    Updated Nov 19, 2025
    Cite
    Corinne McCumber; Malik Oyewale Salami (2025). Data for Appendix 7 PMID Duplication in the Union List of "Analyzing the consistency of retraction indexing" [Dataset]. http://doi.org/10.13012/B2IDB-7805651_V1
    Explore at:
    Dataset updated
    Nov 19, 2025
    Authors
    Corinne McCumber; Malik Oyewale Salami
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    University of Illinois Urbana-Champaign Center for Advanced Study
    Alfred P. Sloan Foundation
    University of Wisconsin-Madison College of Letters & Science
    Description

    This project investigates retraction indexing agreement among data sources: BCI, BIOABS, CCC, Compendex, Crossref, GEOBASE, MEDLINE, PubMed, Retraction Watch, Scopus, and Web of Science Core. Post-retraction citation may be partly due to authors' and publishers' challenges in systematically identifying retracted publications. To investigate retraction indexing quality, we assess the agreement in indexing retracted publications between the 11 database sources, restricted to their coverage, resulting in a union list of 85,392 unique items. This dataset highlights items that went through a DOI augmentation process to have PubMed added as a source and that have duplicated PMIDs, indicating data quality issues.

  14. United Utilities Domestic Drinking Water Quality 2023-2024

    • streamwaterdata.co.uk
    • hub.arcgis.com
    Updated Sep 23, 2025
    Cite
    UnitedUtilities3 (2025). United Utilities Domestic Drinking Water Quality 2023-2024 [Dataset]. https://www.streamwaterdata.co.uk/items/da952fcae81b4c4aa82c384f14e50dbc
    Explore at:
    Dataset updated
    Sep 23, 2025
    Dataset authored and provided by
    UnitedUtilities3
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Origin: Samples were taken from customer taps. They were then analysed, and the results were uploaded to a database. This dataset is an extract from this database.

    Data Triage Considerations:

    Granularity: We decided to share individual results, the lowest level of granularity.

    Anonymisation: It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:

    • Water Supply Zone (WSZ) - Limits interoperability with other datasets
    • Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
    • Postal Sector - Deemed not granular enough in highly populated areas
    • Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
    • MSOA - Deemed not granular enough
    • LSOA - Agreed as a recognised standard appropriate for England and Wales
    • Data Zones - Agreed as a recognised standard appropriate for Scotland

    Data Specifications:

    • Each dataset will cover a calendar year of samples
    • This dataset will be published annually
    • The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate

    Context: Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset. Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories. Prior to undertaking analysis on any new instruments or utilising new analytical techniques, the laboratory undertakes validation of the equipment to ensure it continues to meet the regulatory requirements. This means that the limit of quantification may change for the method, either increasing or decreasing from the previous value. Any results below the limit of quantification will be reported as < with a number. For example, a limit of quantification change from <0.68 mg/l to <2.4 mg/l does not mean that there has been a deterioration in the quality of the water supplied.

    Data Publishing Frequency: Annually

    Supplementary information: Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset:

    • Drinking Water Inspectorate Standards and Regulations
    • Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics
    • Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output (February 2024)
    • Legislation history: Legislation - Drinking Water Inspectorate
    • Information about lead pipes: Lead pipes and lead in your water - United Utilities

    Dataset Schema:

    • SAMPLE_ID: Identity of the sample
    • SAMPLE_DATE: The date the sample was taken
    • DETERMINAND: The determinand being measured
    • DWI_CODE: The corresponding DWI code for the determinand
    • UNITS: The expression of results
    • OPERATOR: The measurement operator for limit of detection
    • RESULT: The test results
    • LSOA: Lower Layer Super Output Area (population-weighted centroids used by the Office for National Statistics (ONS) for geo-anonymisation)
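
    A minimal sketch for working with an extract that follows the schema above, separating censored results (reported with the "<" operator at the limit of quantification) from fully quantified ones; the file name is illustrative:

    import pandas as pd

    df = pd.read_csv("uu_water_quality_2023.csv")      # hypothetical file name
    censored = df["OPERATOR"].eq("<")                  # below limit of quantification
    df["RESULT_NUM"] = pd.to_numeric(df["RESULT"], errors="coerce")
    # Mean of fully quantified results per determinand, keeping units attached.
    summary = (df.loc[~censored]
                 .groupby(["DETERMINAND", "UNITS"])["RESULT_NUM"]
                 .mean())
    print(summary.head())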

  15. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Oct 20, 2022
    Cite
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
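
    A minimal sketch (the file name is illustrative; substitute any of the shipped CSV exports):

    import pandas as pd

    daily = pd.read_csv("daily_fitbit_sema_surveys.csv")   # hypothetical file name
    print(daily.shape)
    print(daily.head())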

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed (available from the MongoDB website).

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path-to-fitbit-dump>

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema <path-to-sema-dump>

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys <path-to-surveys-dump>

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
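
    Once restored, the database can be queried from Python; a minimal sketch using pymongo, with names matching the mongorestore commands above:

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("localhost", 27017)
    db = client["rais_anonymized"]
    print(db.list_collection_names())      # expected: fitbit, sema, surveys
    print(db["fitbit"].find_one())         # inspect one document's structure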

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  16. Age of Empires 2: DE Match Data

    • kaggle.com
    Updated Nov 7, 2022
    Cite
    Nico Elbert (2022). Age of Empires 2: DE Match Data [Dataset]. https://www.kaggle.com/datasets/nicoelbert/aoe-matchups/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nico Elbert
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset contains roughly 225,000 matches played in Age of Empires 2: Definitive Edition at different granularities, plus connected master data. The current version contains 3 levels:

    • Match Level: Match ID, Map, Map Size, Duration, Mean Elo, Civilizations, Starting Positions and Outcomes, with one row per game.
    • Time Slice Level: the aggregated commands of type "Queue", "Build" and "Research" made up to a certain time in the game, with one row per game and one file per time slice. Games are sliced into 120-second slices.
    • Input Level: data about all decisions made in a game, with one row per input and one file per game.

    The information was collected by scraping and parsing AoE2:DE matches using https://github.com/happyleavesaoc/aoc-mgz. The code for the underlying work can be found at https://github.com/nicoelbert/rtsgamestates.
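
    As an illustration of the time-slice structure, a minimal sketch that buckets raw input events into the 120-second slices described above; the column names are hypothetical:

    import pandas as pd

    inputs = pd.DataFrame({                      # hypothetical input-level rows
        "match_id": [1, 1, 1],
        "timestamp_s": [5, 130, 250],
        "command": ["Build", "Queue", "Research"],
    })
    inputs["slice"] = inputs["timestamp_s"] // 120   # slice 0 covers 0-119 s, etc.
    per_slice = inputs.groupby(["match_id", "slice"])["command"].count()
    print(per_slice)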

    Stay posted; if you have any questions, feel free to get in touch.

  17. Time‑Series Database For Network Telemetry Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Time‑Series Database For Network Telemetry Market Research Report 2033 [Dataset]. https://dataintelo.com/report/timeseries-database-for-network-telemetry-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Time‑Series Database for Network Telemetry Market Outlook

    According to our latest research, the global Time-Series Database for Network Telemetry market size in 2024 reached USD 1.23 billion, reflecting the rapid adoption of advanced database solutions for real-time network management. The market is experiencing robust expansion, with a CAGR of 19.7% projected over the forecast period. By 2033, the market is expected to attain a value of USD 5.94 billion, driven by the imperative need for scalable, high-performance data management platforms to support increasingly complex network infrastructures. The primary growth factor is the surge in network traffic, the proliferation of IoT devices, and the escalating demand for actionable network insights in real time.

    A key driver behind the exponential growth of the Time-Series Database for Network Telemetry market is the unprecedented expansion of digital transformation initiatives across industries. Enterprises and service providers are generating massive volumes of telemetry data from network devices, applications, and endpoints. Traditional relational databases are ill-equipped to handle the high velocity and granularity of time-stamped data required for effective network telemetry. Time-series databases, purpose-built for this data type, enable organizations to ingest, process, and analyze millions of data points per second, facilitating proactive network management. The shift towards cloud-native architectures, edge computing, and the adoption of 5G networks further amplify the need for efficient telemetry data storage and analytics, reinforcing the critical role of time-series databases in modern network operations.

    Another significant growth factor is the rising complexity of network environments, spurred by the advent of hybrid and multi-cloud deployments. As organizations embrace distributed infrastructures and software-defined networking, the challenge of monitoring, diagnosing, and optimizing network performance becomes more acute. Time-series databases for network telemetry empower IT teams with the ability to correlate historical and real-time data, detect anomalies, and automate fault management. This capability is particularly vital for sectors such as telecommunications, IT service providers, and large enterprises, where network downtime or performance degradation can have substantial financial and reputational repercussions. The integration of artificial intelligence and machine learning with time-series databases is also enabling advanced predictive analytics, further enhancing operational efficiency and network reliability.

    The growing emphasis on network security and compliance is another pivotal factor fueling the adoption of time-series databases for network telemetry. With cyber threats becoming more sophisticated and regulatory requirements tightening, organizations must maintain comprehensive visibility into network activities and ensure rapid incident detection and response. Time-series databases provide the high-resolution data capture and retention necessary for security analytics, forensic investigations, and regulatory audits. As network telemetry evolves to encompass not only performance metrics but also security events and policy violations, the demand for scalable and secure time-series database solutions is expected to surge across both public and private sectors.

    From a regional perspective, North America currently dominates the Time-Series Database for Network Telemetry market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology vendors, early adoption of advanced network management solutions, and substantial investments in digital infrastructure. However, the Asia Pacific region is poised for the fastest growth, with a projected CAGR of 22.4% through 2033, driven by rapid urbanization, expanding telecommunications networks, and increasing enterprise digitization. Europe and the Middle East & Africa are also witnessing steady growth, supported by government initiatives to modernize network infrastructure and enhance cybersecurity capabilities.

    Database Type Analysis

    The Database Type segment of the Time-Series Database for Network Telemetry market is bifurcated into Open Source and Commercial solutions, each catering to distinct

  18. China CN: Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average...

    • ceicdata.com
    Updated Dec 15, 2024
    Cite
    CEICdata.com (2024). China CN: Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm [Dataset]. https://www.ceicdata.com/en/china/rmb-hs26-ores-slag-and-ash/cn-import-hs-8-nonagglomerated-iron-ores-and-concentrates-average-granularity08mm
    Explore at:
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2024 - Dec 1, 2024
    Area covered
    China
    Description

    China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data was reported at 10,822.925 RMB mn in Mar 2025. This records an increase from the previous number of 9,197.844 RMB mn for Feb 2025. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data is updated monthly, averaging 7,378.321 RMB mn from Jan 2015 (Median) to Mar 2025, with 123 observations. The data reached an all-time high of 14,690.047 RMB mn in May 2021 and a record low of 1,856.803 RMB mn in Feb 2016. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data remains active status in CEIC and is reported by General Administration of Customs. The data is categorized under China Premium Database’s International Trade – Table CN.JKF: RMB: HS26: Ores, Slag and Ash.

  19. National Youth in Transition Database - Outcomes Survey

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Mar 26, 2025
    Cite
    ACF (2025). National Youth in Transition Database - Outcomes Survey [Dataset]. https://catalog.data.gov/dataset/national-youth-in-transition-database-outcomes-survey
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    ACF
    Description

    States report information from two reporting populations: (1) the Served Population, which is information on all youth receiving at least one independent living service paid for or provided by the Chafee Program agency, and (2) youth completing the NYTD Survey. States survey youth regarding six outcomes: financial self-sufficiency, experience with homelessness, educational attainment, positive connections with adults, high-risk behaviors, and access to health insurance. States collect outcomes information by conducting a survey of youth in foster care on or around their 17th birthday, also referred to as the baseline population. States will track these youth as they age and conduct a new outcome survey on or around the youth's 19th birthday, and again on or around the youth's 21st birthday, also referred to as the follow-up population. States will collect outcomes information on these older youth at ages 19 or 21 regardless of their foster care status or whether they are still receiving independent living services from the State. Depending on the size of the State's foster care youth population, some States may conduct a random sample of the baseline population of the 17-year-olds that participate in the outcomes survey so that they can follow a smaller group of youth as they age. All States will collect and report outcome information on a new baseline population cohort every three years.

    Units of Response: Current and former youth in foster care
    Type of Data: Survey
    Tribal Data: No
    Periodicity: Annual
    Demographic Indicators: Ethnicity; Race; Sex
    SORN: Not Applicable
    Data Use Agreement: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm
    Data Use Agreement Location: https://www.ndacan.acf.hhs.gov/datasets/order_forms/termsofuseagreement.pdf
    Granularity: Individual
    Spatial: United States
    Geocoding: State

  20. Data from: Learning to Reason over Multi-Granularity Knowledge Graph for...

    • zenodo.org
    bin, zip
    Updated Aug 6, 2025
    Cite
    Yansheng Li; Yu Wang; Yansheng Li; Yu Wang (2025). Learning to Reason over Multi-Granularity Knowledge Graph for Zero-shot Urban Land-Use Mapping [Dataset]. http://doi.org/10.5281/zenodo.11311869
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yansheng Li; Yu Wang; Yansheng Li; Yu Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project provides the datasets required for the mKGR paper, including the original spatial vector files and the constructed knowledge graph files. You can directly download KnowledgeGraph.zip for training and validation, or download OriShapefile.zip to build the MKG from scratch.

    Additionally, we provide two products generated by mKGR for the entirety of China: ChinaLandUse.gpkg, a nationwide land-use map, and China15min.gpkg, a nationwide 15-minute-city walkability product.
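
    Since both products ship as GeoPackage files, here is a short Python sketch of one way to inspect and load them. The tooling (geopandas and fiona) is an assumption, not part of the record, and the layer names are undocumented, so the sketch lists them instead of guessing.

    # Minimal sketch: inspect and load the mKGR products with geopandas.
    import fiona
    import geopandas as gpd

    # A GeoPackage can hold several layers; enumerate them before reading.
    for path in ("ChinaLandUse.gpkg", "China15min.gpkg"):
        print(path, fiona.listlayers(path))

    # Read the default (first) layer of the land-use product.
    land_use = gpd.read_file("ChinaLandUse.gpkg")
    print(land_use.crs)     # coordinate reference system
    print(land_use.head())  # preview of the attribute table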

Cite
Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175

Data from: Wikipedia Category Granularity (WikiGrain) data

Related Article
2 scholarly articles cite this dataset (View in Google Scholar)
Explore at:
Available download formats: txt, csv
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jürgen Lerner; Jürgen Lerner
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia Foundation, licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

The WikiGrain Data is analyzed in the paper

Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

===============================================================
Individual files (tables in comma-separated-values-format):

---------------------------------------------------------------
* article_info.csv contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "granularity"
(decimal) The granularity of an article A is defined as the average (mean) granularity of A's categories, where the granularity of a category C is the shortest-path distance from the root category (Category:Articles) to C in the parent-child subcategory network. Higher granularity values indicate articles whose topics are less general, i.e. narrower and more specific; a minimal sketch of this computation follows the variable list below.

- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.

- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
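
To make the "granularity" computation concrete, here is a minimal Python sketch. It is an illustration rather than the authors' code: it assumes the parent-child subcategory network is available as a children_of adjacency dictionary and that each article's category set is known; both names are hypothetical.

# Sketch of the granularity measure defined in article_info.csv above.
# children_of maps a category title to the list of its subcategories
# (a hypothetical structure built from the Wikipedia category tables).
from collections import deque

def category_depths(children_of, root="Category:Articles"):
    """Shortest parent-to-child distance from the root to every category (BFS)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in children_of.get(cat, ()):
            if child not in depth:  # first BFS visit = shortest distance
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth

def article_granularity(article_categories, depth):
    """Mean depth of an article's categories, as in the 'granularity' variable."""
    depths = [depth[c] for c in article_categories if c in depth]
    return sum(depths) / len(depths) if depths else None

For instance, an article whose two categories sit at depths 4 and 6 receives granularity 5.0.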


---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLCs. An article can thus be a member of several TLCs; a sketch of this computation follows the variable list below.
The file contains the following variables:

- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
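
The closest-TLC relation can be reproduced with one breadth-first search per TLC. The sketch below is again hypothetical (children_of as in the previous sketch); the final category-to-article step is uniform over TLCs and is therefore omitted, since it cannot change which TLCs are closest.

# Sketch of the closest-TLC membership defined for article_to_tlc.csv.
from collections import deque

def distances_from(tlc, children_of):
    """Shortest path length from one TLC to each of its descendant categories."""
    dist = {tlc: 0}
    queue = deque([tlc])
    while queue:
        cat = queue.popleft()
        for child in children_of.get(cat, ()):
            if child not in dist:
                dist[child] = dist[cat] + 1
                queue.append(child)
    return dist

def closest_tlcs(article_categories, tlc_distances):
    """TLCs at minimal distance to any of the article's categories (possibly several)."""
    best = {}
    for tlc, dist in tlc_distances.items():  # tlc -> {category: distance}
        reachable = [dist[c] for c in article_categories if c in dist]
        if reachable:
            best[tlc] = min(reachable)
    if not best:
        return []
    minimum = min(best.values())
    return sorted(tlc for tlc, d in best.items() if d == minimum)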

---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables except "id" and "is.FA" are normalized to a standard deviation of one. Variables whose names carry the prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
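
A minimal pandas sketch of this preprocessing, assuming a raw table shaped like article_info.csv (the column chosen for the log1p transform is illustrative, not the authors' pipeline):

import numpy as np
import pandas as pd

df = pd.read_csv("article_info.csv")

# Right-skewed count variables are transformed by x --> log(1 + x) first.
df["log1p.number.of.revisions"] = np.log1p(df["number.of.revisions"])

# Every numeric variable except "id" and "is.FA" is then scaled to a
# standard deviation of one (values keep their sign and relative order).
exclude = {"id", "is.FA"}
for col in df.columns:
    if col not in exclude and pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col] / df[col].std()
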
The file contains the following variables:

- "id"
Article id.

- "is.FA"
Boolean indicator for whether the article is featured.

- "log1p.length"
Length measured by the number of bytes.

- "age"
Age measured by the time since the first edit.

- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.

- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.

- "log1p.number.of.contributors"
Number of unique contributors to the article.

- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').

- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').

- "number.of.level.1.sections"
Number of first level sections in the article.

- "number.of.level.2.sections"
Number of second level sections in the article.

- "number.of.categories"
Number of categories the article is in.

- "log1p.average.size.of.categories"
Average size of the categories the article is in.

- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.

- "log1p.number.of.external.references"
Number of external references given in the article.

- "log1p.number.of.images"
Number of images in the article.

- "log1p.number.of.templates"
Number of templates that the article uses.

- "log1p.number.of.inter.language.links"
Number of links to articles in different language editions of Wikipedia.

- "granularity"
As in article_info.csv (but normalized to standard deviation one).
