License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).
The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.
The WikiGrain Data is analyzed in the paper
Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.
===============================================================
Individual files (tables in comma-separated-values-format):
---------------------------------------------------------------
* article_info.csv contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.
- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') else.
- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') else.
- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.
- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be a member of several TLC.
The file contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.
- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA", are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
The file contains the following variables:
- "id"
Article id.
- "is.FA"
Boolean indicator for whether the article is featured.
- "log1p.length"
Length measured by the number of bytes.
- "age"
Age measured by the time since the first edit.
- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.
- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.
- "log1p.number.of.contributors"
Number of unique contributors to the article.
- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').
- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').
- "number.of.level.1.sections"
Number of first level sections in the article.
- "number.of.level.2.sections"
Number of second level sections in the article.
- "number.of.categories"
Number of categories the article is in.
- "log1p.average.size.of.categories"
Average size of the categories the article is in.
- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.
- "log1p.number.of.external.references"
Number of external references given in the article.
- "log1p.number.of.images"
Number of images in the article.
- "log1p.number.of.templates"
Number of templates that the article uses.
- "log1p.number.of.inter.language.links"
Number of links to articles in different language editions of Wikipedia.
- "granularity"
As in article_info.csv (but normalized to standard deviation one).
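For orientation, a minimal Python/pandas sketch of how the tables could be combined (assuming the CSV files are in the working directory; column names as documented above):

    import pandas as pd

    info = pd.read_csv("article_info.csv")
    tlc = pd.read_csv("article_to_tlc.csv")

    # Mean granularity of featured vs. non-featured articles
    print(info.groupby("is.FA")["granularity"].mean())

    # Attach the titles of the closest top-level categories to each article
    merged = tlc.merge(info[["id", "granularity", "is.FA"]], on="id", how="left")
    print(merged.head())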
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Blockchain data query: Portal Users Final Data Set Granularity
===============================================================
GapMaps uses known population data combined with billions of mobile device location points to provide highly accurate and globally consistent GIS data at 150m grid levels across Asia and MENA. Understand who lives in a catchment, where they work and their spending potential.
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The dataset consists of simulated events of electron-positron pairs (e+e−) with a flat transverse momentum distribution, pT ∈ [1, 200] GeV, produced with Phase 2 conditions, 200 pileup, the V11 geometry, and the HLT TDR Summer20 campaign. The original dataset is CMS-internal.
This derived dataset in ROOT format contains generator-level particle and simulated detector information. More information about how the dataset is derived is available at this TWiki (CMS-internal).
A description of each variable is given below, in the format "variable (type): description".

run (int): Run number
event (int): Event number
lumi (int): Luminosity section
gen_n (int): Number of primary generated particles
gen_PUNumInt (int): Number of pileup interactions
gen_TrueNumInt (float): Number of true interactions
vtx_x (float): Simulated primary vertex x position in cm
vtx_y (float): Simulated primary vertex y position in cm
vtx_z (float): Simulated primary vertex z position in cm
gen_eta (vector): Primary generated particle pseudorapidity η
gen_phi (vector): Primary generated particle azimuthal angle ϕ
gen_pt (vector): Primary generated particle transverse momentum pT in GeV
gen_energy (vector): Primary generated particle energy in GeV
gen_charge (vector): Initial generated particle charge
gen_pdgid (vector): Primary generated particle PDG ID
gen_status (vector): Primary generated particle generator status
gen_daughters (vector of vectors): Primary generated particle daughters (empty)
genpart_eta (vector): Primary and secondary generated particle pseudorapidity η
genpart_phi (vector): Primary and secondary generated particle azimuthal angle ϕ
genpart_pt (vector): Primary and secondary generated particle transverse momentum pT in GeV
genpart_energy (vector): Primary and secondary generated particle energy in GeV
genpart_dvx (vector): Primary and secondary generated particle decay vertex x position in cm
genpart_dvy (vector): Primary and secondary generated particle decay vertex y position in cm
genpart_dvz (vector): Primary and secondary generated particle decay vertex z position in cm
genpart_ovy (vector): Primary and secondary generated particle original vertex y position in cm
genpart_ovz (vector): Primary and secondary generated particle original vertex z position in cm
genpart_mother (vector): Primary and secondary generated particle parent particle index (-1 indicates no parent)
genpart_exphi (vector): Primary and secondary generated particle azimuthal angle ϕ extrapolated to the corresponding HGCAL coordinate
genpart_exeta (vector): Primary and secondary generated particle pseudorapidity η extrapolated to the corresponding HGCAL coordinate
genpart_exx (vector): Primary and secondary generated particle decay vertex x extrapolated to the corresponding HGCAL coordinate
genpart_exy (vector): Primary and secondary generated particle decay vertex y extrapolated to the corresponding HGCAL coordinate
genpart_fbrem (vector): Primary and secondary generated particle decay vertex z extrapolated to the corresponding HGCAL coordinate
genpart_pid (vector): Primary and secondary generated particle PDG ID
genpart_gen (vector): Index of associated primary generated particle
genpart_reachedEE (vector): Primary and secondary generated particle flag: 2 indicates that the particle reached the HGCAL, 1 indicates the particle reached the barrel calorimeter, and 0 indicates other cases
genpart_fromBeamPipe (vector): Deprecated variable, always true
genpart_posx (vector of vectors): Primary and secondary generated particle position x coordinate in cm
genpart_posy (vector of vectors): Primary and secondary generated particle position y coordinate in cm
genpart_posz (vector of vectors): Primary and secondary generated particle position z coordinate in cm
ts_n (int): Number of trigger sums
ts_id (vector): Trigger sum ID
ts_subdet (vector): Trigger sum subdetector
ts_zside (vector): Trigger sum endcap (plus or minus endcap)
ts_layer (vector): Trigger sum layer ID
ts_wafer (vector): Trigger sum wafer ID
ts_wafertype (vector): Trigger sum wafer type: 0 indicates fine divisions of wafer with 120 μm thick silicon, 1 indicates coarse divisions of wafer with 200 μm thick silicon, and 2 indicates coarse divisions of wafer with 300 μm thick silicon
ts_data (vector): Trigger sum ADC value
ts_pt (vector): Trigger sum transverse momentum in GeV
ts_mipPt (vector): Trigger sum energy in units of transverse MIP
ts_energy (vector): Trigger sum energy in GeV
ts_eta (vector): Trigger sum pseudorapidity η
ts_phi (vector): Trigger sum azimuthal angle ϕ
ts_x (vector): Trigger sum x position in cm
ts_y (vector): Trigger sum y position in cm
ts_z (vector): Trigger sum z position in cm
tc_n (int): Number of trigger cells
tc_id (vector): Trigger cell unique ID
tc_subdet (vector): Trigger cell subdetector ID (EE, EH silicon, or EH scintillator)
tc_zside (vector): Trigger cell endcap (plus or minus endcap)
tc_layer (vector): Trigger cell layer number
tc_waferu (vector): Trigger cell wafer u coordinate; u-axis points along −x-axis
tc_waferv (vector): Trigger cell wafer v coordinate; v-axis points at 60 degrees with respect to x-axis
tc_wafertype: Trigger cell wafer type: 0 indicates fine divisions of wafer with 120 μm thick silicon, 1 indicates coarse divisions of wafer with 200 μm thick silicon, and 2 indicates coarse divisions of wafer with 300 μm thick silicon
tc_cellu (vector): Trigger cell u coordinate within wafer; u-axis points along −x-axis
tc_cellv (vector): Trigger cell v coordinate within wafer; v-axis points at 60 degrees with respect to x-axis
tc_data (vector): Trigger cell ADC data at 21-bit precision after decoding from 7-bit encoding
tc_uncompressedCharge (vector): Trigger cell ADC data at full precision before compression
tc_compressedCharge (vector): Trigger cell ADC data compressed into 7-bit encoding
tc_pt (vector): Trigger cell transverse momentum pT in GeV
tc_mipPt (vector): Trigger cell energy in units of transverse MIPs
tc_energy (vector): Trigger cell energy in GeV
tc_simenergy (vector): Trigger cell energy from simulated particles in GeV
tc_eta (vector): Trigger cell pseudorapidity η
tc_phi (vector): Trigger cell azimuthal angle ϕ
tc_x (vector): Trigger cell x position in cm
tc_y (vector): Trigger cell y position in cm
tc_z (vector): Trigger cell z position in cm
tc_cluster_id (vector): ID of the 2D cluster in which the trigger cell is clustered
tc_multicluster_id (vector): ID of the 3D cluster in which the trigger cell is clustered
tc_multicluster_pt (vector): Transverse momentum pT in GeV of the 3D cluster in which the trigger cell is clustered
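As a quick, hedged illustration (not part of the dataset documentation), the ROOT files can be inspected in Python with uproot; the file name and tree path below are placeholders and should be replaced after listing the file contents:

    import uproot

    # Placeholder file name and tree path -- list the file's keys to find the actual tree.
    with uproot.open("ntuple.root") as f:
        print(f.keys())  # shows the trees/directories actually stored in the file
        tree = f["hgcalTriggerNtuplizer/HGCalTriggerNtuple"]  # assumed tree path, adjust as needed
        # Read a few of the branches documented above into awkward (jagged) arrays
        arrays = tree.arrays(["gen_pt", "gen_eta", "tc_energy"], library="ak")
        print(arrays["gen_pt"][:5])  # pT of primary generated particles, first 5 events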
===============================================================
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
The exchange of carbon between the soil and the atmosphere is an important factor in climate change. Soil organic carbon (SOC) storage is sensitive to land management, soil properties, and climatic conditions, and these data serve as key inputs to computer models projecting SOC change. Farmland has been identified as a sink for atmospheric carbon, and we have previously estimated the potential for SOC sequestration in agricultural soils in Vermont, USA using the Rothamsted Carbon Model. However, fine spatial-scale (high granularity) input data are not always available, which can limit the skill of SOC projections. For example, climate projections are often only available at scales of 10s to 100s of km2. To overcome this, we use a climate projection dataset downscaled to <1 km2 (~18,000 cells). We compare SOC from runs forced by high granularity input data to runs forced by aggregated data averaged over the 11,690 km2 study region. We spin up and run the model individually for each cell in the fine-scale runs and for the region in the aggregated runs factorially over three agricultural land uses and four Global Climate Models.
In this repository are the downscaled climate input data that drive the RothC model, as well as the model outputs for each GCM.
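For readers who want the factorial run design spelled out, here is a minimal sketch; the run_rothc wrapper, land-use names, and GCM names are placeholders, not code or identifiers shipped with this repository:

    from itertools import product

    land_uses = ["land_use_A", "land_use_B", "land_use_C"]  # placeholders for the three agricultural land uses
    gcms = ["GCM_1", "GCM_2", "GCM_3", "GCM_4"]              # placeholders for the four Global Climate Models
    n_cells = 18_000                                         # approximate number of <1 km2 climate cells

    def run_rothc(climate_input, land_use, gcm):
        """Hypothetical wrapper: spin up and run RothC for one climate input, land use and GCM."""
        return 0.0  # placeholder for projected SOC

    # Fine-scale runs: one model run per climate cell, crossed with land uses and GCMs
    fine_scale = {(c, lu, g): run_rothc(c, lu, g) for c, lu, g in product(range(n_cells), land_uses, gcms)}

    # Aggregated runs: a single region-averaged climate input, crossed with the same factors
    aggregated = {(lu, g): run_rothc("region_average", lu, g) for lu, g in product(land_uses, gcms)}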
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes.
Anonymisation
Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy.
Dataset
Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours.
ID
Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower Layer Super Output Area, a small geographic area used for statistical and administrative purposes by the Office for National Statistics. LSOAs are designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it useful to share results as averages or as individual results?
We decided to share individual results, the lowest level of granularity.
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
- Water Supply Zone (WSZ) - Limits interoperability with other datasets
- Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
- Postal Sector - Deemed not granular enough in highly populated areas
- Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
- MSOA - Deemed not granular enough
- LSOA - Agreed as a recognised standard appropriate for England and Wales
- Data Zones - Agreed as a recognised standard appropriate for Scotland
Data Specifications
Each dataset will cover a calendar year of samples
This dataset will be published annually
Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016.
The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.
Some samples are tested on site and others are sent to scientific laboratories.
Data Publish Frequency
Annually
Data Triage Review Frequency
Annually unless otherwise requested
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
- Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
- LSOA (England and Wales) and Data Zone (Scotland):
  - Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
  - Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
- Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
===============================================================
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
The dataset covers information such as temperature and load. The time span is from March 1, 2003 to December 31, 2014; there are 103,776 records in total, and the data sampling granularity is 1 hour. The dataset's long historical time span can be used to verify a model's prediction performance in the context of large-scale historical data.
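As a hedged sketch of working with data at this granularity (the file and column names are placeholders, since the file layout is not described above), the hourly records can be loaded and re-aggregated with pandas:

    import pandas as pd

    # Placeholder file/column names
    df = pd.read_csv("load_temperature.csv", parse_dates=["timestamp"], index_col="timestamp")
    print(len(df))                   # expected to be on the order of the 103,776 hourly records
    daily = df.resample("D").mean()  # coarsen the granularity from 1 hour to 1 day
    print(daily.head())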
===============================================================
License: MIT License, https://opensource.org/licenses/MIT
Data used in the paper 'Dimension-agnostic and granularity-based spatially variable gene identification using BSP'
===============================================================
Our study analyzes the limitations of Bluetooth-based trace acquisition initiatives carried out to date in terms of granularity and reliability. We then propose an optimal configuration for the acquisition of proximity traces and movement information using a fine-tuned Bluetooth system based on custom hardware. With this system and configuration, we have carried out an intensive human trace acquisition experiment resulting in a proximity and mobility database of more than 5 million traces with a minimum granularity of 5 s.
Contact: josemari.cabero@tecnalia.com
===============================================================
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
Our data portal allows you to download historical location-based electricity data with hourly granularity for free. Data includes consumption-based emissions factors from both direct operations and life cycle analysis (LCA) for the years 2021-2023. Electricity Maps wants to accelerate decarbonization by making carbon accounting easier and more accurate. The data portal empowers companies to do more accurate and granular carbon accounting by replacing yearly values with monthly, daily, or hourly ones.
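The value of hourly granularity can be illustrated with a small, hypothetical calculation (the column names and numbers below are made up): summing hourly consumption times hourly emission factors generally gives a different total than multiplying total consumption by a single average factor.

    import pandas as pd

    # Hypothetical hourly data: consumption in kWh, location-based emission factor in gCO2eq/kWh
    df = pd.DataFrame({
        "consumption_kwh": [120.0, 80.0, 200.0],
        "emission_factor_gco2_per_kwh": [450.0, 300.0, 520.0],
    })

    hourly_resolved = (df["consumption_kwh"] * df["emission_factor_gco2_per_kwh"]).sum()
    single_average = df["consumption_kwh"].sum() * df["emission_factor_gco2_per_kwh"].mean()

    print(f"hourly-resolved emissions: {hourly_resolved:.0f} gCO2eq")
    print(f"single average factor:     {single_average:.0f} gCO2eq")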
===============================================================
License: GNU General Public License v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset was created by Siddharth Nobell
Released under GPL 2
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of several real-world use cases in the field of Operational Data Analytics (ODA) for the improvement of reliability and energy efficiency. The datasets are composed of monitoring sensor data, acquired from the components of different HPC systems depending on the specific use case. Two tools, whose overhead is proven to be very light, were used to acquire data in HPC-ODA: the DCDB and LDMS monitoring frameworks.
The aim of HPC-ODA is to provide several vertical slices (here named segments) of the monitoring data available in a large-scale HPC installation. The segments all have different granularities, in terms of data sources and time scale, and provide several use cases on which models and approaches to data processing can be evaluated. While having a production dataset from a whole HPC system - from the infrastructure down to the CPU core level - at a fine time granularity would be ideal, this is often not feasible due to the confidentiality of the data, as well as the sheer amount of storage space required.
HPC-ODA includes 5 different segments:
- Power Consumption Prediction: a fine-granularity dataset collected from a single compute node in an HPC system. It contains both node-level data and per-CPU-core metrics, and can be used to perform regression tasks such as power consumption prediction.
- Fault Detection: a medium-granularity dataset collected from a single compute node while it was subjected to fault injection. It contains only node-level data, together with labels for the applications and faults being executed on the HPC node over time. This dataset can be used to perform fault classification.
- Application Classification: a medium-granularity dataset collected from 16 compute nodes in an HPC system while running different parallel MPI applications. Data is at the compute-node level, separated for each node, and is paired with the labels of the applications being executed. This dataset can be used for tasks such as application classification.
- Infrastructure Management: a coarse-granularity dataset containing cluster-wide data from an HPC system, covering its warm-water cooling system as well as power consumption. The data is at the rack level and can be used for regression tasks such as outlet water temperature or removed heat prediction.
- Cross-architecture: a medium-granularity dataset that is a variant of the Application Classification one and shares the same ODA use case. Here, however, single-node configurations of the applications were executed on three different compute node types with different CPU architectures. This dataset can be used to perform cross-architecture application classification, or performance comparison studies.
The HPC-ODA dataset collection includes a readme document containing all necessary usage information, as well as a lightweight Python framework to carry out the ODA tasks described for each dataset. An illustrative sketch of one such task is given below.
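The following is only an illustration of the power-consumption-prediction use case, not the bundled Python framework; the file name and column names are assumptions about a generic sensor-data layout.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical layout: one row per timestamp with node-level/per-core sensor readings
    # and a measured power column used as the regression target.
    df = pd.read_csv("power_segment.csv")           # placeholder file name
    X = df.drop(columns=["power_consumption"])      # placeholder target column
    y = df["power_consumption"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))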
===============================================================
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This project investigates retraction indexing agreement among data sources: BCI, BIOABS, CCC, Compendex, Crossref, GEOBASE, MEDLINE, PubMed, Retraction Watch, Scopus, and Web of Science Core. Post-retraction citation may be partly due to authors’ and publishers' challenges in systematically identifying retracted publications. To investigate retraction indexing quality, we investigate the agreement in indexing retracted publications between 11 database sources, restricting to their coverage, resulting in a union list of 85,392 unique items. This dataset highlights items that went through a DOI augmentation process to have PubMed added as a source and that have duplicated PMIDs, indicating data quality issues.
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data Origin
Samples were taken from customer taps. They were then analysed, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
We decided to share individual results at the lowest level of granularity.
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
- Water Supply Zone (WSZ) - Limits interoperability with other datasets
- Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
- Postal Sector - Deemed not granular enough in highly populated areas
- Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
- MSOA - Deemed not granular enough
- LSOA - Agreed as a recognised standard appropriate for England and Wales
- Data Zones - Agreed as a recognised standard appropriate for Scotland
Data Specifications
- Each dataset will cover a calendar year of samples
- This dataset will be published annually
- The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories.
Prior to undertaking analysis on any new instruments or utilising new analytical techniques, the laboratory undertakes validation of the equipment to ensure it continues to meet the regulatory requirements. This means that the limit of quantification may change for the method, either increasing or decreasing from the previous value. Any results below the limit of quantification will be reported as < with a number. For example, a limit of quantification change from <0.68 mg/l to <2.4 mg/l does not mean that there has been a deterioration in the quality of the water supplied.
Data Publishing Frequency
Annually
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset:
- Drinking Water Inspectorate Standards and Regulations
- Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics
- Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output (February 2024)
- Legislation history: Legislation - Drinking Water Inspectorate
- Information about lead pipes: Lead pipes and lead in your water - United Utilities
Dataset Schema
- SAMPLE_ID: Identity of the sample
- SAMPLE_DATE: The date the sample was taken
- DETERMINAND: The determinand being measured
- DWI_CODE: The corresponding DWI code for the determinand
- UNITS: The expression of results
- OPERATOR: The measurement operator for limit of detection
- RESULT: The test results
- LSOA: Lower Super Output Area (population weighted centroids used by the Office for National Statistics (ONS) for geo-anonymisation)
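As a small, hedged example of working with this schema (the file name is a placeholder; column names follow the schema above, and results reported as below the limit of quantification are excluded from the numeric summary):

    import pandas as pd

    samples = pd.read_csv("water_quality_samples.csv", parse_dates=["SAMPLE_DATE"])  # placeholder file name

    # Keep only results not reported as "<" (below the limit of quantification)
    numeric = samples[samples["OPERATOR"] != "<"].copy()
    numeric["RESULT"] = pd.to_numeric(numeric["RESULT"], errors="coerce")

    # Summarise results per LSOA and determinand
    summary = numeric.groupby(["LSOA", "DETERMINAND"])["RESULT"].agg(["count", "mean", "max"])
    print(summary.head())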
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
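For example, one of the provided CSV files can be loaded as follows (the file name here is a placeholder for whichever daily or hourly file you downloaded):

    import pandas as pd

    daily_fitbit = pd.read_csv("daily_fitbit.csv")  # placeholder file name
    print(daily_fitbit.shape)
    print(daily_fitbit.head())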
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools (which provide mongorestore) installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
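Once restored, the database can also be queried directly from Python with pymongo; this is a minimal sketch assuming the default local setup from the commands above (add credentials to the connection string if access control is enabled):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["rais_anonymized"]

    print(db.list_collection_names())  # expected: fitbit, sema, surveys
    print(db["fitbit"].find_one())     # inspect the structure of one document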
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
===============================================================
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
The dataset contains roughly 225,000 matches played in Age of Empires 2: Definitive Edition at different granularities, plus connected master data. The current version contains 3 levels:
- Match Level: features Match Id, Map, Map Size, Duration, Mean Elo, Civilizations, Starting Positions and Outcomes, with one row per game.
- Time Slice Level: contains the aggregated commands of type "Queue", "Build" and "Research" made until a certain time in the game, with one row per game and one file per time slice. The games are sliced into 120-second slices.
- Input Level: contains data about all decisions made in a game, with one row per input and one file per game.
The information was collected by scraping and parsing AoE2:DE matches using https://github.com/happyleavesaoc/aoc-mgz. The code for the underlying work can be found at https://github.com/nicoelbert/rtsgamestates.
Stay posted; for any questions, feel free to get in touch.
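As a purely illustrative sketch of what the match level supports (the file and column names below are hypothetical and may differ from the published files):

    import pandas as pd

    matches = pd.read_csv("matches.csv")  # hypothetical file name for the match-level table

    # Example match-level questions: average duration per map, and the Elo distribution
    print(matches.groupby("Map")["Duration"].mean().sort_values(ascending=False).head())
    print(matches["Mean Elo"].describe())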
===============================================================
Terms: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Time-Series Database for Network Telemetry market size in 2024 reached USD 1.23 billion, reflecting the rapid adoption of advanced database solutions for real-time network management. The market is experiencing robust expansion, with a CAGR of 19.7% projected over the forecast period. By 2033, the market is expected to attain a value of USD 5.94 billion, driven by the imperative need for scalable, high-performance data management platforms to support increasingly complex network infrastructures. The primary growth factor is the surge in network traffic, the proliferation of IoT devices, and the escalating demand for actionable network insights in real time.
A key driver behind the exponential growth of the Time-Series Database for Network Telemetry market is the unprecedented expansion of digital transformation initiatives across industries. Enterprises and service providers are generating massive volumes of telemetry data from network devices, applications, and endpoints. Traditional relational databases are ill-equipped to handle the high velocity and granularity of time-stamped data required for effective network telemetry. Time-series databases, purpose-built for this data type, enable organizations to ingest, process, and analyze millions of data points per second, facilitating proactive network management. The shift towards cloud-native architectures, edge computing, and the adoption of 5G networks further amplify the need for efficient telemetry data storage and analytics, reinforcing the critical role of time-series databases in modern network operations.
Another significant growth factor is the rising complexity of network environments, spurred by the advent of hybrid and multi-cloud deployments. As organizations embrace distributed infrastructures and software-defined networking, the challenge of monitoring, diagnosing, and optimizing network performance becomes more acute. Time-series databases for network telemetry empower IT teams with the ability to correlate historical and real-time data, detect anomalies, and automate fault management. This capability is particularly vital for sectors such as telecommunications, IT service providers, and large enterprises, where network downtime or performance degradation can have substantial financial and reputational repercussions. The integration of artificial intelligence and machine learning with time-series databases is also enabling advanced predictive analytics, further enhancing operational efficiency and network reliability.
The growing emphasis on network security and compliance is another pivotal factor fueling the adoption of time-series databases for network telemetry. With cyber threats becoming more sophisticated and regulatory requirements tightening, organizations must maintain comprehensive visibility into network activities and ensure rapid incident detection and response. Time-series databases provide the high-resolution data capture and retention necessary for security analytics, forensic investigations, and regulatory audits. As network telemetry evolves to encompass not only performance metrics but also security events and policy violations, the demand for scalable and secure time-series database solutions is expected to surge across both public and private sectors.
From a regional perspective, North America currently dominates the Time-Series Database for Network Telemetry market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology vendors, early adoption of advanced network management solutions, and substantial investments in digital infrastructure. However, the Asia Pacific region is poised for the fastest growth, with a projected CAGR of 22.4% through 2033, driven by rapid urbanization, expanding telecommunications networks, and increasing enterprise digitization. Europe and the Middle East & Africa are also witnessing steady growth, supported by government initiatives to modernize network infrastructure and enhance cybersecurity capabilities.
The Database Type segment of the Time-Series Database for Network Telemetry market is bifurcated into Open Source and Commercial solutions, each catering to distinct
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data was reported at 10,822.925 RMB mn in Mar 2025. This records an increase from the previous number of 9,197.844 RMB mn for Feb 2025. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data is updated monthly, averaging 7,378.321 RMB mn from Jan 2015 (Median) to Mar 2025, with 123 observations. The data reached an all-time high of 14,690.047 RMB mn in May 2021 and a record low of 1,856.803 RMB mn in Feb 2016. China Import: HS 8: Non-Agglomerated Iron Ores and Concentrates, Average Granularity<0.8mm data remains active status in CEIC and is reported by General Administration of Customs. The data is categorized under China Premium Database’s International Trade – Table CN.JKF: RMB: HS26: Ores, Slag and Ash.
===============================================================
States report information from two reporting populations: (1) the Served Population, which covers all youth receiving at least one independent living service paid for or provided by the Chafee Program agency, and (2) youth completing the NYTD Survey. States survey youth regarding six outcomes: financial self-sufficiency, experience with homelessness, educational attainment, positive connections with adults, high-risk behaviors, and access to health insurance. States collect outcomes information by conducting a survey of youth in foster care on or around their 17th birthday, also referred to as the baseline population. States will track these youth as they age and conduct a new outcome survey on or around the youth's 19th birthday, and again on or around the youth's 21st birthday, also referred to as the follow-up population. States will collect outcomes information on these older youth at ages 19 or 21 regardless of their foster care status or whether they are still receiving independent living services from the State. Depending on the size of the State's foster care youth population, some States may conduct a random sample of the baseline population of the 17-year-olds that participate in the outcomes survey so that they can follow a smaller group of youth as they age. All States will collect and report outcome information on a new baseline population cohort every three years.
Units of Response: Current and former youth in foster care
Type of Data: Survey
Tribal Data: No
Periodicity: Annual
Demographic Indicators: Ethnicity; Race; Sex
SORN: Not Applicable
Data Use Agreement: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm
Data Use Agreement Location: https://www.ndacan.acf.hhs.gov/datasets/order_forms/termsofuseagreement.pdf
Granularity: Individual
Spatial: United States
Geocoding: State
===============================================================
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This project provides the datasets required for the mKGR paper, including the original spatial vector files and the constructed knowledge graph files. You can directly download KnowledgeGraph.zip for training and validation, or download OriShapefile.zip to build the MKG from scratch.
Additionally, we provide two products generated by mKGR for the entirety of China: ChinaLandUse.gpkg and China15min.gpkg, which are the land-use map products of China and the 15-minute city walkability products of China, respectively.
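The two GeoPackage products can be opened with geopandas, for example (a minimal sketch; pass layer="..." if a file contains multiple layers):

    import geopandas as gpd

    land_use = gpd.read_file("ChinaLandUse.gpkg")    # land-use map product of China
    walkability = gpd.read_file("China15min.gpkg")   # 15-minute city walkability product of China
    print(land_use.head())
    print(walkability.crs)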