License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
A key issue in the analysis of many spatial processes is the choice of an appropriate scale for the analysis. Smaller geographical units are generally preferable for the study of human phenomena because they are less likely to cause heterogeneous groups to be conflated. However, it can be harder to obtain data for small units and small-number problems can frustrate quantitative analysis. This research presents a new approach that can be used to estimate the most appropriate scale at which to aggregate point data to areas.
Data and methods
The proposed method works by creating a number of regular grids with iteratively smaller cell sizes (increasing grid resolution) and estimating the similarity between two realisations of the point pattern at each resolution. The method is applied first to simulated point patterns and then to real publicly available crime data from the city of Vancouver, Canada. The crime types tested are residential burglary, commercial burglary, theft from vehicle and theft of bike.
Findings
The results provide evidence for the size of spatial unit that is the most appropriate for the different types of crime studied. Importantly, the results are dependent on both the number of events in the data and the degree of spatial clustering, so a single ‘appropriate’ scale is not identified. The method is nevertheless useful as a means of better estimating what spatial scale might be appropriate for a particular piece of analysis.
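The method description above lends itself to a short illustration. The sketch below is a minimal, hedged reading of it: points are split into two random halves, aggregated to grids of decreasing cell size, and the two count surfaces are compared at each resolution. The Spearman correlation used as the similarity measure and the random-halving scheme are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of the grid-resolution comparison described above.
import numpy as np
from scipy.stats import spearmanr

def grid_similarity(points, cell_sizes, rng=None):
    """For each cell size, split the points into two random halves,
    count events per grid cell, and correlate the two count surfaces."""
    rng = np.random.default_rng(rng)
    xmin, ymin = points.min(axis=0)
    xmax, ymax = points.max(axis=0)
    results = {}
    for size in cell_sizes:
        xbins = np.arange(xmin, xmax + size, size)
        ybins = np.arange(ymin, ymax + size, size)
        order = rng.permutation(len(points))
        a = points[order[: len(points) // 2]]
        b = points[order[len(points) // 2:]]
        counts_a, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=[xbins, ybins])
        counts_b, _, _ = np.histogram2d(b[:, 0], b[:, 1], bins=[xbins, ybins])
        rho, _ = spearmanr(counts_a.ravel(), counts_b.ravel())
        results[size] = rho
    return results

# Example: similarity typically falls as cells get smaller (counts get sparser).
pts = np.random.default_rng(0).uniform(0, 1000, size=(3000, 2))
print(grid_similarity(pts, cell_sizes=[500, 250, 100, 50]))
```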
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput multi-omics studies and corresponding network analyses of multi-omic data have rapidly expanded their impact over the last 10 years. As biological features of different types (e.g. transcripts, proteins, metabolites) interact within cellular systems, the greatest amount of knowledge can be gained from networks that incorporate multiple types of -omic data. However, biological and technical sources of variation diminish the ability to detect cross-type associations, yielding networks dominated by communities composed of nodes of the same type. We describe here network-building methods that maximize edges between nodes of different data types, leading to integrated networks: networks with a large number of edges that link nodes of different -omic types (transcripts, proteins, lipids, etc.). We systematically rank several network inference methods and demonstrate that, in many cases, a random forest method, GENIE3, produces the most integrated networks. This increase in integration does not come at the cost of accuracy, as GENIE3 produces networks of approximately the same quality as the other network inference methods tested here. Using GENIE3, we also infer networks representing antibody-mediated Dengue virus cell invasion and receptor-mediated Dengue virus invasion. A number of functional pathways showed centrality differences between the two networks, including genes responding to both GM-CSF and IL-4, which had higher centrality in the antibody-mediated than in the receptor-mediated Dengue network. Because a biological system involves the interplay of many different types of molecules, incorporating multiple data types into networks will improve their use as models of biological systems. The methods explored here are some of the first to specifically highlight and address the challenges associated with how such multi-omic networks can be assembled and how the greatest number of interactions can be inferred from different data types. The resulting networks can lead to the discovery of new host response patterns and interactions during viral infection, generate new hypotheses of pathogenic mechanisms, and confirm mechanisms of disease.
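As a rough illustration of the GENIE3 idea referenced above (not the GENIE3 package itself), the sketch below infers a directed weight matrix by predicting each feature from all others with a random forest and reading edge weights off the feature importances. The column names and toy data are invented.

```python
# Simplified re-implementation of the GENIE3-style idea: each feature (gene,
# protein, lipid, ...) is predicted from all others with a random forest, and
# the forest's feature importances are taken as directed edge weights.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_network(data: pd.DataFrame, n_trees: int = 100, seed: int = 0) -> pd.DataFrame:
    """data: samples x features. Returns a feature-by-feature edge weight matrix."""
    features = data.columns
    weights = pd.DataFrame(0.0, index=features, columns=features)
    for target in features:
        predictors = [f for f in features if f != target]
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(data[predictors].values, data[target].values)
        weights.loc[predictors, target] = model.feature_importances_
    return weights

# Toy multi-omic matrix: columns could mix transcripts, proteins and lipids.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 5)),
                   columns=["tx_A", "tx_B", "prot_C", "prot_D", "lipid_E"])
print(rf_network(toy).round(2))
```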
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
- Mixed date formats: 2024-01-15, 15/01/2024, 01/15/2024
- Inconsistent price formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Inconsistent gender coding: M, F, Male, Female, empty strings
- Mixed engine power units: 150 HP, 150hp, 150 CV, 111 kW, missing values
- Functions to practice: to_date() and date parsing functions, regexp_replace() for price cleaning, when().otherwise() conditional logic, cast() for data type conversions, fillna() and dropna() strategies
- Realistic insurance business rules implemented: age-based premium adjustments, geographic risk zone pricing, product-specific claim patterns, seasonal claim distributions, client lifecycle status transitions
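A minimal PySpark sketch of the cleaning steps listed above, using the functions named in the list; the column names (subscription_date, premium, gender) are placeholders, not the dataset's actual schema.

```python
# Hedged PySpark sketch of the cleaning tasks described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("insurance-cleaning").getOrCreate()
df = spark.read.csv("insurance.csv", header=True)

cleaned = (
    df
    # Parse the ISO date format; other formats would need coalesce() over
    # several to_date() attempts.
    .withColumn("subscription_date", F.to_date("subscription_date", "yyyy-MM-dd"))
    # Strip currency symbols and words before casting the premium to a number.
    .withColumn("premium",
                F.regexp_replace("premium", r"[€$]|EUR|euros|\s", "").cast("double"))
    # Normalise gender coding with conditional logic.
    .withColumn("gender",
                F.when(F.col("gender").isin("M", "Male"), "M")
                 .when(F.col("gender").isin("F", "Female"), "F")
                 .otherwise(None))
    # Drop rows with no premium and fill missing gender with a sentinel.
    .dropna(subset=["premium"])
    .fillna({"gender": "unknown"})
)
cleaned.show(5)
```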
Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.
This dataset combines comprehensive data from multiple sources, providing an integrated view of encryption techniques, user behavior patterns, privacy measures, and updated user profiles. It is designed for applications in data privacy, behavioral analysis, and user management.
1. Anonymization and Encryption Data:
Details on encryption types, algorithms, key lengths, and associated timestamps.
Useful for analyzing encryption standards and their effectiveness in anonymization.
2. Behavioral Data Collection:
Captures user behavior patterns, including types of behaviors, frequency, and duration.
Includes timestamps for trend analysis and anomaly detection.
3. Privacy Encryption Data:
Provides information on privacy types, encryption levels, and additional metadata.
Helps in evaluating the adequacy of privacy measures and encryption practices.
4. Updated User ID Dataset:
Contains updated user details, including unique IDs, names, phone numbers, and email addresses.
Acts as a reference for linking user profiles to behavioral and encryption data.
Applications:
Data Privacy and Security: Analyze encryption algorithms and privacy measures to ensure data protection.
Behavioral Analysis: Identify trends, patterns, and anomalies in user behavior over time.
User Management: Utilize user profiles for linking behaviors and encryption activities to individual identities.
Research and Development: Aid in developing robust systems for anonymization, privacy, and user analytics.
This dataset is structured for multi-purpose use cases, making it a valuable resource for researchers, data analysts, and developers working on privacy, security, and behavioral systems.
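As a simple illustration of the user-linking application described above, the pandas sketch below joins the updated user ID table to the behavioural and encryption tables; the file names and column names are assumptions, not the dataset's documented schema.

```python
# Illustrative linking of user profiles to behavioural and encryption records.
import pandas as pd

users = pd.read_csv("updated_user_ids.csv")                 # user_id, name, phone, email (assumed)
behavior = pd.read_csv("behavioral_data.csv")               # user_id, behavior_type, frequency, duration, timestamp (assumed)
encryption = pd.read_csv("anonymization_encryption.csv")    # user_id, encryption_type, algorithm, key_length, timestamp (assumed)

# Attach user profiles to each behavioural record, then to encryption events.
behavior_by_user = behavior.merge(users, on="user_id", how="left")
encryption_by_user = encryption.merge(users, on="user_id", how="left")

# Simple trend check: how often each behaviour type occurs per user.
print(behavior_by_user.groupby(["user_id", "behavior_type"]).size().head())
```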
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
*****Documentation Process*****
1. Data Preparation:
- Upload the data into Power Query to assess quality and identify duplicate values, if any.
- Verify data quality and types for each column, addressing any miswriting or inconsistencies.
2. Data Management:
- Duplicate the original data sheet for future reference and label the new sheet as the "Working File" to preserve the integrity of the original dataset.
3. Understanding Metrics:
- Clarify the meaning of column headers, particularly distinguishing between Impressions and Reach, and understand how Engagement Rate is calculated.
- Engagement Rate formula: total likes, comments, and shares divided by Reach.
4. Data Integrity Assurance:
- Recognize that Impressions should outnumber Reach, reflecting total views versus unique audience size.
- Investigate discrepancies between Reach and Impressions to ensure data integrity, identifying and resolving root causes for accurate reporting and analysis.
5. Data Correction:
- Collaborate with the relevant team to rectify data inaccuracies, specifically addressing the discrepancy between Impressions and Reach.
- Engage with the concerned team to understand the root cause of discrepancies between Impressions and Reach.
- Identify instances where Reach surpasses Impressions, potentially attributable to data transformation errors.
- Following the rectification process, adjust the dataset to reflect the corrected Impressions and Reach values accurately.
- Ensure diligent implementation of the corrections to maintain the integrity and reliability of the data.
- Recalculate the Engagement Rate post-correction, adhering to rigorous data integrity standards to uphold the credibility of the analysis.
6. Data Enhancement:
- Categorize Audience Age into three groups: "Senior Adults" (45+ years), "Mature Adults" (31-45 years), and "Adolescent Adults" (<30 years) within a new column named "Age Group."
- Split date and time into separate columns using the text-to-columns option for improved analysis.
7. Temporal Analysis:
- Introduce a new column for "Weekend and Weekday," renamed "Weekday Type," to discern patterns and trends in engagement.
- Define time periods by categorizing into "Morning," "Afternoon," "Evening," and "Night" based on time intervals.
8. Sentiment Analysis:
- Populate blank cells in the Sentiment column with "Mixed Sentiment," denoting content containing both positive and negative sentiments or ambiguity.
9. Geographical Analysis:
- Group countries and obtain additional continent data from an online source (e.g., https://statisticstimes.com/geography/countries-by-continents.php).
- Add a new column for "Audience Continent" and use the XLOOKUP function to retrieve corresponding continent data.
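The workbook steps above are described for Excel/Power Query; the snippet below sketches a pandas equivalent of the Engagement Rate formula and the grouping steps (6 and 7), assuming hypothetical column names (likes, comments, shares, reach, audience_age, post_datetime) rather than the workbook's actual headers.

```python
# Hedged pandas equivalent of a few of the documented Excel steps.
import pandas as pd

df = pd.read_csv("social_media_posts.csv", parse_dates=["post_datetime"])

# Engagement Rate = (likes + comments + shares) / Reach
df["engagement_rate"] = (df["likes"] + df["comments"] + df["shares"]) / df["reach"]

# Age groups as described in step 6.
df["age_group"] = pd.cut(df["audience_age"],
                         bins=[0, 30, 45, 200],
                         labels=["Adolescent Adults", "Mature Adults", "Senior Adults"])

# Weekday type and time-of-day period (step 7).
df["weekday_type"] = df["post_datetime"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday")
df["time_period"] = pd.cut(df["post_datetime"].dt.hour,
                           bins=[-1, 5, 11, 17, 23],
                           labels=["Night", "Morning", "Afternoon", "Evening"])
print(df[["engagement_rate", "age_group", "weekday_type", "time_period"]].head())
```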
*****Drawing Conclusions and Providing a Summary*****
Supplemental Figures: This file includes three supplemental figures which are related to the paper. The figure legends are given below each figure. Files: Spplemental Figure.doc, Spplemental Figure.pdf.
Supplemental Tables: This file includes four supplemental tables which are relevant to the paper. Supplemental Table 1 lists the locus name, GenBank accession number, primer sequence and annealing temperature for all loci studied in this research. Supplemental Table 2 covers the diversities revealed in the pairwise comparison, including extended and single deletion, insertion, SNP, polymorphic sites, and polymorphic base pairs. Supplemental Table 3 covers SNPs, polymorphic sites, and polymorphic base pairs between the A and D ancestral genomes. Supplemental Table 4 covers diversities revealed in the three-way comparison, including extended and single deletion, insertion, SNP, polymorphic sites, and polymorphic base pairs.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset provides a detailed overview of fuel consumption and CO2 emissions for various ship types operating in Nigerian waterways. It includes data on ship types, routes, engine efficiency, fuel consumption, month, and emissions, making it suitable for environmental impact studies, maritime operations optimization, and predictive modeling.
Documentation for Ship Fuel Consumption & CO2 Emission Analysis
This project analyzes fuel consumption and CO2 emissions of various ship types operating in Nigerian waterways. By exploring the fuel efficiency and environmental impact of these vessels, we aim to provide actionable insights for optimizing maritime operations and reducing emissions.
The dataset used in this project contains the following columns:
- Ship Type: Categorizes ships into four main types: Fishing Trawler, Oil Service Boat, Surfer Boat, and Tanker Ship.
- Fuel Consumption (Liters): The total fuel consumed by each ship type during operations.
- CO2 Emission (Kg): The amount of carbon dioxide emitted based on fuel consumption.
- Other Variables: Supporting data used for exploratory analysis.
This dataset was generated to simulate realistic maritime operations in Nigeria, taking into account common ship types, fuel usage patterns, and emissions.
Predictive Analysis: A machine learning model can be explored to predict fuel consumption and CO2 emissions based on ship type and operational factors. This would aid in forecasting and planning for greener maritime logistics.
Visualization: Key visualizations in this project include:
- Bar Charts: Compare average fuel consumption and CO2 emissions across ship types.
- Correlation Matrix: Highlights the strong relationship between fuel and emissions.
- ANOVA Plots: Illustrate statistical differences between groups.
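For readers reproducing the analysis, the sketch below shows one way the correlation and ANOVA mentioned above could be computed with pandas and SciPy; the column names (ship_type, fuel_consumption, co2_emission) are placeholders rather than the dataset's exact headers.

```python
# Illustrative correlation and one-way ANOVA across ship types.
import pandas as pd
from scipy import stats

df = pd.read_csv("ship_fuel_emissions.csv")

# Correlation between fuel consumption and CO2 emissions.
print(df[["fuel_consumption", "co2_emission"]].corr())

# One-way ANOVA: does mean fuel consumption differ across ship types?
groups = [g["fuel_consumption"].values for _, g in df.groupby("ship_type")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA F={f_stat:.2f}, p={p_value:.4f}")
```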
Usage: This dataset and project are valuable for:
- Maritime operators looking to optimize fuel efficiency.
- Environmental agencies monitoring CO2 emissions.
- Data scientists exploring use cases in transportation and environmental sustainability.
Files Included:
1. Dataset: The raw data used for analysis.
2. Jupyter Notebook: Contains the complete code for data cleaning, analysis, and visualization.
3. Images: Realistic representations of the ship types analyzed.
4. Documentation: This document for reference.
Acknowledgments: This project was developed by FIJAB J. ADEKUNLE as part of a portfolio project in data analysis. Special thanks to the Kaggle platform for hosting the dataset and analysis.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample data (five types of features of one participant)
License: ELRA End User Licence, http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications; it is available in XML or SGML.
- Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.
- Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.
- Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.
- Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.
- Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.
Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.
This dataset provides supporting materials for the publication "Crop type classification, trends, and patterns of central California agricultural fields from 2005 – 2020". This data release comprises two child datasets. The first dataset, 'Labeled_CropType_Points', is a shapefile that consists of randomly selected point locations at which crop types were verified using high-resolution imagery for each examined year across the study period (2005 - 2020). The second dataset, 'Central_CA_Classified_Croplands', is also a shapefile, but contains polygons of 9 classified crop types derived from a random forest machine learning classifier for central California for each examined year across the study period (2005 - 2020).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Recent studies have pointed to the existence of two El Niño (EN) types: Eastern Pacific or Canonical (EP) EN and Central Pacific or Modoki (CP) EN. In the present study, observed and simulated data from three models of the Coupled Model Intercomparison Project phase 5 (CMIP5) were used to evaluate the impacts of the two EN types on South American precipitation from June-August of the EN onset year to March-May of the following year. The Centre National de Recherches Météorologiques (CNRM-CM5) model presented the best performance in reproducing the observed SST anomaly patterns for the CP and EP EN types. The observed precipitation anomaly pattern associated with EN events was best represented during the austral summer. In the case of the EP EN, this pattern features wetness (dryness) in southeastern (northern-northwestern) South America. The CNRM-CM5 and Hadley Centre Global Environmental Model (HadGEM2-ES) models reproduced this pattern. The Max Planck Institute Earth System Model (MPI-ESM-LR) model reproduced the dryness over the north of the continent, but not the rainfall increase in the southeast or the rainfall reduction in the northwest. In the case of the CP EN, the observed impact on South American rainfall during the austral summer featured rainfall scarcity (excess) in northern and northwestern (southeastern) South America. The models reproduced this pattern; however, the HadGEM2-ES and MPI-ESM-LR models showed lower rainfall over northeastern Brazil than observed. Differences in EN teleconnections explain the differences between the simulated patterns.
Patterns and Limitations of Urban Human Mobility Resilience under the Influence of Multiple Types of Natural Disaster (Original Data). The file includes the location data from 15 natural disaster events that are used for this research.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The parameter a is chosen so that each point pattern contains approximately 3,000 points. Parameters b and c determine the amount of clustering; larger numbers produce more clustering.
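The generator behind these simulated patterns is not specified here; the sketch below shows one common way such clustered point patterns are produced (a Thomas-style parent/offspring process), with the mapping of its parameters onto a, b, and c left as an illustrative assumption.

```python
# Hedged sketch of a generic clustered point-pattern simulator. The dataset's
# actual generator is not described above; this is purely illustrative.
import numpy as np

def thomas_pattern(n_target=3000, n_parents=100, spread=0.01, rng=None):
    """Simulate ~n_target points in the unit square as Gaussian clusters
    around uniformly placed parent points; fewer parents or a smaller
    spread produce stronger clustering."""
    rng = np.random.default_rng(rng)
    parents = rng.uniform(0, 1, size=(n_parents, 2))
    per_parent = rng.poisson(n_target / n_parents, size=n_parents)
    offspring = [rng.normal(p, spread, size=(k, 2)) for p, k in zip(parents, per_parent)]
    pts = np.vstack(offspring)
    # Keep only points inside the unit-square observation window.
    return pts[(pts >= 0).all(axis=1) & (pts <= 1).all(axis=1)]

pts = thomas_pattern(rng=0)
print(len(pts), "points generated")
```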
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total period of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed. For the Fitbit data, run the following: mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
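Following the import instructions above, the snippet below shows two hedged ways to access the data from Python: reading one of the CSV exports with pandas.read_csv(), and querying the restored MongoDB database with pymongo. The CSV file name is a guess, and pymongo is not mentioned in the original instructions.

```python
# Two quick ways to load LifeSnaps data after following the steps above.
import pandas as pd
from pymongo import MongoClient

# Option 1: daily/hourly CSV exports (file name is a hypothetical placeholder).
daily = pd.read_csv("daily_fitbit_sema_df.csv")
print(daily.shape)

# Option 2: the restored MongoDB database (after running mongorestore).
client = MongoClient("localhost", 27017)
fitbit = client["rais_anonymized"]["fitbit"]
print(fitbit.estimated_document_count())
```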
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
According to the Motor Simulation Theory, cognitive states such as kinesthetic motor imagery activate the motor system in a similar way to overt motor execution. Action simulation involved in motor imagery can be implicitly triggered when individuals unconsciously simulate an action, as is the case in Hand Laterality Judgement Task (HLJT). Studies employing the HLJT often use various depictions of hands, which may potentially influence behavioural measures such as response times. The present study recruited 70 younger adults who mentally simulated both realistic and line drawing representations of hands using the HLJT. The results indicated that (1) mental transformations were quicker with line drawing depictions than with realistic hands, (2) faster response times were observed for the back of the hand compared to the palm, and (3) when comparing line drawings to real hands, quicker response times were noted for 0° and 90°L orientations. The results suggest that when compared to line drawings, realistic hands have slower response times for both simple (0°) and challenging (90°L) mental transformations. Overall, behavioural measures may vary between realistic hands and line drawings, underscoring the importance of considering this distinction when utilizing the HLJT.
License: All rights reserved, https://www.rioxx.net/licenses/all-rights-reserved/
Data for graph Illus. 4.31. Pottery deposition patterns in different context types (for comparison with Hayton/Shiptonthorpe) by Phase, shown by relative frequencies of sherds. Context types used are those in common with Hayton and Shiptonthorpe publications.
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
This three-band, 30-m resolution raster contains sagebrush vegetation types, soil temperature/moisture regime classes, and large fire frequencies across greater sage-grouse population areas within the Colorado Plateau sage-grouse management zone. Sagebrush vegetation types were defined by grouping together similar vegetation types from the LANDFIRE biophysical settings layer. Soil moisture and temperature regimes were from a USDA-NRCS analysis of soil types across the greater sage-grouse range. Fire frequencies were derived from fire severity rasters created by the Monitoring Trends in Burn Severity program. The area of analysis included the greater sage-grouse population areas within specific management zones. Methods used to derive these data are detailed in the report [Brooks, M.L., Matchett, J.R., Shinneman, D.J., and Coates, P.S., 2015, Fire patterns in the range of greater sage-grouse, 1984-2013; Implications for conservation and management: U.S. Geological Survey Open-Fil ...
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Overgrazing and climate change are the main causes of grassland degradation, and grazing exclusion is one of the most common measures for restoring degraded grasslands worldwide. Soil fungi can respond rapidly to environmental stresses, but the response of different grassland types to grazing control has not been uniformly determined. Three grassland types (temperate desert, temperate steppe grassland, and mountain meadow) that had been closed to grazing for nine years were used to study the effects of grazing exclusion on soil nutrients as well as fungal community structure in the three grassland types. The results showed that (1) in the 0–5 cm soil layer, grazing exclusion significantly affected the soil water content of the three grassland types (P<0.05), and the pH, total phosphorus (TP) and nitrogen-to-phosphorus ratio (N/P) changed significantly in all three grassland types (P<0.05). Significant changes in soil nutrients in the 5–10 cm soil layer after grazing exclusion occurred in the mountain meadow grasslands (P<0.05), but not in the temperate desert and temperate steppe grasslands. (2) For the different grassland types, Archaeorhizomycetes was most abundant in the mountain meadows, and Dothideomycetes was most abundant in the temperate desert grasslands, where it was significantly more abundant than in the remaining two grassland types (P<0.05). Grazing exclusion led to insignificant changes in the dominant soil fungal phyla and in α diversity but significant changes in the β diversity of soil fungi (P<0.05). (3) Grazing exclusion areas have higher mean clustering coefficients and modularity classes than grazed areas. In particular, the highest modularity class is found in temperate steppe grassland grazing exclusion areas. (4) We also found that pH is the main driving factor affecting soil fungal community structure, that plant coverage is a key environmental factor affecting soil community composition, and that grazing exclusion indirectly affects soil fungal communities by affecting soil nutrients. The above results suggest that grazing exclusion may regulate microbial ecological processes by changing the soil fungal β diversity in the three grassland types. Grazing exclusion is not conducive to the recovery of soil nutrients in mountain meadow areas but improves the stability of soil fungi in temperate steppe grassland. Therefore, the type of degraded grassland should be considered when formulating suitable restoration programmes and implementing grazing exclusion measures. The results of this study provide new insights into the response of soil fungal communities to grazing exclusion, providing a theoretical basis for the management of degraded grassland restoration.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Various types of pattern formation and self-organization phenomena can be observed in biological, chemical, and geochemical systems due to the interaction of reaction with diffusion. The appearance of static precipitation patterns was reported first by Liesegang in 1896. Traveling waves and dynamically changing patterns can also exist in reaction-diffusion systems: the Belousov-Zhabotinsky reaction provides a classical example for these phenomena. Until now, no experimental evidence had been found for the presence of such dynamical patterns in precipitation systems. Pattern formation phenomena, as a result of precipitation front coupling with traveling waves, are investigated in a new simple reaction-diffusion system that is based on the precipitation and complex formation of aluminum hydroxide. A unique kind of self-organization, the spontaneous appearance of traveling waves, and spiral formation inside a precipitation front is reported. The newly designed system is a simple one (we need just two inorganic reactants, and the experimental setup is simple), in which dynamically changing pattern formation can be observed. This work could show a new perspective in precipitation pattern formation and geochemical self-organization.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dynamics of societal material stocks such as buildings and infrastructures and their spatial patterns drive surging resource use and emissions. Two main types of data are currently used to map stocks, night-time lights (NTL) from Earth-observing (EO) satellites and cadastral information. We present an alternative approach for broad-scale material stock mapping based on freely available high-resolution EO imagery and OpenStreetMap data. Maps of built-up surface area, building height, and building types were derived from optical Sentinel-2 and radar Sentinel-1 satellite data to map patterns of material stocks for Austria and Germany. Using material intensity factors, we calculated the mass of different types of buildings and infrastructures, distinguishing eight types of materials, at 10 m spatial resolution. The total mass of buildings and infrastructures in 2018 amounted to ∼5 Gt in Austria and ∼38 Gt in Germany (AT: ∼540 t/cap, DE: ∼450 t/cap). Cross-checks with independent data sources at various scales suggested that the method may yield more complete results than other data sources but could not rule out possible overestimations. The method yields thematic differentiations not possible with NTL, avoids the use of costly cadastral data, and is suitable for mapping larger areas and tracing trends over time.
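To make the material-intensity step concrete, the toy sketch below multiplies built-up volume per pixel by per-type intensity factors and sums by material; the factor values and the two building types are invented placeholders, not the study's published coefficients.

```python
# Illustrative material-stock calculation: stock mass = built-up volume x a
# material intensity factor, accumulated by building type and material.
import numpy as np

# 10 m pixels: built-up surface fraction, building height (m), and type id.
surface = np.array([[0.8, 0.5], [0.0, 1.0]])   # fraction of pixel built up
height = np.array([[9.0, 6.0], [0.0, 15.0]])   # mean building height in m
btype = np.array([[0, 1], [0, 1]])             # 0 = residential, 1 = commercial (assumed)

pixel_area = 10 * 10                           # m^2
volume = surface * height * pixel_area         # built-up volume per pixel (m^3)

# Hypothetical intensity factors in tonnes of material per m^3 of built volume.
intensity = {0: {"concrete": 0.30, "brick": 0.10},
             1: {"concrete": 0.35, "steel": 0.05}}

stock = {}
for t, factors in intensity.items():
    vol_t = np.where(btype == t, volume, 0.0).sum()
    for material, f in factors.items():
        stock[material] = stock.get(material, 0.0) + vol_t * f
print(stock)   # tonnes per material across the toy raster
```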