There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect of outlier identification method on the numbers of sites where outliers are identified, for the Post-Infant Dentition data-subset.
There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single _location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one _location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Dataset Card for "cifar100-outlier"
📚 This dataset is an enriched version of the CIFAR-100 Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.
Explore the Dataset
The open source data curation tool Renumics Spotlight allows you to explorer this dataset. You can find a Hugging Face Space running Spotlight with this dataset here: https://huggingface.co/spaces/renumics/cifar100-outlier.
Or you can… See the full description on the dataset page: https://huggingface.co/datasets/renumics/cifar100-outlier.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES KANISHKA BHADURI, KAMALIKA DAS, AND PETR VOTAVA** Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets ate physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of physical illnesses that are linked with obesity and inactivity. Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to:- Asthma (in persons of all ages)- Cancer (in persons of all ages)- Chronic kidney disease (in adults aged 18+)- Coronary heart disease (in persons of all ages)- Diabetes mellitus (in persons aged 17+)- Hypertension (in persons of all ages)- Stroke and transient ischaemic attack (in persons of all ages)This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.For each of the above illnesses, the percentage of each MSOA’s population with that illness was estimated. This was achieved by calculating a weighted average based on:- The percentage of the MSOA area that was covered by each GP practice’s catchment area- Of the GPs that covered part of that MSOA: the percentage of patients registered with each GP that have that illnessThe estimated percentage of each MSOA’s population with each illness was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with each illness, within the relevant age range.For each illness, each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have that illnessB) the NUMBER of people within that MSOA who are estimated to have that illnessAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA predicted to have that illness, compared to other MSOAs. In other words, those are areas where a large number of people are predicted to suffer from an illness, and where those people make up a large percentage of the population, indicating there is a real issue with that illness within the population and the investment of resources to address that issue could have the greatest benefits.The scores for each of the 7 illnesses were added together then converted to a relative score between 1 – 0 (1 = worst, 0 = best), to give an overall score for each MSOA: a score close to 1 would indicate that an area has high predicted levels of all obesity/inactivity-related illnesses, and these are areas where the local population could benefit the most from interventions to address those illnesses. A score close to 0 would indicate very low predicted levels of obesity/inactivity-related illnesses and therefore interventions might not be required.LIMITATIONS1. GPs do not have catchments that are mutually exclusive from each other: they overlap, with some geographic areas being covered by 30+ practices. This dataset should be viewed in combination with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset to identify where there are areas that are covered by multiple GP practices but at least one of those GP practices did not provide data. Results of the analysis in these areas should be interpreted with caution, particularly if the levels of obesity/inactivity-related illnesses appear to be significantly lower than the immediate surrounding areas.2. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).3. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.4. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of obesity/inactivity-related illnesses, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of these illnesses. TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:- Health and wellbeing statistics (GP-level, England): Missing data and potential outliersDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
We present a set of novel algorithms which we call sequenceMiner, that detect and characterize anomalies in large sets of high-dimensional symbol sequences that arise from recordings of switch sensors in the cockpits of commercial airliners. While the algorithms we present are general and domain-independent, we focus on a specific problem that is critical to determining system-wide health of a fleet of aircraft. The approach taken uses unsupervised clustering of sequences using the normalized length of he longest common subsequence (nLCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. In this method, an outlier sequence is defined as a sequence that is far away from a cluster. We present new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence is deemed to be an outlier. The algorithm provides a coherent description to an analyst of the anomalies in the sequence when compared to more normal sequences.
The final section of the paper demonstrates the effectiveness of sequenceMiner for anomaly detection on a real set of discrete sequence data from a fleet of commercial airliners.
We show that sequenceMiner discovers actionable and operationally significant safety events. We also compare our innovations with standard HiddenMarkov Models, and show that our methods are superior
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater sample collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on reason(s) for considering a concentration as an outlier are included.
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geostatistics analyzes and predicts the values associated with spatial or spatial-temporal phenomena. It incorporates the spatial (and in some cases temporal) coordinates of the data within the analyses. It is a practical means of describing spatial patterns and interpolating values for locations where samples were not taken (and measures the uncertainty of those values, which is critical to informed decision making). This archive contains results of geostatistical analysis of COVID-19 case counts for all available US counties. Test results were obtained with ArcGIS Pro (ESRI). Sources are state health departments, which are scraped and aggregated by the Johns Hopkins Coronavirus Resource Center and then pre-processed by MappingSupport.com.
This update of the Zenodo dataset (version 6) consists of three compressed archives containing geostatistical analyses of SARS-CoV-2 testing data. This dataset utilizes many of the geostatistical techniques used in previous versions of this Zenodo archive, but has been significantly expanded to include analyses of up-to-date U.S. COVID-19 case data (from March 24th to September 8th, 2020):
Archive #1: “1.Geostat. Space-Time analysis of SARS-CoV-2 in the US (Mar24-Sept6).zip” – results of a geostatistical analysis of COVID-19 cases incorporating spatially-weighted hotspots that are conserved over one-week timespans. Results are reported starting from when U.S. COVID-19 case data first became available (March 24th, 2020) for 25 consecutive 1-week intervals (March 24th through to September 6th, 2020). Hotspots, where found, are reported in each individual state, rather than the entire continental United States.
Archive #2: "2.Geostat. Spatial analysis of SARS-CoV-2 in the US (Mar24-Sept8).zip" – the results from geostatistical spatial analyses only of corrected COVID-19 case data for the continental United States, spanning the period from March 24th through September 8th, 2020. The geostatistical techniques utilized in this archive includes ‘Hot Spot’ analysis and ‘Cluster and Outlier’ analysis.
Archive #3: "3.Kriging and Densification of SARS-CoV-2 in LA and MA.zip" – this dataset provides preliminary kriging and densification analysis of COVID-19 case data for certain dates within the U.S. states of Louisiana and Massachusetts.
These archives consist of map files (as both static images and as animations) and data files (including text files which contain the underlying data of said map files [where applicable]) which were generated when performing the following Geostatistical analyses: Hot Spot analysis (Getis-Ord Gi*) [‘Archive #1’: consecutive weeklong Space-Time Hot Spot analysis; ‘Archive #2’: daily Hot Spot Analysis], Cluster and Outlier analysis (Anselin Local Moran's I) [‘Archive #2’], Spatial Autocorrelation (Global Moran's I) [‘Archive #2’], and point-to-point comparisons with Kriging and Densification analysis [‘Archive #3’].
The Word document provided ("Description-of-Archive.Updated-Geostatistical-Analysis-of-SARS-CoV-2 (version 6).docx") details the contents of each file and folder within these three archives and gives general interpretations of these results.
Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink†problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module of to select the households for official interview of the VHFPS survey and the small module households as reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections
Section 2. Behavior Section 3. Health Section 5. Employment (main respondent) Section 6. Coping Section 7. Safety Nets Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include available interviewers’ note following each question item, interviewers’ note at the end of the tablet form as well as supervisors’ note during monitoring. The data cleaning process was conducted in following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ note and make adjustment accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down respondents’ answer in detail so that the survey management team will justify and make a decision which code is best suitable for such answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write texts to specify the answer. The data cleaning team checked thoroughly this type of answers to decide whether each answer needed recoding into one of the available categories or just keep the answer originally recorded. In some cases, that answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine data accuracy of outlier values, defined as values that lie outside both 5th and 95th percentiles, by listening to interview recordings.
• Final check on matching main dataset with different sections, where information is asked on individual level, are kept in separate data files and in long form.
• Label variables using the full question text.
• Label variable values where necessary.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.
Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.
Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet
Pivot tables were created to summarise data, and create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables', 'summaries'.
Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Hydrochemistry QA QC GAL v02
Derived From QLD DNRM Galilee Mine Groundwater Bores - Water Levels
Derived From Galilee bore water levels v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of diabetes mellitus in persons (aged 17+). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to diabetes mellitus in persons (aged 17+).This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.The percentage of each MSOA’s population (aged 17+) with diabetes mellitus was estimated. This was achieved by calculating a weighted average based on:The percentage of the MSOA area that was covered by each GP practice’s catchment areaOf the GPs that covered part of that MSOA: the percentage of registered patients that have that illness The estimated percentage of each MSOA’s population with diabetes mellitus was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with depression, within the relevant age range.Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have diabetes mellitusB) the NUMBER of people within that MSOA who are estimated to have diabetes mellitusAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have diabetes mellitus, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from diabetes mellitus, and where those people make up a large percentage of the population, indicating there is a real issue with diabetes mellitus within the population and the investment of resources to address that issue could have the greatest benefits.LIMITATIONS1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of diabetes mellitus, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of diabetes mellitus.TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:Health and wellbeing statistics (GP-level, England): Missing data and potential outliersLevels of obesity, inactivity and associated illnesses (England): Missing dataDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE Scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were representative of the new scoring system and therefore were not able to be compared to the older scores based on raw score. After removal of these 96 records from our analyses, a total of 420 student records remained which included students that were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency in the participants, we removed 100 additional records so that our analyses only considered students that had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below). Outliers We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed. Sample See detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality regarding outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables were also observed such as gender, race, ethnicity, and citizenship status within the samples. Predictive Metrics. The input variables used in this study were GPA and scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests. Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student’s candidacy examination result.
SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of stroke and transient ischaemic attack (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to stroke and transient ischaemic attack (in persons of all ages).This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.The percentage of each MSOA’s population (all ages) to have suffered a stroke or transient ischaemic attack was estimated. This was achieved by calculating a weighted average based on:The percentage of the MSOA area that was covered by each GP practice’s catchment areaOf the GPs that covered part of that MSOA: the percentage of registered patients that have that illness The estimated percentage of each MSOA’s population to have suffered a stroke or transient ischaemic attack was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA who have suffered a stroke or transient ischaemic attack, within the relevant age range.Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have had a stroke or transient ischaemic attackB) the NUMBER of people within that MSOA who are estimated to have had a stroke or transient ischaemic attackAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have had a stroke or transient ischaemic attack, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from stroke and transient ischaemic attack, and where those people make up a large percentage of the population, indicating there is a real issue with stroke and transient ischaemic attack within the population and the investment of resources to address that issue could have the greatest benefits.LIMITATIONS1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of stroke and transient ischaemic attack, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of stroke and transient ischaemic attack.TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:Health and wellbeing statistics (GP-level, England): Missing data and potential outliersLevels of obesity, inactivity and associated illnesses (England): Missing dataDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
Anadromous Atlantic salmon (Salmo salar) is a species of major conservation and management concern in North America, where population abundance has been declining over the past 30 years. Effective conservation actions require the delineation of conservation units to appropriately reflect the spatial scale of intraspecific variation and local adaptation. Towards this goal, we used the most comprehensive genetic and genomic database for Atlantic salmon to date, covering the entire North American range of the species. The database included microsatellite data from 9,142 individuals from 149 sampling locations and data from a medium-density SNP array providing genotypes for >3,000 SNPs for 50 sampling locations. We used neutral and putatively selected loci to integrate adaptive information in the definition of conservation units. Bayesian clustering with the microsatellite dataset and with neutral SNPs identified regional groupings largely consistent with previously published regional as...
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).