34 datasets found
  1. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Dataset provided by
    figshare
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the Outliers Detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

    We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such that their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

    Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

    See the following numbers for a complete list of the statistics of each dataset:

    Name    | Instances | Dimensions | Number of Outliers in %
    MNIST_0 | 7594      | 784        | 10
    MNIST_1 | 8665      | 784        | 10
    MNIST_2 | 7689      | 784        | 10
    MNIST_3 | 7856      | 784        | 10
    MNIST_4 | 7507      | 784        | 10
    MNIST_5 | 6945      | 784        | 10
    MNIST_6 | 7564      | 784        | 10
    MNIST_7 | 8023      | 784        | 10
    MNIST_8 | 7508      | 784        | 10
    MNIST_9 | 7654      | 784        | 10
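
    The generation procedure described for MNIST4OD can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the rounding of the outlier count, the column order (class label second to last) and the toy 2 x 2 "images" are all assumptions made for the example.

```python
import random

def flatten(image):
    """Flatten a 2-D image (list of rows) into a single vector."""
    return [pixel for row in image for pixel in row]

def build_outlier_dataset(samples, inlier_class, outlier_ratio=0.10, seed=0):
    """samples: list of (vector, label) pairs.

    Keeps every sample of `inlier_class` as an inlier and draws outliers
    uniformly at random from all remaining samples, so that the number of
    outliers is about `outlier_ratio` times the number of inliers.
    Each returned row is vector + [original_class, "yes"/"no"].
    """
    rng = random.Random(seed)
    inliers = [(v, y) for v, y in samples if y == inlier_class]
    others = [(v, y) for v, y in samples if y != inlier_class]
    # Rounding rule is an assumption; the description only states "10%".
    n_outliers = min(len(others), round(outlier_ratio * len(inliers)))
    outliers = rng.sample(others, n_outliers)
    rows = [list(v) + [y, "no"] for v, y in inliers]
    rows += [list(v) + [y, "yes"] for v, y in outliers]
    rng.shuffle(rows)
    return rows

# Toy data: 20 tiny "images" per digit class, standing in for MNIST.
toy = [([[d, d], [d, d]], d) for d in range(10) for _ in range(20)]
rows = build_outlier_dataset([(flatten(img), y) for img, y in toy], inlier_class=1)
print(len(rows), sum(r[-1] == "yes" for r in rows))
```

    Repeating the call for each `inlier_class` from 0 to 9 would give one dataset per digit, analogous to the MNIST_x files listed above.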

  2. Data from: Error and anomaly detection for intra-participant time-series...

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
    Dataset provided by
    Taylor & Francis
    Authors
    David R. Mullineaux; Gareth Irwin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
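
    As a rough illustration of the stage 1 idea (flagging cycles that are one-dimensional outliers at any time point using the median absolute deviation), a minimal sketch might look like this. It is not the supplied Matlab code: the function name, the fixed threshold `k` (used here in place of the t-statistic-based scaling) and the 1.4826 consistency constant are assumptions for the example, and stage 2 (the moving-window standard deviation) is not shown.

```python
from statistics import median

def mad_outlier_cycles(cycles, k=3.0):
    """cycles: list of equal-length lists (one list of values per cycle).

    Returns the sorted indices of cycles that deviate from the per-time-point
    median by more than k scaled MADs at one or more time points.
    """
    n_points = len(cycles[0])
    flagged = set()
    for t in range(n_points):
        values = [cycle[t] for cycle in cycles]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread at this time point; nothing to scale by
        scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
        for i, v in enumerate(values):
            if abs(v - med) / scale > k:
                flagged.add(i)
    return sorted(flagged)

cycles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [1.0, 2.0, 3.0],
    [1.05, 9.0, 3.0],  # spike at the second time point
]
print(mad_outlier_cycles(cycles))
```

    As the description notes, any such threshold should be tuned to the data set, and flagged cycles inspected before deciding whether to remove them.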

  3. KMASH Data Repository for outlier detection

    • research-repository.rmit.edu.au
    • researchdata.edu.au
    • +1more
    zip
    Updated May 30, 2023
    Cite
    Sevvandi Kandanaarachchi; Mario Andres Munoz Acosta; Kate Smith-Miles; Rob J Hyndman (2023). KMASH Data Repository for outlier detection [Dataset]. http://doi.org/10.26180/5c6253c0b3323
    Dataset provided by
    RMIT University
    Authors
    Sevvandi Kandanaarachchi; Mario Andres Munoz Acosta; Kate Smith-Miles; Rob J Hyndman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The zip files contain 12338 datasets for outlier detection investigated in the following papers:
    (1) Instance space analysis for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Kate Smith-Miles
    (2) On normalization and algorithm selection for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles

    Some of these datasets were originally discussed in the paper: On the evaluation of unsupervised outlier detection: measures, datasets and an empirical study. Authors: G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova, E. Schubert, I. Assent, M. E. Houle.

  4. Data from: A Diagnostic Procedure for Detecting Outliers in Linear...

    • tandf.figshare.com
    • figshare.com
    txt
    Updated Feb 9, 2024
    Cite
    Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1
    Dataset provided by
    Taylor & Francis
    Authors
    Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.

  5. Data from: Statistical context dictates the relationship between...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 21, 2019
    Cite
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
    Dataset provided by
    Dryad
    Authors
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
    Time period covered
    2019
    Description

    • 201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
    • 203_Cannon_FILT_altLow_STIM.mat: Cleaned EEG data from participant 203
    • 204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
    • 205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
    • 206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
    • 207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
    • 210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
    • 211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
    • 212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
    • 213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
    • 214_Cannon_FILT_altLow_STIM.mat
    • 215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
    • 216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
    • 229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
    • 233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...

  6. Asthma (in persons of all ages): England

    • data.catchmentbasedapproach.org
    Updated Apr 6, 2021
    Cite
    The Rivers Trust (2021). Asthma (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/1c87a458b35d4df38e0744ae039b8e0e
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of asthma (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to asthma (in persons of all ages).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s population (all ages) with asthma was estimated. This was achieved by calculating a weighted average based on:
    • The percentage of the MSOA area that was covered by each GP practice’s catchment area
    • Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s population with asthma was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with asthma, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:
    A) the PERCENTAGE of the population within that MSOA who are estimated to have asthma
    B) the NUMBER of people within that MSOA who are estimated to have asthma

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have asthma, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from asthma, and where those people make up a large percentage of the population, indicating there is a real issue with asthma within the population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS
    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).
    2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.
    3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of asthma, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of asthma.

    TO BE VIEWED IN COMBINATION WITH:
    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:
    • Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    • Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA
    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:
    • Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    • GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement:
    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
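
    The weighted-average and relative-score steps in the analysis methodology of this entry could be sketched along the following lines. This is only an illustration under stated assumptions: the description does not say how weights were normalised or how scores were rescaled to the 0 to 1 range, so dividing by the total weight and using min-max scaling are guesses, and all function names and numbers are hypothetical.

```python
def msoa_prevalence(coverage):
    """coverage: list of (area_share, practice_prevalence_pct) pairs, one per
    GP practice overlapping the MSOA. Returns the area-weighted prevalence."""
    total_weight = sum(weight for weight, _ in coverage)
    if total_weight == 0:
        return None  # no GP catchment covers this MSOA
    return sum(weight * pct for weight, pct in coverage) / total_weight

def min_max(values):
    """Rescale values to the range 0..1 (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def relative_scores(prevalence_pct, population):
    """Combine score A (percentage) and score B (estimated number of people)
    into one relative score per MSOA, where 1 = worst and 0 = best."""
    counts = [pct / 100 * pop for pct, pop in zip(prevalence_pct, population)]
    score_a = min_max(prevalence_pct)
    score_b = min_max(counts)
    averaged = [(a + b) / 2 for a, b in zip(score_a, score_b)]
    return min_max(averaged)

# One MSOA covered half-and-half by two practices with 6% and 8% prevalence.
print(msoa_prevalence([(0.5, 6.0), (0.5, 8.0)]))

# Three hypothetical MSOAs: prevalence (%) and mid-year population.
scores = relative_scores([5.0, 7.0, 9.0], [1000, 2000, 3000])
print(scores)
```

    Because the final score is relative, it only ranks MSOAs against one another; it does not by itself indicate an absolute level of illness.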

  7. Data set used in "Generalized statistics: applications to data inverse...

    • figshare.com
    txt
    Updated Jan 12, 2023
    Cite
    Sérgio da Silva (2023). Data set used in "Generalized statistics: applications to data inverse problems with outlier-resistance" [Dataset]. http://doi.org/10.6084/m9.figshare.21878394.v1
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Sérgio da Silva
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set used in "Generalized statistics: applications to data inverse problems with outlier-resistance"

  8. Cancer (in persons of all ages): England

    • data.catchmentbasedapproach.org
    • hub.arcgis.com
    Updated Apr 6, 2021
    Cite
    The Rivers Trust (2021). Cancer (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/cancer-in-persons-of-all-ages-england
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of cancer (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to cancer (in persons of all ages).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s population (all ages) with cancer was estimated. This was achieved by calculating a weighted average based on:
    • The percentage of the MSOA area that was covered by each GP practice’s catchment area
    • Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s population with cancer was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with cancer, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:
    A) the PERCENTAGE of the population within that MSOA who are estimated to have cancer
    B) the NUMBER of people within that MSOA who are estimated to have cancer

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have cancer, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from cancer, and where those people make up a large percentage of the population, indicating there is a real issue with cancer within the population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS
    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).
    2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.
    3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of cancer, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of cancer.

    TO BE VIEWED IN COMBINATION WITH:
    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:
    • Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    • Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA
    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:
    • Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    • GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.
    • MSOA boundaries: © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021.
    • Population data: Mid-2019 (June 30) Population Estimates for Middle Layer Super Output Areas in England and Wales. © Office for National Statistics licensed under the Open Government Licence v3.0. © Crown Copyright 2020.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement:
    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital; © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021. © Crown Copyright 2020.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
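
    The discrepancy check mentioned in limitation 1 of this entry (year-on-year differences falling outside mean +/- 1 St.Dev.) might be approximated as below. This is one possible reading of that sentence, not the Trust's actual procedure: taking the difference per GP practice, using the sample standard deviation, and the function name and example values are all assumptions.

```python
from statistics import mean, stdev

def flag_discrepancies(year_a, year_b, n_sd=1.0):
    """Return indices of practices whose change between two years lies
    outside mean +/- n_sd standard deviations of all changes."""
    diffs = [b - a for a, b in zip(year_a, year_b)]
    centre = mean(diffs)
    spread = stdev(diffs)  # sample SD; the source does not say which SD was used
    lower = centre - n_sd * spread
    upper = centre + n_sd * spread
    return [i for i, d in enumerate(diffs) if d < lower or d > upper]

# Hypothetical prevalence (%) for six practices in 2018/19 and 2019/20.
prev_2018_19 = [5.0, 5.1, 4.9, 5.0, 5.2, 5.0]
prev_2019_20 = [5.1, 5.0, 5.0, 5.1, 5.1, 9.0]
flagged = flag_discrepancies(prev_2018_19, prev_2019_20)
print(flagged)
```

    A flag of this kind only marks a practice for caution; as the description says, it does not establish which of the two years holds the erroneous value.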

  9. Stream water-quality summary statistics and outliers, streamwater load...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Stream water-quality summary statistics and outliers, streamwater load models and yield estimates, and peak flow modeling parameters for 13 watersheds in Gwinnett County, Georgia [Dataset]. https://catalog.data.gov/dataset/stream-water-quality-summary-statistics-and-outliers-streamwater-load-models-and-yield-est
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Gwinnett County
    Description

    Data release includes the following five data tables: (1) water-quality constituent outliers that were removed from the calibration of regression models used to estimate streamwater solute loads, (2) parameters used to model peak streamflow recurrence intervals, (3) models used to estimate streamwater constituent loads, (4) statistical summaries of water-quality observations, and (5) estimated annual streamwater constituent yields. An associated metadata file is included for each of the five data tables.

  10. R code

    • figshare.com
    txt
    Updated Jun 5, 2017
    Cite
    Christine Dodge (2017). R code [Dataset]. http://doi.org/10.6084/m9.figshare.5021297.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Christine Dodge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code used for each data set to perform negative binomial regression, calculate overdispersion statistic, generate summary statistics, remove outliers

  11. Data from: Multivariate Outliers and the O3 Plot

    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Antony Unwin (2023). Multivariate Outliers and the O3 Plot [Dataset]. http://doi.org/10.6084/m9.figshare.7792115.v1
    Dataset provided by
    Taylor & Francis
    Authors
    Antony Unwin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is an R package OutliersO3 implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.
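
    The core idea of checking every possible combination of variables for outliers can be illustrated with a small sketch. This is not the OutliersO3 package and does not draw an O3 plot: the detection rule used here (root-mean-square of robust z-scores compared against a threshold) is a simple stand-in chosen for the example, and the function names, threshold and data are hypothetical.

```python
from itertools import combinations
from statistics import median

def robust_z(column):
    """Median/MAD-based z-scores for one variable."""
    med = median(column)
    mad = median([abs(v - med) for v in column])
    scale = 1.4826 * mad
    return [0.0 if scale == 0 else (v - med) / scale for v in column]

def outliers_by_combination(data, threshold=3.5):
    """data: dict mapping variable name -> list of values (equal lengths).

    Returns a dict mapping each non-empty combination of variable names to
    the list of case indices flagged as outliers for that combination.
    """
    z = {name: robust_z(col) for name, col in data.items()}
    names = sorted(data)
    n_cases = len(next(iter(data.values())))
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            flagged = []
            for i in range(n_cases):
                rms = (sum(z[name][i] ** 2 for name in combo) / len(combo)) ** 0.5
                if rms > threshold:
                    flagged.append(i)
            result[combo] = flagged
    return result

data = {
    "x": [1, 2, 3, 4, 5, 100],
    "y": [2.0, 2.1, 1.9, 2.0, 2.2, 2.0],
}
for combo, cases in outliers_by_combination(data).items():
    print(combo, cases)
```

    Seeing which combinations flag a given case hints at which variables drive its outlyingness, which is the kind of insight the description attributes to the O3 plot. Note that the number of combinations grows as 2^p - 1 for p variables, so this brute-force enumeration is only practical for a modest number of variables.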

  12. Obesity in adults (ages 18 plus): England

    • data.catchmentbasedapproach.org
    Updated May 25, 2021
    Cite
    The Rivers Trust (2021). Obesity in adults (ages 18 plus): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/obesity-in-adults-ages-18-plus-england
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of obesity in adults (aged 18+). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to obesity in adults (aged 18+).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s adult population (aged 18+) that are obese was estimated. This was achieved by calculating a weighted average based on:

    - The percentage of the MSOA area that was covered by each GP practice’s catchment area
    - Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s adult population that are obese was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA that are obese, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:

    A) the PERCENTAGE of the adult population within that MSOA who are estimated to be obese
    B) the NUMBER of adults within that MSOA who are estimated to be obese

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to be obese compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people are obese, and where those people make up a large percentage of the population, indicating there is a real issue with obesity within the adult population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS
    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. This dataset also shows rural areas (with little or no population) that do not officially fall into any GP catchment area and for which there were no statistics regarding adult obesity (although this will not affect the results of this analysis if there are no people living in those areas).

    2. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of adult obesity, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of adult obesity.

    TO BE VIEWED IN COMBINATION WITH:
    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:

    - Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    - Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA
    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:

    - Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement:
    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
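The methodology in this description (an area-weighted average of GP-level prevalence, followed by two relative scores that are averaged and rescaled) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Trust's actual workflow: the description does not say how the "relative score between 1 and 0" was computed, so min-max rescaling is assumed here, the weights are assumed to be normalised by the total coverage share, and all function names are hypothetical.

```python
def msoa_prevalence(coverage):
    """Area-weighted average prevalence (%) for one MSOA.

    `coverage` is a list of (area_share, gp_prevalence_pct) pairs, one per
    GP practice whose catchment overlaps the MSOA. Normalising by the total
    share is an assumption; the source only says "weighted average".
    """
    total_share = sum(share for share, _ in coverage)
    return sum(share * pct for share, pct in coverage) / total_share


def min_max(values):
    """Rescale values to the range 0..1 (assumed form of the relative score)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def relative_scores(prevalence_pct, population):
    """Combine score A (percentage) and score B (number) into one 0..1 score."""
    # Estimated number of people affected in each MSOA.
    counts = [pct / 100.0 * pop for pct, pop in zip(prevalence_pct, population)]
    score_a = min_max(prevalence_pct)   # A) percentage of the population
    score_b = min_max(counts)           # B) number of people
    averaged = [(a + b) / 2.0 for a, b in zip(score_a, score_b)]
    return min_max(averaged)            # rescale again so 1 = worst, 0 = best


# Made-up illustrative inputs: three MSOAs with equal populations.
print(msoa_prevalence([(50, 10.0), (50, 20.0)]))
print(relative_scores([10.0, 20.0, 30.0], [1000, 1000, 1000]))
```

Because both components are rescaled before averaging, an MSOA scores close to 1 only when it ranks high on both the percentage and the absolute number, which matches the interpretation given in the description.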

  13. c

    Hypertension (in persons of all ages): England

    • data.catchmentbasedapproach.org
    • hub.arcgis.com
    Updated Apr 7, 2021
    Cite
    The Rivers Trust (2021). Hypertension (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/hypertension-in-persons-of-all-ages-england
    Explore at:
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of hypertension (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to hypertension (in persons of all ages).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s population (all ages) with hypertension was estimated. This was achieved by calculating a weighted average based on:

    - The percentage of the MSOA area that was covered by each GP practice’s catchment area
    - Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s population with hypertension was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with hypertension, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:

    A) the PERCENTAGE of the population within that MSOA who are estimated to have hypertension
    B) the NUMBER of people within that MSOA who are estimated to have hypertension

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have hypertension, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from hypertension, and where those people make up a large percentage of the population, indicating there is a real issue with hypertension within the population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS
    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).

    2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.

    3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of hypertension, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of hypertension.

    TO BE VIEWED IN COMBINATION WITH:
    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:

    - Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    - Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA
    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:

    - Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement:
    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
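The first limitation mentions flagging "large discrepancies" between the 2018/19 and 2019/20 figures, described as differences "> mean +/- 1 St.Dev.". One possible reading of that rule is sketched below; it is an interpretation, not the documented procedure. It assumes the difference is taken per GP practice as later year minus earlier year, that the sample standard deviation is used, and that a value is flagged when it falls outside the band mean ± 1 SD. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def flag_discrepancies(values_2018_19, values_2019_20):
    """Flag practices whose year-on-year change lies outside mean +/- 1 SD.

    Both arguments are equal-length lists of a GP-level statistic (for
    example, prevalence in %), one entry per practice.
    """
    diffs = [b - a for a, b in zip(values_2018_19, values_2019_20)]
    centre = mean(diffs)
    spread = stdev(diffs)  # sample standard deviation (an assumption)
    lower, upper = centre - spread, centre + spread
    return [d < lower or d > upper for d in diffs]


# Made-up illustrative values: only the last practice changes sharply.
earlier = [10.0, 10.0, 10.0, 10.0, 10.0]
later = [10.0, 11.0, 10.0, 9.0, 20.0]
print(flag_discrepancies(earlier, later))
```

A band of one standard deviation is narrow, so on real data this kind of rule would be expected to flag a sizeable share of practices, which is consistent with the description treating flagged areas as ones to interpret with caution rather than as confirmed errors.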

  14. d

    Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 23, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
    Explore at:
    Dataset updated
    Nov 23, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022).

    Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset.

    For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:

    - huc12: The 12-digit hydrologic unit code
    - ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses
    - statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
    - year: 4-digit year
    - month: 2-digit month
    - count: Number of Landsat overpasses included in the ET calculation in the month
    - et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
    - count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent

    For the Ensemble data, these additional variables are included in the CSV files:

    - et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
    - et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
    - et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
    - et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
    - et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal

    Below are the locations of each OpenET image collection used in this summary:

    - DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
    - eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
    - geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
    - PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
    - SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
    - SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
    - Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
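To make the ensemble variables more concrete, the sketch below shows one way a MAD-filtered ensemble could be computed from six model values for a single location and month. It is a minimal illustration, not the OpenET implementation: the cutoff of 2 MADs (`k=2.0`) is an assumption, the function name is hypothetical, and any further rules the published method may apply are not reproduced. The output keys reuse the column names listed in the description.

```python
from statistics import mean, median


def mad_ensemble(model_values, k=2.0):
    """Summarise model ET values with and without MAD outlier filtering.

    `model_values` holds one ET value (mm) per model. A value is kept when
    it lies within k median absolute deviations of the median; k = 2.0 is
    an assumed threshold, not taken from the dataset description.
    """
    centre = median(model_values)
    mad = median(abs(v - centre) for v in model_values)
    kept = [v for v in model_values if abs(v - centre) <= k * mad]
    return {
        "et_mad": mean(kept),            # ensemble mean after filtering
        "et_mad_count": len(kept),       # number of models retained
        "et_mad_min": min(kept),         # lowest retained value
        "et_mad_max": max(kept),         # highest retained value
        "et_sam": mean(model_values),    # simple mean, no outlier removal
    }


# Made-up monthly ET values (mm) from six models; the last one is an outlier.
print(mad_ensemble([50, 52, 54, 56, 58, 120]))
```

Comparing `et_mad` with `et_sam` for the same row gives a quick indication of how much a single divergent model shifts the unfiltered mean.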

  15. c

    Depression (in adults aged 18 and over): England

    • data.catchmentbasedapproach.org
    • hub.arcgis.com
    Updated Apr 6, 2021
    Cite
    The Rivers Trust (2021). Depression (in adults aged 18 and over): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/depression-in-adults-aged-18-and-over-england
    Explore at:
    Dataset updated
    Apr 6, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARY
    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of depression in adults (aged 18+). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY
    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to depression in adults (aged 18+).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s population (aged 18+) with depression was estimated. This was achieved by calculating a weighted average based on:

    - The percentage of the MSOA area that was covered by each GP practice’s catchment area
    - Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s population with depression was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with depression, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:

    A) the PERCENTAGE of the population within that MSOA who are estimated to have depression
    B) the NUMBER of people within that MSOA who are estimated to have depression

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have depression, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from depression, and where those people make up a large percentage of the population, indicating there is a real issue with depression within the population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS
    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).

    2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.

    3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of depression, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of depression.

    TO BE VIEWED IN COMBINATION WITH:
    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:

    - Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    - Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA
    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES
    This dataset was produced using:

    - Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE
    The reproduction of this data must be accompanied by the following statement:
    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE
    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  16. Z

    Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing...

    • data.niaid.nih.gov
    Updated Feb 4, 2021
    Cite
    Michael Moret (2021). Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3240299
    Explore at:
    Dataset updated
    Feb 4, 2021
    Dataset provided by
    Michael Moret
    Ljubisa Miskovic
    Jonas Béal
    Vassily Hatzimanikatis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data needed to reproduce the results from the manuscript “Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties” by L. Miskovic, J. Beal, M. Moret, and V. Hatzimanikatis.

    1. Data generated with the ORACLE workflow that was used in the iSCHRUNK training:

    Classification label vectors for the three analyzed metabolic concentration cases:

    Reference case: class_vector_train_ref.mat

    Extreme1 case: class_vector_train_ex1.mat

    Extreme2 case: class_vector_train_ex2.mat

    Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.

    Reference case: training_set_ref.mat

    Extreme1 case: training_set_ex1.mat

    Extreme2 case: training_set_ex2.mat

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.

    Reference case: ccXTR_ref.mat

    Extreme1 case: ccXTR_ex1.mat

    Extreme2 case: ccXTR_ex2.mat

    Thermodynamics-based Flux Analysis (TFA) models for the three cases:

    Reference case: tfa_ref.mat

    Extreme1 case: tfa_ex1.mat

    Extreme2 case: tfa_ex2.mat

    2. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 4).

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.

    ccXTR_ValidNeg.mat

    Parameter sets used in validation

    validation_set_neg.mat

    3. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Table 3).

    Negative control:

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.

    Reference case: ccXTR_ValidRef_neg_agg.mat

    Extreme1 case: ccXTR_ValidEx1_neg_agg.mat

    Extreme2 case: ccXTR_ValidEx2_neg_agg.mat

    Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.

    Reference case: validation_set_ref_neg_agg.mat

    Extreme1 case: validation_set_ref_neg_agg.mat

    Extreme2 case: tvalidation_set_ref_neg_agg.mat

    Positive control:

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.

    Reference case: ccXTR_ValidRef_pos_agg.mat

    Extreme1 case: ccXTR_ValidEx1_pos_agg.mat

    Extreme2 case: ccXTR_ValidEx2_pos_agg.mat

    Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.

    Reference case: validation_set_ref_pos_agg.mat

    Extreme1 case: validation_set_ex1_pos_agg.mat

    Extreme2 case: validation_set_ex2_pos_agg.mat

    4. Reassignment study: validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 6 and Table 4).

    Negative control:

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.

    Reference case: ccXTR_Valid_reassignment_neg.mat

    Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.

    Reference case: validation_set_neg_reassignment.mat

    Positive control:

    Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.

    Reference case: ccXTR_Valid_reassignment_pos.mat

    Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.

    Reference case: validation_set_pos_reassignment.mat

  17. Data from: Fast robust SUR with economical and actuarial applications

    • search.datacite.org
    • wiley.figshare.com
    Updated Jul 14, 2016
    Cite
    Mia Hubert; Tim Verdonck (2016). Data from: Fast robust SUR with economical and actuarial applications [Dataset]. http://doi.org/10.6084/m9.figshare.3408073
    Explore at:
    Dataset updated
    Jul 14, 2016
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Wiley
    Authors
    Mia Hubert; Tim Verdonck
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard Feasible Generalized Least Squares (FGLS) estimator is efficient as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277-288, 2000) can accommodate outliers, but it is hard to compute. First we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real data set from economics. Next we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the General Multivariate Chain Ladder (GMCL) model that employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.
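For readers unfamiliar with the baseline that the abstract contrasts against, the sketch below implements the standard, non-robust two-step FGLS estimator for a SUR system in plain Python. It is not FastSUR and not the robust estimator of Bilodeau and Duchesne; it only illustrates how the error covariance across equations enters the estimate. All function names are hypothetical, and the covariance is estimated with divisor n, which is one common convention.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def cross(X, Z):
    """Return the matrix product X'Z for two row-major matrices."""
    return [[sum(xr[a] * zr[b] for xr, zr in zip(X, Z))
             for b in range(len(Z[0]))] for a in range(len(X[0]))]


def ols(X, y):
    """Ordinary least squares via the normal equations."""
    xty = [row[0] for row in cross(X, [[v] for v in y])]
    return solve(cross(X, X), xty)


def fgls_sur(Xs, ys):
    """Two-step FGLS for SUR: Xs[i] and ys[i] belong to equation i."""
    m, n = len(Xs), len(ys[0])
    # Step 1: equation-by-equation OLS and residuals.
    betas = [ols(X, y) for X, y in zip(Xs, ys)]
    res = [[y[t] - sum(x * b for x, b in zip(X[t], beta)) for t in range(n)]
           for X, y, beta in zip(Xs, ys, betas)]
    # Step 2: estimate the contemporaneous error covariance and invert it.
    sigma = [[sum(ri[t] * rj[t] for t in range(n)) / n for rj in res]
             for ri in res]
    inv_cols = [solve(sigma, [1.0 if r == c else 0.0 for r in range(m)])
                for c in range(m)]
    # Step 3: GLS normal equations, built block by block.
    ks = [len(X[0]) for X in Xs]
    offs = [sum(ks[:i]) for i in range(m)]
    total = sum(ks)
    A = [[0.0] * total for _ in range(total)]
    rhs = [0.0] * total
    for i in range(m):
        for j in range(m):
            w = inv_cols[j][i]  # element (i, j) of the inverse covariance
            xtx = cross(Xs[i], Xs[j])
            xty = cross(Xs[i], [[v] for v in ys[j]])
            for a in range(ks[i]):
                rhs[offs[i] + a] += w * xty[a][0]
                for b in range(ks[j]):
                    A[offs[i] + a][offs[j] + b] += w * xtx[a][b]
    flat = solve(A, rhs)
    return [flat[offs[i]:offs[i] + ks[i]] for i in range(m)]


# Made-up data: two equations sharing the same regressors (intercept, trend).
X = [[1.0, float(t)] for t in range(6)]
y1 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
print(fgls_sur([X, X], [y1, y2]))
```

When every equation has the same regressors, as in this toy example, the FGLS estimate coincides with equation-by-equation OLS, which gives a simple sanity check. Because both steps rely on least-squares residuals, a few outlying observations can distort the covariance estimate and the coefficients, which is the sensitivity the abstract refers to.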

  18. Predictive Validity Data Set

    • figshare.com
    txt
    Updated Dec 18, 2022
    Cite
    Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Antonio Abeyta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE Scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were representative of the new scoring system and therefore were not able to be compared to the older scores based on raw score. After removal of these 96 records from our analyses, a total of 420 student records remained which included students that were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency in the participants, we removed 100 additional records so that our analyses only considered students that had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below).

    Outliers

    We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed.

    Sample

    See detailed description in the Participants section.

    Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality regarding outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables were also observed such as gender, race, ethnicity, and citizenship status within the samples.

    Predictive Metrics. The input variables used in this study were GPA and scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.

    Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student’s candidacy examination result.
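As a small illustration of the Pearson correlation coefficient mentioned in this description, the sketch below computes it from first principles for two paired lists. It is not the authors' analysis (which was run in statistical software); the function name is hypothetical and the example values are made up purely to show the calculation, for instance pairing a GRE score with time to degree.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only (not taken from the data set).
gre_scores = [520, 600, 640, 700, 760]
years_to_degree = [6.5, 5.0, 6.0, 5.5, 4.5]
print(pearson_r(gre_scores, years_to_degree))
```

A value near 0 would indicate little linear relationship between the predictor and time to degree, while values near -1 or 1 would indicate a strong negative or positive linear relationship.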

  19. c

    Levels of obesity and inactivity related illnesses (physical illnesses):...

    • data.catchmentbasedapproach.org
    Updated Apr 7, 2021
    + more versions
    Cite
    The Rivers Trust (2021). Levels of obesity and inactivity related illnesses (physical illnesses): Summary (England) [Dataset]. https://data.catchmentbasedapproach.org/datasets/levels-of-obesity-and-inactivity-related-illnesses-physical-illnesses-summary-england
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARY

    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of physical illnesses that are linked with obesity and inactivity. Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY

    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to:

    - Asthma (in persons of all ages)
    - Cancer (in persons of all ages)
    - Chronic kidney disease (in adults aged 18+)
    - Coronary heart disease (in persons of all ages)
    - Diabetes mellitus (in persons aged 17+)
    - Hypertension (in persons of all ages)
    - Stroke and transient ischaemic attack (in persons of all ages)

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    For each of the above illnesses, the percentage of each MSOA’s population with that illness was estimated. This was achieved by calculating a weighted average based on:

    - The percentage of the MSOA area that was covered by each GP practice’s catchment area
    - Of the GPs that covered part of that MSOA: the percentage of patients registered with each GP that have that illness

    The estimated percentage of each MSOA’s population with each illness was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with each illness, within the relevant age range.

    For each illness, each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:

    A) the PERCENTAGE of the population within that MSOA who are estimated to have that illness
    B) the NUMBER of people within that MSOA who are estimated to have that illness

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA predicted to have that illness, compared to other MSOAs. In other words, those are areas where a large number of people are predicted to suffer from an illness, and where those people make up a large percentage of the population, indicating there is a real issue with that illness within the population and the investment of resources to address that issue could have the greatest benefits.

    The scores for each of the 7 illnesses were added together then converted to a relative score between 1 and 0 (1 = worst, 0 = best), to give an overall score for each MSOA: a score close to 1 would indicate that an area has high predicted levels of all obesity/inactivity-related illnesses, and these are areas where the local population could benefit the most from interventions to address those illnesses. A score close to 0 would indicate very low predicted levels of obesity/inactivity-related illnesses and therefore interventions might not be required.

    LIMITATIONS

    1. GPs do not have catchments that are mutually exclusive from each other: they overlap, with some geographic areas being covered by 30+ practices. This dataset should be viewed in combination with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset to identify where there are areas that are covered by multiple GP practices but at least one of those GP practices did not provide data. Results of the analysis in these areas should be interpreted with caution, particularly if the levels of obesity/inactivity-related illnesses appear to be significantly lower than the immediate surrounding areas.

    2. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID-19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).

    3. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.

    4. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of obesity/inactivity-related illnesses, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of these illnesses.

    TO BE VIEWED IN COMBINATION WITH:

    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:

    - Health and wellbeing statistics (GP-level, England): Missing data and potential outliers

    DOWNLOADING THIS DATA

    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES

    This dataset was produced using:

    Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE

    The reproduction of this data must be accompanied by the following statement:

    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE

    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
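
The scoring steps in the analysis methodology above can be illustrated with a minimal Python sketch. It makes two assumptions that the description does not confirm: the weighted average is normalised by the sum of the coverage weights, and the "relative score between 1 and 0" is a min-max rescaling across MSOAs. All function names are hypothetical.

```python
def weighted_prevalence(coverage_pct, prevalence_pct):
    """Estimate an MSOA's illness prevalence from overlapping GP catchments.

    coverage_pct[i]   : % of the MSOA area covered by GP practice i
    prevalence_pct[i] : % of practice i's registered patients with the illness
    Assumption: weights are normalised by their sum, since catchments overlap.
    """
    total_weight = sum(coverage_pct)
    return sum(w * p for w, p in zip(coverage_pct, prevalence_pct)) / total_weight


def min_max(values):
    """Rescale values to the range 0..1 (assumed meaning of 'relative score')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def illness_scores(percentages, counts):
    """Score A (percentage) and score B (number), averaged, then rescaled."""
    score_a = min_max(percentages)
    score_b = min_max(counts)
    averaged = [(a + b) / 2 for a, b in zip(score_a, score_b)]
    return min_max(averaged)


def overall_scores(per_illness_scores):
    """Add the per-illness scores for each MSOA, then rescale to 0..1."""
    totals = [sum(scores) for scores in zip(*per_illness_scores)]
    return min_max(totals)
```

Under these assumptions, the MSOA with the highest combined percentage and count for an illness receives a score of 1 and the lowest receives 0, which matches the "1 = worst, 0 = best" reading in the description.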

  20. Extended 1.0 Dataset of "Concentration and Geospatial Modelling of Health...

    • zenodo.org
    bin, csv, pdf
    Updated Sep 23, 2024
    Peter Domjan; Viola Angyal; Istvan Vingender (2024). Extended 1.0 Dataset of "Concentration and Geospatial Modelling of Health Development Offices' Accessibility for the Total and Elderly Populations in Hungary" [Dataset]. http://doi.org/10.5281/zenodo.13826993
    Available download formats
    bin, pdf, csv
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Domjan; Viola Angyal; Istvan Vingender
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 23, 2024
    Area covered
    Hungary
    Description

    Introduction

    We are enclosing the database used in our research titled "Concentration and Geospatial Modelling of Health Development Offices' Accessibility for the Total and Elderly Populations in Hungary", along with our statistical calculations. For the sake of reproducibility, further information can be found in the files Short_Description_of_Data_Analysis.pdf and Statistical_formulas.pdf.

    The sharing of data is part of our aim to strengthen the base of our scientific research. As of March 7, 2024, the detailed submission and analysis of our research findings to a scientific journal has not yet been completed.

    The dataset was expanded on 23rd September 2024 to include SPSS statistical analysis data, a heatmap, and buffer zone analysis around the Health Development Offices (HDOs) created in QGIS software.

    Short Description of Data Analysis and Attached Files (datasets):

    Our research utilised data from 2022, serving as the basis for statistical standardisation. The 2022 Hungarian census provided an objective basis for our analysis, with age group data available at the county level from the Hungarian Central Statistical Office (KSH) website. The 2022 demographic data provided an accurate picture compared to the data available from the 2023 microcensus. The calculation used is based on our standardisation of the 2022 data. For xlsx files, we used MS Excel 2019 (version: 1808, build: 10406.20006) with the SOLVER add-in.

    Hungarian Central Statistical Office served as the data source for population by age group, county, and regions: https://www.ksh.hu/stadat_files/nep/hu/nep0035.html, (accessed 04 Jan. 2024.) with data recorded in MS Excel in the Data_of_demography.xlsx file.

    In 2022, 108 Health Development Offices (HDOs) were operational, and it's noteworthy that no developments have occurred in this area since 2022. The availability of these offices and the demographic data from the Central Statistical Office in Hungary are considered public interest data, freely usable for research purposes without requiring permission.

    The contact details for the Health Development Offices were sourced from the following page (Hungarian National Population Centre (NNK)): https://www.nnk.gov.hu/index.php/efi (n=107). The Semmelweis University Health Development Centre was not listed by NNK, hence it was separately recorded as the 108th HDO. More information about the office can be found here: https://semmelweis.hu/egeszsegfejlesztes/en/ (n=1). (accessed 05 Dec. 2023.)

    Geocoordinates were determined using Google Maps (N=108): https://www.google.com/maps. (accessed 02 Jan. 2024.) The geocoordinates (latitude and longitude according to the WGS 84 standard), address data (postal code, town name, street, and house number), and the name of each HDO were recorded in the Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file.

    The foundational software for geospatial modelling and display (QGIS 3.34), an open-source software, can be downloaded from: https://qgis.org/en/site/forusers/download.html. (accessed 04 Jan. 2024.)

    The HDOs_GeoCoordinates.gpkg QGIS project file contains Hungary's administrative map and the recorded addresses of the HDOs from the Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file, imported via .csv file.

    The OpenStreetMap tileset is directly accessible from www.openstreetmap.org in QGIS. (accessed 04 Jan. 2024.)

    The Hungarian county administrative boundaries were downloaded from the following website: https://data2.openstreetmap.hu/hatarok/index.php?admin=6 (accessed 04 Jan. 2024.)

    HDO_Buffers.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding buffer zones with a radius of 7.5 km.
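
As a rough illustration of what a 7.5 km buffer zone means for WGS 84 coordinates like those recorded for the HDOs, the following Python sketch checks whether a point falls within 7.5 km of any office. It uses the haversine great-circle distance, which is an assumption of this sketch and not necessarily how the buffers were produced in QGIS; the function names and the Earth radius constant are likewise illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS 84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(point, offices, radius_km=7.5):
    """True if `point` (lat, lon) lies within radius_km of any office (lat, lon)."""
    lat, lon = point
    return any(
        haversine_km(lat, lon, office_lat, office_lon) <= radius_km
        for office_lat, office_lon in offices
    )
```

A projected coordinate system would normally be preferred for buffering in a GIS; the great-circle approximation is used here only to keep the sketch self-contained.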

    Heatmap.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding heatmap (Kernel Density Estimation).

    A brief description of the statistical formulas applied is included in the Statistical_formulas.pdf.

    Recording of our base data for statistical concentration and diversification measurement was done using MS Excel 2019 (version: 1808, build: 10406.20006) in .xlsx format.

    • Aggregated number of HDOs by county: Number_of_HDOs.xlsx
    • Standardised data (Number of HDOs per 100,000 residents): Standardized_data.xlsx
    • Calculation of the Lorenz curve: Lorenz_curve.xlsx
    • Calculation of the Gini index: Gini_Index.xlsx
    • Calculation of the LQ index: LQ_Index.xlsx
    • Calculation of the Herfindahl-Hirschman Index: Herfindahl_Hirschman_Index.xlsx
    • Calculation of the Entropy index: Entropy_Index.xlsx
    • Regression and correlation analysis calculation: Regression_correlation.xlsx
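
For readers without access to the spreadsheets, the concentration measures listed above can be sketched in Python. These are common textbook definitions and may differ from the exact formulas in Statistical_formulas.pdf, which is not reproduced here; the function names are hypothetical.

```python
import math


def rate_per_100k(hdo_count, population):
    """Standardised number of HDOs per 100,000 residents."""
    return hdo_count / population * 100_000


def gini(values):
    """Gini index from sorted values (0 = perfectly equal distribution)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def herfindahl_hirschman(values):
    """Sum of squared shares; ranges from 1/n (even) to 1 (fully concentrated)."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)


def shannon_entropy(values):
    """Shannon entropy of the shares (natural logarithm); zero shares are skipped."""
    total = sum(values)
    return -sum((v / total) * math.log(v / total) for v in values if v > 0)


def location_quotient(local_count, local_population, total_count, total_population):
    """Ratio of the local HDO-to-population share to the national share."""
    return (local_count / local_population) / (total_count / total_population)
```

A location quotient above 1 would indicate that a county has more HDOs relative to its population than the country as a whole, under this definition.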

    Using the SPSS 29.0.1.0 program, we performed the following statistical calculations with the databases Data_HDOs_population_without_outliers.sav and Data_HDOs_population.sav:

    • Regression curve estimation with elderly population and number of HDOs, excluding outlier values (Types of analyzed equations: Linear, Logarithmic, Inverse, Quadratic, Cubic, Compound, Power, S, Growth, Exponential, Logistic, with summary and ANOVA analysis table): Curve_estimation_elderly_without_outlier.spv
    • Pearson correlation table between the total population, elderly population, and number of HDOs per county, excluding outlier values such as Budapest and Pest County: Pearson_Correlation_populations_HDOs_number_without_outliers.spv.
    • Dot diagram including total population and number of HDOs per county, excluding outlier values such as Budapest and Pest Counties: Dot_HDO_total_population_without_outliers.spv.
    • Dot diagram including elderly (64<) population and number of HDOs per county, excluding outlier values such as Budapest and Pest Counties: Dot_HDO_elderly_population_without_outliers.spv
    • Regression curve estimation with total population and number of HDOs, excluding outlier values (Types of analyzed equations: Linear, Logarithmic, Inverse, Quadratic, Cubic, Compound, Power, S, Growth, Exponential, Logistic, with summary and ANOVA analysis table): Curve_estimation_without_outlier.spv
    • Dot diagram including elderly (64<) population and number of HDOs per county: Dot_HDO_elderly_population.spv
    • Dot diagram including total population and number of HDOs per county: Dot_HDO_total_population.spv
    • Pearson correlation table between the total population, elderly population, and number of HDOs per county: Pearson_Correlation_populations_HDOs_number.spv
    • Regression curve estimation with total population and number of HDOs, (Types of analyzed equations: Linear, Logarithmic, Inverse, Quadratic, Cubic, Compound, Power, S, Growth, Exponential, Logistic, with summary and ANOVA analysis table): Curve_estimation_total_population.spv
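
The with-outlier and without-outlier correlation tables listed above can be mimicked with a short Python sketch. It is only an illustration of the Pearson coefficient and of excluding named counties before computing it; it does not reproduce the SPSS output, and the row layout (county name, population, number of HDOs) and function names are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def drop_counties(rows, excluded):
    """Remove rows whose county name (first field) is in `excluded`."""
    return [row for row in rows if row[0] not in excluded]


def correlation_population_hdos(rows, excluded=()):
    """Correlate population (field 1) with HDO count (field 2), optionally
    excluding outlier counties such as Budapest and Pest."""
    kept = drop_counties(rows, set(excluded))
    return pearson_r([r[1] for r in kept], [r[2] for r in kept])
```

Calling `correlation_population_hdos(rows)` and `correlation_population_hdos(rows, excluded=("Budapest", "Pest"))` on the same rows gives the two variants side by side.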

    For easier readability, the files have been provided in both SPV and PDF formats.

    The translation of these supplementary files into English was completed on 23rd Sept. 2024.

    If you have any further questions regarding the dataset, please contact the corresponding author: domjan.peter@phd.semmelweis.hu

Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2

MNIST dataset for Outliers Detection - [ MNIST4OD ]

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats
application/gzip
Dataset updated
May 17, 2024
Dataset provided by
figshare
Authors
Giovanni Stilo; Bardh Prenkaj
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) in each line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

The statistics of each dataset are listed below:

Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
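
The generation procedure described above can be sketched in Python as follows. This is not the authors' code: the function name, the fixed seed, rounding the outlier count up, and placing the original class in the second-to-last column are all assumptions made for illustration (the description only states that the outlier label is the last column).

```python
import random


def build_outlier_dataset(samples, inlier_digit, outlier_percent=10, seed=0):
    """Build one MNIST4OD-style dataset from (flat_vector, digit) pairs.

    Inliers are all samples of `inlier_digit`; outliers are drawn uniformly
    without replacement from the other digits, numbering `outlier_percent`
    percent of the inliers (rounded up, an assumption of this sketch).
    Each output row is: pixel values, original class, outlier flag (0/1).
    """
    inliers = [s for s in samples if s[1] == inlier_digit]
    candidates = [s for s in samples if s[1] != inlier_digit]

    # Integer ceiling of len(inliers) * outlier_percent / 100.
    n_outliers = (len(inliers) * outlier_percent + 99) // 100

    rng = random.Random(seed)
    outliers = rng.sample(candidates, n_outliers)

    rows = [list(vec) + [digit, 0] for vec, digit in inliers]
    rows += [list(vec) + [digit, 1] for vec, digit in outliers]
    return rows
```

With real MNIST data, each `flat_vector` would hold the 784 values of a flattened 28 x 28 image, so every row would have 786 columns under the layout assumed here.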
