18 datasets found
  1. d

    Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    Updated Apr 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

  2. Multi-Domain Outlier Detection Dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Mar 31, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hannah Kerner; Hannah Kerner; Umaa Rebbapragada; Umaa Rebbapragada; Kiri Wagstaff; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha (2022). Multi-Domain Outlier Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.5941339
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Hannah Kerner; Hannah Kerner; Umaa Rebbapragada; Umaa Rebbapragada; Kiri Wagstaff; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha
    Description

    The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:

    1. Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)
    2. Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)
    3. Earth science: detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)
    4. Fashion-MNIST/MNIST: benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)

    Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).

    To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:

    Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.

  3. d

    Integrated Building Health Management

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Apr 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Abstract: Building health management is an important part in running an efficient and cost-effective building. Many problems in a building’s system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building‘s health with data driven methods throughout the day. Orca and IMS are two state of the art algorithms that observe an array of building health sensors and provide feedback on the overall system’s health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls. Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating & cooling temperatures for the air and water systems as well as other various subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was focused from July 1st through August 10th 2009. Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distanced based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal however days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to asses how anomaly the particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified the contributing parameters were then ranked by the algorithm. Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system. -7/03/09 (Figure 2 & 3) AHU-1 Return Air Temperature (RAT) Calculated Average Return Air Temperature -7/19/09 (Figure 3 & 4) AHU-2 Return Air Temperature (RAT) Calculated Average Return Air Temperature IMS identified significantly higher temperatures compared to other days during the month of July and August. Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real time anomaly score to help building maintenance with detection and diagnosis of problems.

  4. Enhanced US-GAAP Financial Statement Data Set

    • kaggle.com
    Updated Mar 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vadim Vanak (2025). Enhanced US-GAAP Financial Statement Data Set [Dataset]. https://www.kaggle.com/datasets/vadimvanak/step-2
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Vadim Vanak
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset builds upon "Financial Statement Data Sets" by incorporating several key improvements to enhance the accuracy and usability of US-GAAP financial data from SEC filings of U.S. exchange-listed companies. Drawing on submissions from January 2009 onward, the enhanced dataset aims to provide analysts with a cleaner, more consistent dataset by addressing common challenges found in the original data.

    Key Enhancements:

    1. Outlier Detection and Correction: Outliers in the original dataset have been systematically identified and corrected, providing more reliable financial figures.
    2. Amendment Adjustments: In cases where SEC rules allow amendment filings to only include delta figures, full figures from the original submissions have been carried over for consistency, facilitating more straightforward analysis.
    3. Missing Figure Estimation: Using calculation arcs from the US-GAAP taxonomy, missing financial figures have been computed where possible, ensuring greater completeness.
    4. Data Structuring: Financial figures that previously appeared as separate rows have been consolidated into single rows with new columns, offering a cleaner structure.

    Scope:

    • Data Scope: The dataset is restricted to figures reported under US-GAAP standards, with the exception of EntityCommonStockSharesOutstanding and EntityPublicFloat.
    • Currency and Units: The dataset exclusively includes figures reported in USD or shares, ensuring uniformity and comparability. It excludes ratios and non-financial metrics to maintain focus on financial data.
    • Company Selection: The dataset is limited to companies with U.S. exchange tickers, providing a concentrated analysis of publicly traded firms within the United States.
    • Submission Types: The dataset only incorporates data from 10-Q, 10-K, 10-Q/A, and 10-K/A filings, ensuring consistency in the type of financial reports analyzed.

    Dataset Features:

    • Refined Financial Data: Accurate and consistent figures by addressing reporting issues, corrections for outliers, and data consolidation.
    • Enhanced Usability: By handling amendment submissions and leveraging GAAP taxonomies, the dataset offers a more analysis-friendly structure.
    • Improved Completeness: Where original submissions had gaps in reporting, this dataset fills those gaps using calculated figures based on accounting principles.

    The source code for data extraction is available here

  5. o

    Controlled Anomalies Time Series (CATS) Dataset

    • explore.openaire.eu
    • zenodo.org
    Updated Feb 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Patrick Fleith (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646896
    Explore at:
    Dataset updated
    Feb 16, 2023
    Authors
    Patrick Fleith
    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies. The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]: Multivariate (17 variables) including sensors reading and control signals. It simulates the operational behaviour of an arbitrary complex system including: 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment. 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna. 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc. 5 million timestamps. Sensors readings are at 1Hz sampling frequency. 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour. 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection). 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments. Different types of anomalies to understand what anomaly types can be detected by different approaches. Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data. Obvious anomalies. The simulated anomalies have been designed to be "easy" to be detected for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. It makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies. Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation. Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise. No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline. [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602” About Solenix Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  6. S

    CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost...

    • scidb.cn
    • observatorio-cientifico.ua.es
    Updated May 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Luis Marquez-Carpintero; Sergio Suescun-Ferrandiz; Monica Pina-Navarro; Francisco Gomez-Donoso; Miguel Cazorla (2024). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. http://doi.org/10.57760/sciencedb.08377
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 28, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Luis Marquez-Carpintero; Sergio Suescun-Ferrandiz; Monica Pina-Navarro; Francisco Gomez-Donoso; Miguel Cazorla
    Description

    Data DescriptionThe CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.Data Generation ProceduresThe data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.A ZED stereo camera capturing 1080p images at 25-30 fps.A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.Temporal and Spatial ScopeThe dataset contains a total of 472.03 minutes of recorded data.The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.Data was collected from 12 participants, each performing all 19 activities multiple times.The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.Dataset ComponentsThe dataset is organized into JSON and PNG files, structured hierarchically:IMU Data: Stored in JSON files, containing:Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)LSM6DSO Gyroscope (X, Y, Z values, 100Hz)Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)Samsung HR Sensor (heart rate, 1Hz)OPT3007 Light Sensor (ambient light levels, 5Hz)Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.Synchronization: Each IMU data record and image is timestamped for precise alignment.Data StructureThe dataset is divided into continuous and instantaneous activities:Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.The dataset is structured as:/continuous/subject_id/activity_name/ /camera_a/ → Left camera images /camera_b/ → Right camera images /sensors/ → JSON files with IMU data

    /instantaneous/subject_id/activity_name/repetition_id/ /camera_a/ /camera_b/ /sensors/ Data Quality & Missing DataThe smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.Synchronization latency between the smartwatch and the computer is negligible.Not all IMU samples have corresponding images due to different recording rates.Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.Error Ranges & LimitationsSensor data may contain noise due to minor hand movements.The heart rate sensor operates at 1Hz, limiting its temporal resolution.Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.File Formats & Software CompatibilityIMU data is stored in JSON format, readable with Python’s json library.Images are in PNG format, compatible with all standard image processing tools.Recommended libraries for data analysis:Python: numpy, pandas, scikit-learn, tensorflow, pytorchVisualization: matplotlib, seabornDeep Learning: Keras, PyTorchPotential ApplicationsDevelopment of activity recognition models in educational settings.Study of student engagement based on movement patterns.Investigation of sensor fusion techniques combining visual and IMU data.This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.CitationIf you find this project helpful for your research, please cite our work using the following bibtex entry:@misc{marquezcarpintero2025caddiinclassactivitydetection, title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors}, author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso}, year={2025}, eprint={2503.02853}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2503.02853}, }

  7. P

    ionosphere Dataset

    • paperswithcode.com
    • opendatalab.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fei Tony Liu; Kai Ming Ting; Zhi-Hua Zhou, ionosphere Dataset [Dataset]. https://paperswithcode.com/dataset/ionosphere
    Explore at:
    Authors
    Fei Tony Liu; Kai Ming Ting; Zhi-Hua Zhou
    Description

    The original ionosphere dataset from UCI machine learning repository is a binary classification dataset with dimensionality 34. There is one attribute having values all zeros, which is discarded. So the total number of dimensions are 33. The ‘bad’ class is considered as outliers class and the ‘good’ class as inliers.

  8. A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023):...

    • figshare.com
    csv
    Updated Feb 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin (2025). A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023): 2.82Million Record Resource for Empirical and ML-Based Research [Dataset]. http://doi.org/10.6084/m9.figshare.27800394.v2
    Explore at:
    csvAvailable download formats
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    figshare
    Authors
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data DescriptionWater Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.Countries/Regions: United States, Canada, Ireland, England, China.Years Covered: 1940-2023.Data Records: 2.82 million.Definition of ColumnsCountry: Name of the water-body region.Area: Name of the area in the region.Waterbody Type: Type of the water-body source.Date: Date of the sample collection (dd-mm-yyyy).Ammonia (mg/l): Ammonia concentration.Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.Orthophosphate (mg/l): Orthophosphate concentration.pH (pH units): pH level of water.Temperature (°C): Temperature in Celsius.Nitrogen (mg/l): Total nitrogen concentration.Nitrate (mg/l): Nitrate concentration.CCME_Values: Calculated water quality index values using the CCME WQI model.CCME_WQI: Water Quality Index classification based on CCME_Values.Data Directory Description:Category 1: DatasetCombined Data: This folder contains two CSV files: Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).Combined_dataset.csvSummary.xlsxCountry-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).England_dataset.csvCanada_dataset.csvUSA_dataset.csvIreland_dataset.csvChina_dataset.csvSummary_country.xlsxCategory 2: CodeData processing and harmonization code (e.g., Language Conversion, Date Conversion, Parameter Naming and Unit Conversion, Missing Value Handling, WQI Measurement and Classification).Data_Processing_Harmonnization.ipynbThe code used for Technical Validation (e.g., assessing the Data Distribution, Outlier Detection, Water Quality Trend Analysis, and Vrifying the Application of the Dataset for the ML Models).Technical_Validation.ipynbCategory 3: Data Collection SourcesThis category includes links to the selected dataset sources, which were used to create the dataset and are provided for further reconstruction or data formation. It contains links to various data collection sources.DataCollectionSources.xlsxOriginal Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted ResearchAbstractAssessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries in the world, e.g., the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters that span 1940 - 2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.Note: Cite this repository and the original paper when using this dataset.

  9. g

    ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • elki-project.github.io
    • zenodo.org
    Updated Sep 2, 2011
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Explore at:
    Dataset updated
    Sep 2, 2011
    Dataset provided by
    TU Dortmund University
    University of Southern Denmark, Denmark
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.

  10. c

    Cancer (in persons of all ages): England

    • data.catchmentbasedapproach.org
    • hub.arcgis.com
    Updated Apr 6, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Rivers Trust (2021). Cancer (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/datasets/cancer-in-persons-of-all-ages-england
    Explore at:
    Dataset updated
    Apr 6, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of cancer (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to cancer (in persons of all ages).This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.The percentage of each MSOA’s population (all ages) with cancer was estimated. This was achieved by calculating a weighted average based on:The percentage of the MSOA area that was covered by each GP practice’s catchment areaOf the GPs that covered part of that MSOA: the percentage of registered patients that have that illness The estimated percentage of each MSOA’s population with cancer was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with cancer, within the relevant age range.Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have cancerB) the NUMBER of people within that MSOA who are estimated to have cancerAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have cancer, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from cancer, and where those people make up a large percentage of the population, indicating there is a real issue with cancer within the population and the investment of resources to address that issue could have the greatest benefits.LIMITATIONS1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of cancer, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of cancer.TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:Health and wellbeing statistics (GP-level, England): Missing data and potential outliersLevels of obesity, inactivity and associated illnesses (England): Missing dataDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.MSOA boundaries: © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021.Population data: Mid-2019 (June 30) Population Estimates for Middle Layer Super Output Areas in England and Wales. © Office for National Statistics licensed under the Open Government Licence v3.0. © Crown Copyright 2020.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital; © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021. © Crown Copyright 2020.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  11. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bankhttp://worldbank.org/
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module of to select the households for official interview of the VHFPS survey and the small module households as reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections

    Section 2. Behavior Section 3. Health Section 5. Employment (main respondent) Section 6. Coping Section 7. Safety Nets Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process include available interviewers’ note following each question item, interviewers’ note at the end of the tablet form as well as supervisors’ note during monitoring. The data cleaning process was conducted in following steps: • Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese. • Remove unnecessary variables which were automatically calculated by SurveyCTO • Remove household duplicates in the dataset where the same form is submitted more than once. • Remove observations of households which were not supposed to be interviewed following the identified replacement procedure. • Format variables as their object type (string, integer, decimal, etc.) • Read through interviewers’ note and make adjustment accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down respondents’ answer in detail so that the survey management team will justify and make a decision which code is best suitable for such answer. • Correct data based on supervisors’ note where enumerators entered wrong code. • Recode answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write texts to specify the answer. The data cleaning team checked thoroughly this type of answers to decide whether each answer needed recoding into one of the available categories or just keep the answer originally recorded. In some cases, that answer could be assigned a completely new code if it appeared many times in the survey dataset.
    • Examine data accuracy of outlier values, defined as values that lie outside both 5th and 95th percentiles, by listening to interview recordings. • Final check on matching main dataset with different sections, where information is asked on individual level, are kept in separate data files and in long form. • Label variables using the full question text. • Label variable values where necessary.

  12. c

    Asthma (in persons of all ages): England

    • data.catchmentbasedapproach.org
    Updated Apr 6, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Rivers Trust (2021). Asthma (in persons of all ages): England [Dataset]. https://data.catchmentbasedapproach.org/maps/1c87a458b35d4df38e0744ae039b8e0e_0/about
    Explore at:
    Dataset updated
    Apr 6, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of asthma (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to asthma (in persons of all ages).This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.The percentage of each MSOA’s population (all ages) with asthma was estimated. This was achieved by calculating a weighted average based on:The percentage of the MSOA area that was covered by each GP practice’s catchment areaOf the GPs that covered part of that MSOA: the percentage of registered patients that have that illness The estimated percentage of each MSOA’s population with asthma was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with asthma, within the relevant age range.Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have asthmaB) the NUMBER of people within that MSOA who are estimated to have asthmaAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have asthma, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from asthma, and where those people make up a large percentage of the population, indicating there is a real issue with asthma within the population and the investment of resources to address that issue could have the greatest benefits.LIMITATIONS1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of asthma, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of asthma.TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:Health and wellbeing statistics (GP-level, England): Missing data and potential outliersLevels of obesity, inactivity and associated illnesses (England): Missing dataDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  13. a

    Hypertension (in persons of all ages): England

    • hub.arcgis.com
    • data.catchmentbasedapproach.org
    Updated Apr 7, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Rivers Trust (2021). Hypertension (in persons of all ages): England [Dataset]. https://hub.arcgis.com/maps/theriverstrust::hypertension-in-persons-of-all-ages-england
    Explore at:
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of hypertension (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to hypertension (in persons of all ages).This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.The percentage of each MSOA’s population (all ages) with hypertension was estimated. This was achieved by calculating a weighted average based on:The percentage of the MSOA area that was covered by each GP practice’s catchment areaOf the GPs that covered part of that MSOA: the percentage of registered patients that have that illness The estimated percentage of each MSOA’s population with hypertension was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with hypertension , within the relevant age range.Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have hypertension B) the NUMBER of people within that MSOA who are estimated to have hypertension An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have hypertension , compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from hypertension, and where those people make up a large percentage of the population, indicating there is a real issue with hypertension within the population and the investment of resources to address that issue could have the greatest benefits.LIMITATIONS1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of hypertension, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of hypertension .TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:Health and wellbeing statistics (GP-level, England): Missing data and potential outliersLevels of obesity, inactivity and associated illnesses (England): Missing dataDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  14. a

    Levels of obesity and inactivity related illnesses (physical illnesses):...

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • data.catchmentbasedapproach.org
    • +1more
    Updated Apr 7, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Rivers Trust (2021). Levels of obesity and inactivity related illnesses (physical illnesses): Summary (England) [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/maps/theriverstrust::levels-of-obesity-and-inactivity-related-illnesses-physical-illnesses-summary-england
    Explore at:
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARYThis analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of physical illnesses that are linked with obesity and inactivity. Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.ANALYSIS METHODOLOGYThe analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to:- Asthma (in persons of all ages)- Cancer (in persons of all ages)- Chronic kidney disease (in adults aged 18+)- Coronary heart disease (in persons of all ages)- Diabetes mellitus (in persons aged 17+)- Hypertension (in persons of all ages)- Stroke and transient ischaemic attack (in persons of all ages)This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.For each of the above illnesses, the percentage of each MSOA’s population with that illness was estimated. This was achieved by calculating a weighted average based on:- The percentage of the MSOA area that was covered by each GP practice’s catchment area- Of the GPs that covered part of that MSOA: the percentage of patients registered with each GP that have that illnessThe estimated percentage of each MSOA’s population with each illness was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with each illness, within the relevant age range.For each illness, each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:A) the PERCENTAGE of the population within that MSOA who are estimated to have that illnessB) the NUMBER of people within that MSOA who are estimated to have that illnessAn average of scores A & B was taken, and converted to a relative score between 1 and 0 (1= worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA predicted to have that illness, compared to other MSOAs. In other words, those are areas where a large number of people are predicted to suffer from an illness, and where those people make up a large percentage of the population, indicating there is a real issue with that illness within the population and the investment of resources to address that issue could have the greatest benefits.The scores for each of the 7 illnesses were added together then converted to a relative score between 1 – 0 (1 = worst, 0 = best), to give an overall score for each MSOA: a score close to 1 would indicate that an area has high predicted levels of all obesity/inactivity-related illnesses, and these are areas where the local population could benefit the most from interventions to address those illnesses. A score close to 0 would indicate very low predicted levels of obesity/inactivity-related illnesses and therefore interventions might not be required.LIMITATIONS1. GPs do not have catchments that are mutually exclusive from each other: they overlap, with some geographic areas being covered by 30+ practices. This dataset should be viewed in combination with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset to identify where there are areas that are covered by multiple GP practices but at least one of those GP practices did not provide data. Results of the analysis in these areas should be interpreted with caution, particularly if the levels of obesity/inactivity-related illnesses appear to be significantly lower than the immediate surrounding areas.2. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).3. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.4. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of obesity/inactivity-related illnesses, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of these illnesses. TO BE VIEWED IN COMBINATION WITH:This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:- Health and wellbeing statistics (GP-level, England): Missing data and potential outliersDOWNLOADING THIS DATATo access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  15. Oxygen Exposure for Benthic Megafauna near San

    • kaggle.com
    Updated Feb 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). Oxygen Exposure for Benthic Megafauna near San [Dataset]. https://www.kaggle.com/datasets/thedevastator/oxygen-exposure-for-benthic-megafauna-near-san-d
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Oxygen Exposure for Benthic Megafauna near San Diego

    Spatially Varying Environmental Risk

    By [source]

    About this dataset

    This dataset captures an in-depth look into the environmental conditions of the underwater world off Southern California's coast. It provides invaluable information related to spatial risk variation, such as oxygen exposure levels, depths and habitat criteria of 53 species of benthic and epibenthic megafauna recorded during the three-year study. This data will provide insight into aquatic life dynamics and potentially generate improved management strategies for protecting these vital species. Moreover, due to the importance that waters play within our planet's fragile ecosystem, a proper understanding of their affairs could lead to greater marine sustainability in the long-term. Ultimately, this dataset may help answer our questions about how exactly ocean life is responding to intense human activity and its effects on today's seaside communities

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    • Download and install the dataset: The dataset contains two .csv files, each containing data from the three-year study on oxygen exposure for benthic and epibenthic megafauna off the coast of San Diego in Southern California. Download these two files to your computer and save them for further analysis.

    • Familiarize yourself with the datasets: Each file includes very detailed information about a particular variable related to the study (for example, SpeciesMetadata contains species-level information on 53 species of benthic and epibenthic megafauna). Read through each data sheet carefully in order to gain a better understanding of what's included in each column.

    • Clean up any outliers or missing values: Once you understand which columns are important for your analysis, you can begin cleaning up any outliers or missing values that may be present in your dataset. This is an important step as it will help ensure that further analysis is performed accurately.

    • Choose an appropriate visualization method: Depending on what type of results you want to show from your analysis, choose an appropriate visualization method (e.g., bar plot, scatterplot). Also consider if adding labeling such as color with respect to categories would improve legibility of figures you produce from this dataset during exploratory data analyses stages.

      5) Choose a statistical test suitable for this type of project: Once allyour visuals have been produced its time to interpret results using statistics tests depending on how many categorical variables are presentin the data set (i.e t-test or ANOVA). As well understand key outputs like p_values so experiment could effectively conclude if thereare significant differences between treatmentswhen comparing distributions among samples/populations being studied here.. Be sureto adjust mean size/sample size when performing statistic testsuitably accordingto determining adequate power when selecting applicable tests etc.

    Research Ideas

    • Comparing the effects of different environmental factors (depth, temperature, salinity etc.) on depth-specific distributions of oxygen and benthic megafauna.
    • Identifying and mapping vulnerable areas for benthic species based on environmental factors and oxygen exposure patterns.
    • Developing models to predict underlying spatial risk variables for endangered species to inform conservation efforts in the study area

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: ROVObservationData.csv

    File: SpeciesMetadata.csv

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit .

  16. f

    Data from: S1 Data -

    • plos.figshare.com
    zip
    Updated Mar 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wei Liu; Qian Ning; Guangwei Liu; Haonan Wang; Yixin Zhu; Miao Zhong (2025). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0318431.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Wei Liu; Qian Ning; Guangwei Liu; Haonan Wang; Yixin Zhu; Miao Zhong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traditional subspace feature selection methods typically rely on a fixed distance to compute residuals between the original and feature reconstruction spaces. However, this approach struggles to adapt to diverse datasets and often fails to handle noise and outliers effectively. In this paper, we propose an unsupervised feature selection method named unsupervised feature selection algorithm based on -norm feature reconstruction (NFRFS). Employing a flexible norm to represent both the original space and the spatial distance of feature reconstruction, enhances adaptability and broadens its applicability by adjusting p. Additionally, adaptive graph learning is integrated into the feature selection process to preserve the local geometric structure of the data. Features exhibiting sparsity and low redundancy are selected through the regularization constraint of the inner product in the feature selection matrix. To demonstrate the effectiveness of the method, numerical studies were conducted on 14 benchmark datasets. Our results indicate that the method outperforms 10 unsupervised feature selection algorithms in terms of clustering performance.

  17. f

    S1 Data -

    • figshare.com
    txt
    Updated Jan 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Sajid; Kaleem Razzaq Malik; Ali Haider Khan; Sajid Iqbal; Abdullah A. Alaulamie; Qazi Mudassar Ilyas (2025). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0307718.s001
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Muhammad Sajid; Kaleem Razzaq Malik; Ali Haider Khan; Sajid Iqbal; Abdullah A. Alaulamie; Qazi Mudassar Ilyas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes, a chronic metabolic condition characterised by persistently high blood sugar levels, necessitates early detection to mitigate its risks. Inadequate dietary choices can contribute to various health complications, emphasising the importance of personalised nutrition interventions. However, real-time selection of diets tailored to individual nutritional needs is challenging because of the intricate nature of foods and the abundance of dietary sources. Because diabetes is a chronic condition, patients with this illness must choose a healthy diet. Patients with diabetes frequently need to visit their doctor and rely on expensive medications to manage their condition. It is challenging to purchase medication for chronic illnesses on a regular basis in underdeveloped nations. Motivated by this concept, we suggest a hybrid model that, rather than depending solely on medication to evade a visit to the doctor, can first anticipate diabetes and then suggest a diet and exercise regimen. This research proposes an optimized approach by harnessing machine learning classifiers, including Random Forest, Support Vector Machine, and XGBoost, to develop a robust framework for accurate diabetes prediction. The study addresses the difficulties in predicting diabetes precisely from limited labeled data and outliers in diabetes datasets. Furthermore, a thorough food and exercise recommender system is unveiled, offering individualized and health-conscious nutrition recommendations based on user preferences and medical information. Leveraging efficient learning and inference techniques, the study achieves a meager error rate of less than 30% using an extensive dataset comprising over 100 million user-rated foods. This research underscores the significance of integrating machine learning classifiers with personalized nutritional recommendations to enhance diabetes prediction and management. The proposed framework has substantial potential to facilitate early detection, provide tailored dietary guidance, and alleviate the economic burden associated with diabetes-related healthcare expenses.

  18. f

    Data from: Investigating the contributors to hit-and-run crashes using...

    • figshare.com
    xlsx
    Updated Oct 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gen Li (2024). Investigating the contributors to hit-and-run crashes using gradient boosting decision trees [Dataset]. http://doi.org/10.6084/m9.figshare.27178305.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    figshare
    Authors
    Gen Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper uses the 2021 traffic crash data from the NHTSA CRSS as a sample for model training and validation. The CRSS data collects crash report data provided by police departments from all 50 states in the United States. It details various factors of each traffic crash, including crash information, driver information, vehicle information, road information, and environmental information.The crash accident data provided by CRSS include crash-related details such as the location, time, cause, type of crash, driver’s age, gender, attention level, injury status, risky driving behavior, vehicle type, usage, damage, and hit-and-run situations. However, due to the separate recording of the dataset and the presence of systematic errors and redundant information, the CRSS 2021 data undergo the following merging and filtering processes:1) Match and merge separately recorded data based on the unique case number "CASENUM" in the dataset.2) Records with missing values in critical variables (e.g., whether the crash involved a hit-and-run) were removed to avoid bias in the analysis. For non-critical variables, missing values were imputed using the mean or mode depending on the variable type. For continuous variables, such as speed limits, we used mean imputation. For categorical variables (e.g., weather, road surface conditions), mode imputation was applied.3) Noise in the dataset arises from both human error in crash reporting and random fluctuations in recorded variables. We used z-scores to detect and remove extreme outliers in numerical variables (e.g., speed limits, crash angle). Data points with a z-score beyond ±3 standard deviations were considered outliers and were excluded from the analysis. To handle noisy fluctuations in continuous variables (e.g., speed limits), we applied a symmetrical exponential moving average (EMA) filter.After processing, the CRSS 2021 data include a total of 54,187 crash accidents, among which there are 5,944 hit-and-run accidents, accounting for 10.97% of crash accidents. The hit-and-run and non-hit-and-run categories face a serious class imbalance issue, and data balancing processing is applied to the target variable during parameter calibration. Hit-and-run crashes constitute a relatively small proportion of total crashes in the dataset, leading to class imbalance in the binary classification target. To address this issue, we utilized the resampling techniques available in the data mining software. Specifically, random undersampling was applied to the majority class (non-hit-and-run crashes), while Synthetic Minority Over-sampling Technique (SMOTE) was used for the minority class. This ensured balanced class distribution in the training set, improving model performance and preventing the classifier from being biased toward the majority class.

  19. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion

Data from: Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion

Related Article
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description

Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

Search
Clear search
Close search
Google apps
Main menu