32 datasets found
  1. d

    Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Apr 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

  2. f

    DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...

    • frontiersin.figshare.com
    pdf
    Updated Oct 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.

  3. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 17, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for Outliers Detection task.The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).We build MNIST4OD in the following way:To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such as their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 X 28) into vectors.Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data contains also a column which indicates the original image class (0-9).See the following numbers for a complete list of the statistics of each datasets ( Name | Instances | Dimensions | Number of Outliers in % ):MNIST_0 | 7594 | 784 | 10MNIST_1 | 8665 | 784 | 10MNIST_2 | 7689 | 784 | 10MNIST_3 | 7856 | 784 | 10MNIST_4 | 7507 | 784 | 10MNIST_5 | 6945 | 784 | 10MNIST_6 | 7564 | 784 | 10MNIST_7 | 8023 | 784 | 10MNIST_8 | 7508 | 784 | 10MNIST_9 | 7654 | 784 | 10

  4. d

    Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.

  5. d

    Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  6. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • zenodo.org
    • elki-project.github.io
    • +1more
    application/gzip
    Updated May 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2022
    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
    Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
    In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek
    Evaluation of Multiple Clustering Solutions
    In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
    On Evaluation of Outlier Rankings and Outlier Scores
    In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    Feature typeDescriptionFiles
    Object numberSparse 1000 dimensional vectors that give the true object assignmentobjs.arff.gz
    RGB color histogramsStandard RGB color histograms (uniform binning)aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
    HSV color histogramsStandard HSV/HSB color histograms in various binningsaloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
    Color similiarityAverage similarity to 77 reference colors (not histograms) 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
    Haralick featuresFirst 13 Haralick features (radius 1 pixel)aloi-haralick-1.csv.gz
    Front to backVectors representing front face vs. back faces of individual objectsfront.arff.gz
    Basic lightVectors indicating basic light situationslight.arff.gz
    Manual annotationsManually annotated object groups of semantically related objects such as cupsmanual1.arff.gz

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    Feature typeDescriptionFiles
    RGB HistogramsDownsampled to 100000 objects (553 outliers)aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
    Downsampled to 75000 objects (717 outliers)aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
    Downsampled to 50000 objects (1508 outliers)aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz
  7. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Updated Jun 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Canada, Germany, Mexico, United States, Global
    Description

    Snapshot img

    Anomaly Detection Market Size 2025-2029

    The anomaly detection market size is forecast to increase by USD 4.44 billion at a CAGR of 14.4% between 2024 and 2029.

    The market is experiencing significant growth, particularly in the BFSI sector, as organizations increasingly prioritize identifying and addressing unusual patterns or deviations from normal business operations. The rising incidence of internal threats and cyber frauds necessitates the implementation of advanced anomaly detection tools to mitigate potential risks and maintain security. However, implementing these solutions comes with challenges, primarily infrastructural requirements. Ensuring compatibility with existing systems, integrating new technologies, and training staff to effectively utilize these tools pose significant hurdles for organizations.
    Despite these challenges, the potential benefits of anomaly detection, such as improved risk management, enhanced operational efficiency, and increased security, make it an essential investment for businesses seeking to stay competitive and agile in today's complex and evolving threat landscape. Companies looking to capitalize on this market opportunity must carefully consider these challenges and develop strategies to address them effectively. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.
    

    What will be the Size of the Anomaly Detection Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    Request Free Sample

    In the dynamic and evolving market, advanced technologies such as resource allocation, linear regression, pattern recognition, and support vector machines are increasingly being adopted for automated decision making. Businesses are leveraging these techniques to enhance customer experience through behavioral analytics, object detection, and sentiment analysis. Machine learning algorithms, including random forests, naive Bayes, decision trees, clustering algorithms, and k-nearest neighbors, are essential tools for risk management and compliance monitoring. AI-powered analytics, time series forecasting, and predictive modeling are revolutionizing business intelligence, while process optimization is achieved through the application of decision support systems, natural language processing, and predictive analytics.
    Computer vision, image recognition, logistic regression, and operational efficiency are key areas where principal component analysis and artificial technoogyneural networks contribute significantly. Speech recognition and operational efficiency are also benefiting from these advanced technologies, enabling businesses to streamline processes and improve overall performance.
    

    How is this Anomaly Detection Industry segmented?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
    
      Cloud
      On-premises
    
    
    Component
    
      Solution
      Services
    
    
    End-user
    
      BFSI
      IT and telecom
      Retail and e-commerce
      Manufacturing
      Others
    
    
    Technology
    
      Big data analytics
      AI and ML
      Data mining and business intelligence
    
    
    Geography
    
      North America
    
        US
        Canada
        Mexico
    
    
      Europe
    
        France
        Germany
        Spain
        UK
    
    
      APAC
    
        China
        India
        Japan
    
    
      Rest of World (ROW)
    

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing adoption of advanced technologies such as machine learning models, statistical methods, and real-time monitoring. These technologies enable the identification of anomalous behavior in real-time, thereby enhancing network security and data privacy. Anomaly detection algorithms, including unsupervised learning, reinforcement learning, and deep learning networks, are used to identify outliers and intrusions in large datasets. Data security is a major concern, leading to the adoption of data masking, data pseudonymization, data de-identification, and differential privacy.

    Data leakage prevention and incident response are critical components of an effective anomaly detection system. False positive and false negative rates are essential metrics to evaluate the performance of these systems. Time series analysis and concept drift are important techniques used in anomaly detection. Data obfuscation, data suppression, and data aggregation are other strategies employed to maintain data privacy. Companies such as Anodot, Cisco Systems Inc, IBM Corp, and SAS Institute Inc offer both cloud-based and on-premises anomaly detection solutions. These soluti

  8. f

    KAERI dataset.

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jihwan Lee; Nam-Wook Cho (2023). KAERI dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jihwan Lee; Nam-Wook Cho
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KAERI dataset.

  9. Privacy Preserving Outlier Detection through Random Nonlinear Data...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    Updated Feb 18, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    nasa.gov (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

  10. c

    Data from: PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM...

    • s.cnmilf.com
    • data.nasa.gov
    • +2more
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/padmini-a-peer-to-peer-distributed-astronomy-data-mining-system-and-a-case-study
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY TUSHAR MAHULE, KIRK BORNE, SANDIPAN DEY, SUGANDHA ARORA, AND HILLOL KARGUPTA** Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI—a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a webbased system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. This paper offers a case study of the PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.

  11. g

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

    • gimi9.com
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +3more
    Updated Sep 24, 2010
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2010). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://gimi9.com/dataset/data-gov_distributed-anomaly-detection-using-satellite-data-from-multiple-modalities/
    Explore at:
    Dataset updated
    Sep 24, 2010
    Description

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES KANISHKA BHADURI*, KAMALIKA DAS**, AND PETR VOTAVA*** Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets ate physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.

  12. f

    Experiment Datasets.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jihwan Lee; Nam-Wook Cho (2023). Experiment Datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jihwan Lee; Nam-Wook Cho
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experiment Datasets.

  13. f

    LOF calculation time (seconds) comparison.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jihwan Lee; Nam-Wook Cho (2023). LOF calculation time (seconds) comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jihwan Lee; Nam-Wook Cho
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LOF calculation time (seconds) comparison.

  14. d

    Distributed Anomaly Detection Using Satellite Data From Multiple Modalities

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Distributed Anomaly Detection Using Satellite Data From Multiple Modalities [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-satellite-data-from-multiple-modalities-cf764
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.

  15. d

    Synthetic temporal dataset for temporal trend analysis and retrieval

    • search.dataone.org
    • datadryad.org
    Updated May 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jing Ao; Kara Schatz; Rada Chirkova (2024). Synthetic temporal dataset for temporal trend analysis and retrieval [Dataset]. http://doi.org/10.5061/dryad.q573n5trf
    Explore at:
    Dataset updated
    May 10, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jing Ao; Kara Schatz; Rada Chirkova
    Time period covered
    May 7, 2024
    Description

    This repository contains a synthetic, temporal data set that was generated by the authors by sampling values from the Gaussian distribution. The dataset contains eight nontemporal dimensions, a temporal dimension, and a numerical measure attribute. The data set was generated according to the scheme and procedure detailed in this source paper: Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham. The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be o..., , , # Synthetic temporal dataset for temporal trend analysis and retrieval

    https://doi.org/10.5061/dryad.q573n5trf

    The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set, e.g., retrieving outlier temporal trends using an outlier detector.Â

    We generated the synthetic temporal data set [1], which contains up to 8 nontemporal dimensions, one temporal dimension, and a nume...

  16. u

    Association analysis of high-low outlier road intersection pedestrian...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection pedestrian crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25976875.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection pedestrian crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-low" outlier pedestrian road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,021, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 37,8 KBNumber of Files: The dataset contains a total of 624 association rulesDate Created: 24th May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection pedestrian crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier road intersection pedestrian crashes were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2021 (2020 data omitted)

  17. u

    Association analysis of high-low outlier road intersection crashes within...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v2
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 675 KBNumber of Files: The dataset contains a total of 10212 association rulesDate Created: 23rd May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2021 (2020 data omitted)

  18. d

    Orca

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Apr 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Orca [Dataset]. https://catalog.data.gov/dataset/orca
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Orca is a data-driven, unsupervised anomaly detection algorithm that uses a distance-based approach. It uses a novel pruning rule that allows it to run in nearly linear time. Orca was co-developed by Stephen Bay of ISLE and Mark Schwabacher of NASA ARC. More information about Orca, including downloadable software, can be found here: http://stephenbay.net/orca/ A conference paper about Orca can be found here: https://dashlink.arc.nasa.gov/paper/mining-distance-based-outliers-in-near-linear-time/

  19. n

    Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Nov 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), etc., has been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Naive Bayes multinomial algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospitals data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier which has 100% accuracy and the RF which also has 100% accuracy. Methods Prior to collection of data, the researcher was be guided by all ethical training certification on data collection, right to confidentiality and privacy reserved called Institutional Review Board (IRB). Data was be collected from the manual archive of the Hospitals purposively selected using stratified sampling technique, transform the data to electronic form and store in MYSQL database called malaria. Each patient file was extracted and review for signs and symptoms of malaria then check for laboratory confirmation result from diagnosis. The data was be divided into two tables: the first table was called data1 which contain data for use in phase 1 of the classification, while the second table data2 which contains data for use in phase 2 of the classification. Data Source Collection Malaria incidence data set is obtained from Public hospitals from 2017 to 2021. These are the data used for modeling and analysis. Also, putting in mind the geographical location and socio-economic factors inclusive which are available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading accordingly. Data Preprocessing: Data preprocessing shall be done to remove noise and outlier. Transformation: The data shall be transformed from analog to electronic record. Data Partitioning The data which shall be collected will be divided into two portions; one portion of the data shall be extracted as a training set, while the other portion will be used for testing. The training portion shall be taken from a table stored in a database and will be called data which is training set1, while the training portion taking from another table store in a database is shall be called data which is training set2. The dataset was split into two parts: a sample containing 70% of the training data and 30% for the purpose of this research. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. On the 30% remaining data, the resulting models were tested, and the results were compared with the other Machine Learning models using the standard metrics. Classification and prediction: Base on the nature of variable in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques; Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows: i. Data collection and preprocessing shall be done. ii. Preprocess data shall be stored in a training set 1 and training set 2. These datasets shall be used during classification. iii. Test data set is shall be stored in database test data set. iv. Part of the test data set must be compared for classification using classifier 1 and the remaining part must be classified with classifier 2 as follows: Classifier phase 1: It classify into positive or negative classes. If the patient is having malaria, then the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
    Classifier phase 2: It classify only data set that has been classified as positive by classifier 1, and then further classify them into complicated and uncomplicated class label. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters as a determining factor should supply their value.

  20. u

    Association analysis of high-low outlier road intersection crashes involving...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes involving motorcycles that resulted in injuries within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25975882.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes involving motorcycles (Motor tricycle, Motorcycle: under 125cc, Motorcycle: Above 125cc, Quadru-cycle) that have resulted in injuries recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier motorcycle road intersection crashes resulting in injuries for the years 2017, 2018 and 2019.The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,202, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 38,8 KBNumber of Files: The dataset contains a total of 426 association rulesDate Created: 23rd May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes involving a motorcycle resulting in injuries that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier road intersection motorcycle crashes resulting in injuries were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2019

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion

Data from: Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion

Related Article
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description

Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

Search
Clear search
Close search
Google apps
Main menu