32 datasets found

d
Data from: Privacy Preserving Outlier Detection through Random Nonlinear...
catalog.data.gov
data.amerigeoss.org
Updated Apr 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
f
DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...
frontiersin.figshare.com
pdf
Updated Oct 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fphys.2023.1233341.s001
Dataset updated
Oct 13, 2023
Dataset provided by
Frontiers
Authors
Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
MNIST dataset for Outliers Detection - [ MNIST4OD ]
figshare.com
application/gzip
Updated May 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.9954986.v2
Dataset updated
May 17, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Giovanni Stilo; Bardh Prenkaj
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for Outliers Detection task.The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).We build MNIST4OD in the following way:To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such as their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 X 28) into vectors.Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data contains also a column which indicates the original image class (0-9).See the following numbers for a complete list of the statistics of each datasets ( Name | Instances | Dimensions | Number of Outliers in % ):MNIST_0 | 7594 | 784 | 10MNIST_1 | 8665 | 784 | 10MNIST_2 | 7689 | 784 | 10MNIST_3 | 7856 | 784 | 10MNIST_4 | 7507 | 784 | 10MNIST_5 | 6945 | 784 | 10MNIST_6 | 7564 | 784 | 10MNIST_7 | 8023 | 784 | 10MNIST_8 | 7508 | 784 | 10MNIST_9 | 7654 | 784 | 10
d
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
d
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

zenodo.org
elki-project.github.io
+1more

application/gzip

Updated May 2, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684

Explore at:

application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6355684

Dataset updated

May 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

2022

Description

These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type	Description	Files
Object number	Sparse 1000 dimensional vectors that give the true object assignment	objs.arff.gz
RGB color histograms	Standard RGB color histograms (uniform binning)	aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
HSV color histograms	Standard HSV/HSB color histograms in various binnings	aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
Color similiarity	Average similarity to 77 reference colors (not histograms) 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)	aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features	First 13 Haralick features (radius 1 pixel)	aloi-haralick-1.csv.gz
Front to back	Vectors representing front face vs. back faces of individual objects	front.arff.gz
Basic light	Vectors indicating basic light situations	light.arff.gz
Manual annotations	Manually annotated object groups of semantically related objects such as cups	manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type	Description	Files
RGB Histograms	Downsampled to 100000 objects (553 outliers)	aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
	Downsampled to 75000 objects (717 outliers)	aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
	Downsampled to 50000 objects (1508 outliers)	aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz

Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

technavio.com

Updated Jun 18, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis

Explore at:

Dataset updated

Jun 18, 2025

Dataset provided by

TechNavio

Authors

Technavio

Time period covered

2021 - 2025

Area covered

Canada, Germany, Mexico, United States, Global

Description

Snapshot img

Anomaly Detection Market Size 2025-2029

The anomaly detection market size is forecast to increase by USD 4.44 billion at a CAGR of 14.4% between 2024 and 2029.

The market is experiencing significant growth, particularly in the BFSI sector, as organizations increasingly prioritize identifying and addressing unusual patterns or deviations from normal business operations. The rising incidence of internal threats and cyber frauds necessitates the implementation of advanced anomaly detection tools to mitigate potential risks and maintain security. However, implementing these solutions comes with challenges, primarily infrastructural requirements. Ensuring compatibility with existing systems, integrating new technologies, and training staff to effectively utilize these tools pose significant hurdles for organizations.
Despite these challenges, the potential benefits of anomaly detection, such as improved risk management, enhanced operational efficiency, and increased security, make it an essential investment for businesses seeking to stay competitive and agile in today's complex and evolving threat landscape. Companies looking to capitalize on this market opportunity must carefully consider these challenges and develop strategies to address them effectively. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.

What will be the Size of the Anomaly Detection Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample

In the dynamic and evolving market, advanced technologies such as resource allocation, linear regression, pattern recognition, and support vector machines are increasingly being adopted for automated decision making. Businesses are leveraging these techniques to enhance customer experience through behavioral analytics, object detection, and sentiment analysis. Machine learning algorithms, including random forests, naive Bayes, decision trees, clustering algorithms, and k-nearest neighbors, are essential tools for risk management and compliance monitoring. AI-powered analytics, time series forecasting, and predictive modeling are revolutionizing business intelligence, while process optimization is achieved through the application of decision support systems, natural language processing, and predictive analytics.
Computer vision, image recognition, logistic regression, and operational efficiency are key areas where principal component analysis and artificial technoogyneural networks contribute significantly. Speech recognition and operational efficiency are also benefiting from these advanced technologies, enabling businesses to streamline processes and improve overall performance.

How is this Anomaly Detection Industry segmented?

The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment

  Cloud
  On-premises


Component

  Solution
  Services


End-user

  BFSI
  IT and telecom
  Retail and e-commerce
  Manufacturing
  Others


Technology

  Big data analytics
  AI and ML
  Data mining and business intelligence


Geography

  North America

    US
    Canada
    Mexico


  Europe

    France
    Germany
    Spain
    UK


  APAC

    China
    India
    Japan


  Rest of World (ROW)

By Deployment Insights

The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing adoption of advanced technologies such as machine learning models, statistical methods, and real-time monitoring. These technologies enable the identification of anomalous behavior in real-time, thereby enhancing network security and data privacy. Anomaly detection algorithms, including unsupervised learning, reinforcement learning, and deep learning networks, are used to identify outliers and intrusions in large datasets. Data security is a major concern, leading to the adoption of data masking, data pseudonymization, data de-identification, and differential privacy.

Data leakage prevention and incident response are critical components of an effective anomaly detection system. False positive and false negative rates are essential metrics to evaluate the performance of these systems. Time series analysis and concept drift are important techniques used in anomaly detection. Data obfuscation, data suppression, and data aggregation are other strategies employed to maintain data privacy. Companies such as Anodot, Cisco Systems Inc, IBM Corp, and SAS Institute Inc offer both cloud-based and on-premises anomaly detection solutions. These soluti

f
KAERI dataset.
plos.figshare.com
xls
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jihwan Lee; Nam-Wook Cho (2023). KAERI dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0165972.t002
Dataset updated
Jun 15, 2023
Dataset provided by
PLOS ONE
Authors
Jihwan Lee; Nam-Wook Cho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
KAERI dataset.
Privacy Preserving Outlier Detection through Random Nonlinear Data...
data.staging.idas-ds1.appdat.jsc.nasa.gov
data.nasa.gov
Updated Feb 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
c
Data from: PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM...
s.cnmilf.com
data.nasa.gov
+2more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/padmini-a-peer-to-peer-distributed-astronomy-data-mining-system-and-a-case-study
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY TUSHAR MAHULE, KIRK BORNE, SANDIPAN DEY, SUGANDHA ARORA, AND HILLOL KARGUPTA** Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI—a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a webbased system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. This paper offers a case study of the PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.
g
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES
gimi9.com
data.staging.idas-ds1.appdat.jsc.nasa.gov
+3more
Updated Sep 24, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2010). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://gimi9.com/dataset/data-gov_distributed-anomaly-detection-using-satellite-data-from-multiple-modalities/
Explore at:
Dataset updated
Sep 24, 2010
Description
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES KANISHKA BHADURI*, KAMALIKA DAS**, AND PETR VOTAVA*** Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets ate physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
f
Experiment Datasets.
plos.figshare.com
xls
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jihwan Lee; Nam-Wook Cho (2023). Experiment Datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0165972.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Jihwan Lee; Nam-Wook Cho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Experiment Datasets.
f
LOF calculation time (seconds) comparison.
plos.figshare.com
xls
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jihwan Lee; Nam-Wook Cho (2023). LOF calculation time (seconds) comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0165972.t003
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Jihwan Lee; Nam-Wook Cho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LOF calculation time (seconds) comparison.
d
Distributed Anomaly Detection Using Satellite Data From Multiple Modalities
catalog.data.gov
datasets.ai
+3more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Distributed Anomaly Detection Using Satellite Data From Multiple Modalities [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-satellite-data-from-multiple-modalities-cf764
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
d
Synthetic temporal dataset for temporal trend analysis and retrieval
search.dataone.org
datadryad.org
Updated May 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jing Ao; Kara Schatz; Rada Chirkova (2024). Synthetic temporal dataset for temporal trend analysis and retrieval [Dataset]. http://doi.org/10.5061/dryad.q573n5trf
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.q573n5trf
Dataset updated
May 10, 2024
Dataset provided by
Dryad Digital Repository
Authors
Jing Ao; Kara Schatz; Rada Chirkova
Time period covered
May 7, 2024
Description
This repository contains a synthetic, temporal data set that was generated by the authors by sampling values from the Gaussian distribution. The dataset contains eight nontemporal dimensions, a temporal dimension, and a numerical measure attribute. The data set was generated according to the scheme and procedure detailed in this source paper: Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham.Â The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be o..., , , # Synthetic temporal dataset for temporal trend analysis and retrieval

https://doi.org/10.5061/dryad.q573n5trf

The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set, e.g., retrieving outlier temporal trends using an outlier detector.Â

We generated the synthetic temporal data set [1], which contains up to 8 nontemporal dimensions, one temporal dimension, and a nume...
u
Association analysis of high-low outlier road intersection pedestrian...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection pedestrian crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25976875.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.25375/uct.25976875.v1
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection pedestrian crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-low" outlier pedestrian road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,021, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 37,8 KBNumber of Files: The dataset contains a total of 624 association rulesDate Created: 24th May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection pedestrian crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier road intersection pedestrian crashes were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2021 (2020 data omitted)
u
Association analysis of high-low outlier road intersection crashes within...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v2
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.25375/uct.25975741.v2
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 675 KBNumber of Files: The dataset contains a total of 10212 association rulesDate Created: 23rd May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2021 (2020 data omitted)
d
Orca
catalog.data.gov
data.amerigeoss.org
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Orca [Dataset]. https://catalog.data.gov/dataset/orca
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Orca is a data-driven, unsupervised anomaly detection algorithm that uses a distance-based approach. It uses a novel pruning rule that allows it to run in nearly linear time. Orca was co-developed by Stephen Bay of ISLE and Mark Schwabacher of NASA ARC. More information about Orca, including downloadable software, can be found here: http://stephenbay.net/orca/ A conference paper about Orca can be found here: https://dashlink.arc.nasa.gov/paper/mining-distance-based-outliers-in-near-linear-time/
n
Malaria disease and grading system dataset from public hospitals reflecting...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Nov 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.4xgxd25gn
Dataset updated
Nov 10, 2023
Dataset provided by
Nasarawa State University
Authors
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), etc., has been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Naive Bayes multinomial algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospitals data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier which has 100% accuracy and the RF which also has 100% accuracy. Methods Prior to collection of data, the researcher was be guided by all ethical training certification on data collection, right to confidentiality and privacy reserved called Institutional Review Board (IRB). Data was be collected from the manual archive of the Hospitals purposively selected using stratified sampling technique, transform the data to electronic form and store in MYSQL database called malaria. Each patient file was extracted and review for signs and symptoms of malaria then check for laboratory confirmation result from diagnosis. The data was be divided into two tables: the first table was called data1 which contain data for use in phase 1 of the classification, while the second table data2 which contains data for use in phase 2 of the classification. Data Source Collection Malaria incidence data set is obtained from Public hospitals from 2017 to 2021. These are the data used for modeling and analysis. Also, putting in mind the geographical location and socio-economic factors inclusive which are available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading accordingly. Data Preprocessing: Data preprocessing shall be done to remove noise and outlier. Transformation: The data shall be transformed from analog to electronic record. Data Partitioning The data which shall be collected will be divided into two portions; one portion of the data shall be extracted as a training set, while the other portion will be used for testing. The training portion shall be taken from a table stored in a database and will be called data which is training set1, while the training portion taking from another table store in a database is shall be called data which is training set2. The dataset was split into two parts: a sample containing 70% of the training data and 30% for the purpose of this research. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. On the 30% remaining data, the resulting models were tested, and the results were compared with the other Machine Learning models using the standard metrics. Classification and prediction: Base on the nature of variable in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques; Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows: i. Data collection and preprocessing shall be done. ii. Preprocess data shall be stored in a training set 1 and training set 2. These datasets shall be used during classification. iii. Test data set is shall be stored in database test data set. iv. Part of the test data set must be compared for classification using classifier 1 and the remaining part must be classified with classifier 2 as follows: Classifier phase 1: It classify into positive or negative classes. If the patient is having malaria, then the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classify only data set that has been classified as positive by classifier 1, and then further classify them into complicated and uncomplicated class label. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters as a determining factor should supply their value.
u
Association analysis of high-low outlier road intersection crashes involving...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes involving motorcycles that resulted in injuries within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25975882.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.25375/uct.25975882.v1
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection crashes involving motorcycles (Motor tricycle, Motorcycle: under 125cc, Motorcycle: Above 125cc, Quadru-cycle) that have resulted in injuries recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier motorcycle road intersection crashes resulting in injuries for the years 2017, 2018 and 2019.The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,202, with entries presented in descending order.Data SpecificsData Type: Geospatial-temporal categorical dataFile Format: Excel document (.xlsx)Size: 38,8 KBNumber of Files: The dataset contains a total of 426 association rulesDate Created: 23rd May 2024MethodologyData Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network InformationSoftware: ArcGIS Pro, PythonProcessing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes involving a motorcycle resulting in injuries that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier road intersection motorcycle crashes resulting in injuries were extracted for inclusion in this dataset.Geospatial InformationSpatial Coverage:West Bounding Coordinate: 18°20'EEast Bounding Coordinate: 19°05'ENorth Bounding Coordinate: 33°25'SSouth Bounding Coordinate: 34°25'SCoordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projectionTemporal InformationTemporal Coverage:Start Date: 01/01/2017End Date: 31/12/2019

Facebook

Twitter

Click to copy link

Link copied

Cite

Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion

Data from: Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion

Explore at:

Dataset updated

Apr 10, 2025

Dataset provided by

Dashlink

Description

Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.

Clear search

Close search

Google apps

Main menu

Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...

MNIST dataset for Outliers Detection - [ MNIST4OD ]

Data from: Mining Distance-Based Outliers in Near Linear Time

Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

Snapshot img

KAERI dataset.

Privacy Preserving Outlier Detection through Random Nonlinear Data...

Data from: PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM...

DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

Experiment Datasets.

LOF calculation time (seconds) comparison.

Distributed Anomaly Detection Using Satellite Data From Multiple Modalities

Synthetic temporal dataset for temporal trend analysis and retrieval

Association analysis of high-low outlier road intersection pedestrian...

Association analysis of high-low outlier road intersection crashes within...

Orca

Malaria disease and grading system dataset from public hospitals reflecting...

Association analysis of high-low outlier road intersection crashes involving...

Data from: Privacy Preserving Outlier Detection through Random Nonlinear Data DistortionSee More Versions

Data from: Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion