19 datasets found

Data from: Privacy Preserving Outlier Detection through Random Nonlinear...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+2 more
application/rdfxml +5
Updated Jun 26, 2018
Cite
(2018). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://data.nasa.gov/w/hdqp-dua8/default?cur=An0rOJGOjg-&from=AXc_nh0m3UE
Explore at:
Available download formats: tsv, csv, application/rssxml, application/rdfxml, json, xml
Dataset updated
Jun 26, 2018
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
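The abstract above describes a random transformation followed by a sigmoid, with outlier detection run on the distorted data. A minimal sketch of that idea, under stated assumptions: the function names `distort` and `knn_distance_scores`, the Gaussian random weights, and the k-th-nearest-neighbour score are illustrative choices, not the paper's actual construction or parameters.

```python
import math
import random


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def distort(rows, out_dim, seed=0):
    """Apply a random linear map followed by an element-wise sigmoid.

    Assumption: weights are drawn from a standard Gaussian and would be
    kept secret by the data owner.
    """
    rng = random.Random(seed)
    in_dim = len(rows[0])
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return [
        [sigmoid(sum(w * x for w, x in zip(w_row, row))) for w_row in weights]
        for row in rows
    ]


def knn_distance_scores(rows, k=1):
    """Score each row by the distance to its k-th nearest neighbour."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    return scores
```

A data miner would call `knn_distance_scores(distort(private_rows, out_dim))` and never see `private_rows`. How well the ranking of outliers survives the distortion depends on the data and on the degree of nonlinearity, which is the trade-off the abstract says the paper bounds.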
Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open...
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Feb 19, 2025
+ more versions
Cite
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Feb 19, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
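The nested loop with random ordering and a pruning rule can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name `top_outliers`, the use of the k-th nearest-neighbour distance as the outlier score, and the parameters `k` and `n` are choices made here for the example.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return the n points with the largest k-th nearest-neighbour distance."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random order makes early pruning likely
    cutoff = 0.0  # score of the weakest outlier kept so far
    top = []      # min-heap of (score, index), at most n entries
    for i in order:
        nearest = []  # negated distances: max-heap of the k smallest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the running k-th NN distance can only shrink, so
            # once it drops below the cutoff this point cannot be a top outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

Most non-outliers find k close neighbours after scanning only a few examples and are dropped by the inner `break`, which is the effect the abstract credits for the near linear scaling.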
Data from: PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+2 more
application/rdfxml +5
Updated Jun 26, 2018
+ more versions
Cite
(2018). PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY [Dataset]. https://data.nasa.gov/dataset/PADMINI-A-PEER-TO-PEER-DISTRIBUTED-ASTRONOMY-DATA-/r38j-jwis
Explore at:
Available download formats: csv, xml, application/rdfxml, application/rssxml, tsv, json
Dataset updated
Jun 26, 2018
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY

TUSHAR MAHULE*, KIRK BORNE**, SANDIPAN DEY*, SUGANDHA ARORA*, AND HILLOL KARGUPTA***

Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI, a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a web-based system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. The case study evaluates the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.
Experiment Datasets.
plos.figshare.com
xls
Updated Jun 1, 2023
+ more versions
Cite
Jihwan Lee; Nam-Wook Cho (2023). Experiment Datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0165972.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0165972.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Jihwan Lee; Nam-Wook Cho
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Experiment Datasets.
ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...
data.niaid.nih.gov
elki-project.github.io
+1 more
Updated May 2, 2024
Cite
Schubert, Erich (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6355683
Explore at:
Dataset updated
May 2, 2024
Dataset provided by
Zimek, Arthur
Schubert, Erich
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek. Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type | Description | Files
Object number | Sparse 1000 dimensional vectors that give the true object assignment | objs.arff.gz
RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz
Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz
Basic light | Vectors indicating basic light situations | light.arff.gz
Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type | Description | Files
RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
RGB Histograms | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
RGB Histograms | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
Association analysis of high-low outlier road intersection crashes involving...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
+ more versions
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes involving public transport within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25976179.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.25375/uct.25976179.v1
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection crashes involving public transport (Bus, Bus-train, Combi/minibusses, midibusses) recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-low" outlier public transport road intersection crashes for the years 2017, 2018, 2019, and 2021.

The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,17, with entries presented in descending order.

Data Specifics
Data Type: Geospatial-temporal categorical data
File Format: Excel document (.xlsx)
Size: 65,9 KB
Number of Files: The dataset contains a total of 1280 association rules
Date Created: 23rd May 2024

Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Python
Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes involving public transport that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier road intersection public transport crashes were extracted for inclusion in this dataset.

Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2021 (2020 data omitted)
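The support filter described in the processing steps, keeping attribute combinations that occur in at least 10% of crashes, can be illustrated with a short sketch. The function name `frequent_combinations`, the attribute names in the usage note, and the `max_size` limit are hypothetical; the dataset description does not say which Python library or settings the authors used beyond the 0,10 support value.

```python
from collections import Counter
from itertools import combinations


def frequent_combinations(records, min_support=0.10, max_size=2):
    """Return (combination, support) pairs with support >= min_support.

    Each record is a dict mapping an attribute name to its value; support is
    the fraction of records that contain every (attribute, value) pair in
    the combination.
    """
    counts = Counter()
    for record in records:
        items = sorted(record.items())
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    total = len(records)
    frequent = {c: n / total for c, n in counts.items()
                if n / total >= min_support}
    # Descending support, matching the ordering the dataset describes.
    return sorted(frequent.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with made-up records such as `{"light": "day", "vehicle": "bus"}`, a combination like `(("light", "day"), ("vehicle", "bus"))` is kept only if it appears in at least `min_support` of all records.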
LOF calculation time (seconds) comparison.
plos.figshare.com
xls
Updated May 31, 2023
Cite
LOF calculation time (seconds) comparison. [Dataset]. https://plos.figshare.com/articles/dataset/LOF_calculation_time_seconds_comparison_/4228313
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0165972.t003
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Jihwan Lee; Nam-Wook Cho
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
LOF calculation time (seconds) comparison.
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.nasa.gov
+1 more
Updated Dec 7, 2023
Cite
Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Explore at:
Dataset updated
Dec 7, 2023
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, a huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
Synthetic temporal dataset for temporal trend analysis and retrieval
search-dev-2.test.dataone.org
data.niaid.nih.gov
+2 more
Updated May 10, 2024
+ more versions
Cite
Jing Ao; Kara Schatz; Rada Chirkova (2024). Synthetic temporal dataset for temporal trend analysis and retrieval [Dataset]. http://doi.org/10.5061/dryad.q573n5trf
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.q573n5trf
Dataset updated
May 10, 2024
Dataset provided by
Dryad Digital Repository
Authors
Jing Ao; Kara Schatz; Rada Chirkova
Time period covered
May 7, 2024
Description
This repository contains a synthetic, temporal data set that was generated by the authors by sampling values from the Gaussian distribution. The dataset contains eight nontemporal dimensions, a temporal dimension, and a numerical measure attribute. The data set was generated according to the scheme and procedure detailed in this source paper: Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham. The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be o...

# Synthetic temporal dataset for temporal trend analysis and retrieval

https://doi.org/10.5061/dryad.q573n5trf

The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set, e.g., retrieving outlier temporal trends using an outlier detector.

We generated the synthetic temporal data set [1], which contains up to 8 nontemporal dimensions, one temporal dimension, and a nume...
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog-dev.data.gov
datasets.ai
+2 more
Updated Feb 22, 2025
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog-dev.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Feb 22, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES
data.staging.idas-ds1.appdat.jsc.nasa.gov
data.nasa.gov
+3 more
Updated Feb 18, 2025
Cite
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/distributed-anomaly-detection-using-satellite-data-from-multiple-modalities
Explore at:
Dataset updated
Feb 18, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES
KANISHKA BHADURI, KAMALIKA DAS, AND PETR VOTAVA**
Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
sequenceMiner algorithm
s.cnmilf.com
data.nasa.gov
+3 more
Updated Dec 6, 2023
Cite
Dashlink (2023). sequenceMiner algorithm [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/sequenceminer-algorithm
Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Dashlink
Description
Detecting and describing anomalies in large repositories of discrete symbol sequences. sequenceMiner has been open-sourced! Download the file below to try it out. sequenceMiner was developed to address the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. sequenceMiner works by performing unsupervised clustering (grouping) of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. sequenceMiner utilizes a new hybrid algorithm for computing the LCS that has been shown to outperform existing algorithms by a factor of five. sequenceMiner also includes new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. This provides analysts with a coherent description of the anomalies identified in the sequence, and why they differ from more “normal” sequences. sequenceMiner was developed with funding from the NASA Aviation Safety Program. In the commercial aviation domain, sequenceMiner can be used to discover atypical behavior in airline performance data that may have possible operational significance for safety analysts. Because the sequenceMiner approach is general and not restricted to any one domain, these algorithms can also be applied in other fields where anomaly detection and event mining would be useful.
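To make the similarity measure concrete, here is a minimal sketch of a normalized LCS between two symbol sequences. It uses the textbook dynamic-programming LCS, not sequenceMiner's hybrid algorithm, and the normalization (LCS length divided by the geometric mean of the two sequence lengths) is an assumption, since the description does not state which normalization the tool uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Similarity in [0, 1]; assumed normalization by sqrt(len(a) * len(b))."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / (len(a) * len(b)) ** 0.5
```

Sequences with low similarity to every cluster would be the candidates for the outlier analysis step the description mentions.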
Visualize A Space Time Cube in 3D
hub.arcgis.com
gemelo-digital-en-arcgis-gemelodigital.hub.arcgis.com
Updated Dec 3, 2020
Cite
Society for Conservation GIS (2020). Visualize A Space Time Cube in 3D [Dataset]. https://hub.arcgis.com/maps/acddde8dae114381889b436fa0ff4b2f
Explore at:
Dataset updated
Dec 3, 2020
Dataset authored and provided by
Society for Conservation GIS
Description
Stamp Out COVID-19
An apple a day keeps the doctor away.
Linda Angulo Lopez
December 3, 2020
https://theconversation.com/coronavirus-where-do-new-viruses-come-from-136105

SNAP Participation Rates, was explored and analysed on ArcGIS Pro, the results of which can help decision makers set up further SNAP-D initiatives.

In the USA foods are stored in every State and U.S. territory and may be used by state agencies or local disaster relief organizations to provide food to shelters or people who are in need.

US Food Stamp Program has been Extended
The Supplemental Nutrition Assistance Program, SNAP, is a State Organized Food Stamp Program in the USA and was put in place to help individuals and families during this exceptional time. State agencies may request to operate a Disaster Supplemental Nutrition Assistance Program (D-SNAP).

D-SNAP Interactive Dashboard
Almost all States have set up Food Relief Programs, in response to COVID-19.

Scroll Down to Learn more about the SNAP Participation Analysis & Results

SNAP Participation Analysis
Initial results of yearly participation rates to geography show statistically significant trends, to get acquainted with the results, explore the following 3D Time Cube Map:
Visualize A Space Time Cube in 3D
https://arcg.is/1q8LLP
netCDF Results
WORKFLOW: a space-time cube was generated as a netCDF structure with the ArcGIS Pro Space-Time Mining Tool : Create a Space Time Cube from Defined Locations, other tools were then used to incorporate the spatial and temporal aspects of the SNAP County Participation Rate Feature to reveal and render statistically significant trends about Nutrition Assistance in the USA.

Hot Spot Analysis
Explore the results in 2D or 3D.
2D Hot Spots
https://arcg.is/1Pu5WH0
2D Hot Spot Results
WORKFLOW: Hot Spot Analysis, with the Hot Spot Analysis Tool shows that there are various trends across the USA for instance the Southeastern States have a mixture of consecutive, intensifying, and oscillating hot spots.
3D Hot Spots
https://arcg.is/1b41T4
3D Hot Spot Results
These trends over time are expanded in the above 3D Map, by inspecting the stacked columns you can see the trends over time which give result to the overall Hot Spot Results.
Not all counties have significant trends, symbolized as Never Significant in the Space Time Cubes.

Space-Time Pattern Mining Analysis
The North-central areas of the USA, have mostly diminishing cold spots.
2D Space-Time Mining
https://arcg.is/1PKPj0
2D Space Time Mining Results
WORKFLOW: Analysis, with the Emerging Hot Spot Analysis Tool shows that there are various trends across the USA for instance the South-Eastern States have a mixture of consecutive, intensifying, and oscillating hot spots.

Results Show
The USA has counties with persistent malnourished populations, they depend on Food Aide.
3D Space-Time Mining
https://arcg.is/01fTWf
3D Space Time Mining Results
In addition to obvious planning for consistent Hot-Hot Spot Areas, areas oscillating Hot-Cold and/or Cold-Hot Spots can be identified for further analysis to mitigate the upward trend in food insecurity in the USA, since 2009 which has become even worse since the outbreak of the COVID-19 pandemic.

After Notes:
(i) The Johns Hopkins University has an Interactive Dashboard of the Evolution of the COVID-19 Pandemic.
Coronavirus COVID-19 (2019-nCoV)
(ii) Since March 2020 in a Response to COVID-19, SNAP has had to extend its benefits to help people in need. The Food Relief is coordinated within States and by local and voluntary organizations to provide nutrition assistance to those most affected by a disaster or emergency.
Visit SNAPs Interactive Dashboard
Food Relief has been extended, reach out to your state SNAP office, if you are in need.
(iii) Follow these Steps to build an ArcGIS Pro StoryMap:
Step 1: [Get Data][Open An ArcGIS Pro Project][Run a Hot Spot Analysis][Review analysis parameters][Interpret the results][Run an Outlier Analysis][Interpret the results]
Step 2: [Open the Space-Time Pattern Mining 2 Map][Create a space-time cube][Visualize a space-time cube in 2D][Visualize a space-time cube in 3D][Run a Local Outlier Analysis][Visualize a Local Outlier Analysis in 3D]
Step 3: [Communicate Analysis][Identify your Audience & Takeaways][Create an Outline][Find Images][Prepare Maps & Scenes][Create a New Story][Add Story Elements][Add Maps & Scenes][Review the Story][Publish & Share]

A submission for the Esri MOOC
Spatial Data Science: The New Frontier in Analytics
Linda Angulo Lopez
Lauren Bennett . Shannon Kalisky . Flora Vale . Alberto Nieto . Atma Mani . Kevin Johnston . Orhun Aydin . Ankita Bakshi . Vinay Viswambharan . Jennifer Bell & Nick Giner
Distributed Anomaly Detection Using Satellite Data From Multiple Modalities
data.nasa.gov
s.cnmilf.com
+3 more
application/rdfxml +5
Updated Jun 26, 2018
Cite
(2018). Distributed Anomaly Detection Using Satellite Data From Multiple Modalities [Dataset]. https://data.nasa.gov/w/nq99-8i9w/default?cur=6qPJ9smcnds
Explore at:
Available download formats: application/rdfxml, csv, xml, json, tsv, application/rssxml
Dataset updated
Jun 26, 2018
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
Association analysis of high-high cluster road intersection crashes...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
+ more versions
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-high cluster road intersection crashes involving public transport within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975972.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.25375/uct.25975972.v1
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection crashes involving public transport (Bus, Bus-train, Combi/minibusses, midibusses) recognised as "high-high" clusters within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-high" cluster public transport road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,171, with entries presented in descending order.
Data Specifics
Data Type: Geospatial-temporal categorical data
File Format: Excel document (.xlsx)
Size: 160 KB
Number of Files: The dataset contains a total of 1620 association rules
Date Created: 23rd May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Python
Processing Steps: Following the spatio-temporal analyses and the derivation of "high-high" cluster fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes involving public transport that occurred within the "high-high" cluster fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-high" cluster road intersection public transport crashes were extracted for inclusion in this dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2021 (2020 data omitted)
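The support filter described under Processing Steps can be illustrated with a minimal sketch. It is not the authors' code: the function name `frequent_itemsets`, the `max_size` limit, and the crash records are all hypothetical, and the example uses a 0.5 threshold only because the toy data has four records. Support here is the share of records that contain every attribute in a combination.

```python
from itertools import combinations


def frequent_itemsets(records, min_support=0.10, max_size=2):
    """Return (attribute_combination, support) pairs, highest support first.

    Support is the fraction of records containing every attribute in the
    combination. Only combinations of up to max_size attributes are counted.
    """
    n = len(records)
    counts = {}
    for record in records:
        items = sorted(set(record))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    frequent = [
        (combo, count / n)
        for combo, count in counts.items()
        if count / n >= min_support
    ]
    # Descending support; ties are broken alphabetically for a stable order.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


# Hypothetical crash records; the attribute values are invented for illustration.
crashes = [
    {"vehicle=minibus", "light=daylight", "weekday=Mon"},
    {"vehicle=minibus", "light=daylight", "weekday=Tue"},
    {"vehicle=bus", "light=night", "weekday=Mon"},
    {"vehicle=minibus", "light=night", "weekday=Fri"},
]
result = frequent_itemsets(crashes, min_support=0.5)
for combo, support in result:
    print(combo, support)
```

With a threshold of 0.10, as in the dataset description, the same function would keep every combination present in at least 10% of the records.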
Privacy Preservation through Random Nonlinear Distortion
data.nasa.gov
catalog.data.gov
application/rdfxml +5
Updated Jun 26, 2018
Cite
(2018). Privacy Preservation through Random Nonlinear Distortion [Dataset]. https://data.nasa.gov/dataset/Privacy-Preservation-through-Random-Nonlinear-Dist/ugj6-i2sw
Explore at:
Available download formats: csv, xml, application/rssxml, json, tsv, application/rdfxml
Dataset updated
Jun 26, 2018
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
Consider a scenario in which the data owner has some private or sensitive data and wants a data miner to access them for studying important patterns without revealing the sensitive information. Privacy-preserving data mining aims to solve this problem by randomly transforming the data prior to their release to the data miners. Previous works only considered the case of linear data perturbations - additive, multiplicative, or a combination of both - for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy-preserving anomaly detection from sensitive data sets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that, for specific cases, it is distance preserving. A main contribution of this paper is the discussion of the relationship between the invertibility of a transformation and privacy preservation, and the application of these techniques to outlier detection. The experiments conducted on real-life data sets demonstrate the effectiveness of the approach.
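As a rough illustration of the idea of a random nonlinear distortion with a sigmoid, the sketch below projects each record through a random matrix and then applies the sigmoid elementwise. This is an assumption-laden toy, not the transformation defined in the paper: the function name `random_sigmoid_distortion`, the Gaussian weights, and the `scale` parameter (used here as a stand-in for a "degree of nonlinearity" knob) are all hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def random_sigmoid_distortion(rows, out_dim, scale=1.0, seed=0):
    """Distort each row as sigmoid(scale * W x), with W a random Gaussian matrix.

    Assumption: `scale` stands in for a nonlinearity control. Small values
    keep the sigmoid close to its linear regime; large values saturate it.
    """
    rng = random.Random(seed)
    in_dim = len(rows[0])
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    return [
        [sigmoid(scale * sum(w * x for w, x in zip(w_row, row))) for w_row in weights]
        for row in rows
    ]


# Hypothetical records: two similar points and one that lies far from them.
data = [[0.10, 0.20], [0.15, 0.25], [3.0, -3.0]]
distorted = random_sigmoid_distortion(data, out_dim=3, scale=1.0, seed=42)
for row in distorted:
    print([round(v, 3) for v in row])
```

Only the distorted rows would be released to the data miner in this sketch; the random matrix stays with the data owner.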
Malaria disease and grading system dataset from public hospitals reflecting...
data.niaid.nih.gov
datadryad.org
zip
Updated Nov 10, 2023
Cite
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.4xgxd25gn
Dataset updated
Nov 10, 2023
Dataset provided by
Nasarawa State University
Authors
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
License
https://spdx.org/licenses/CC0-1.0.html
Description
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naïve Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.
Methods
Prior to collection of data, the researcher was guided by all ethical training certification on data collection and the right to confidentiality and privacy, under the Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, which were purposively selected using a stratified sampling technique, then transformed to electronic form and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result from diagnosis. The data were divided into two tables: the first table, called data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.
Data Source Collection
The malaria incidence data set was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis. The geographical location and socio-economic factors available for patients inhabiting those areas were also taken into account. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing: Data preprocessing shall be done to remove noise and outliers.
Transformation: The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2. The dataset was split into two parts: a sample containing 70% of the data for training and the remaining 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. On the remaining 30% of the data, the resulting models were tested, and the results were compared with the other machine learning models using the standard metrics.
Classification and prediction: Based on the nature of the variables in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques in two phases: classification phase 1 and classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database test data set.
iv. Part of the test data set must be compared for classification using classifier 1 and the remaining part must be classified with classifier 2 as follows:
Classifier phase 1: It classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classifies only the data that have been classified as positive by classifier 1, and further classifies them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters as a determining factor should supply their value.
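The 70/30 partition and the two-phase cascade described above can be sketched as follows. This is only an outline of the control flow under stated assumptions: the two phase classifiers are replaced by hypothetical rule-based stand-ins rather than trained MNB models, and the field names `lab_positive` and `severe_signs` are invented for illustration.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle the records and return (training 70%, testing 30%)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def two_phase_grade(record, has_malaria, is_complicated):
    """Phase 1 decides positive (P) or negative (N).

    Phase 2 runs only on phase-1 positives and grades them as
    complicated or uncomplicated.
    """
    if not has_malaria(record):
        return "N"
    return "P/complicated" if is_complicated(record) else "P/uncomplicated"


# Hypothetical stand-ins for the two trained classifiers.
def phase1(record):
    return record["lab_positive"]


def phase2(record):
    return record["severe_signs"]


patients = [
    {"lab_positive": False, "severe_signs": False},
    {"lab_positive": True, "severe_signs": False},
    {"lab_positive": True, "severe_signs": True},
]
grades = [two_phase_grade(p, phase1, phase2) for p in patients]
print(grades)

train, test = split_70_30(list(range(10)), seed=1)
print(len(train), len(test))
```

In practice, `phase1` and `phase2` would be replaced by the predict step of whichever fitted models are used for each phase.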
Association analysis of high-high cluster road intersection pedestrian...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-high cluster road intersection pedestrian crashes resulting in serious injuries and/or fatalities within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25976719.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.25375/uct.25976719.v1
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection pedestrian crashes resulting in serious injuries and/or fatalities recognised as "high-high" clusters within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 23% of the total "high-high" cluster pedestrian road intersection crashes resulting in serious injuries and/or fatalities for the years 2017, 2018 and 2019. The dataset is meticulously organised according to confidence metric values presented in descending order.
Data Specifics
Data Type: Geospatial-temporal categorical data
File Format: Excel document (.xlsx)
Size: 18,3 KB
Number of Files: The dataset contains a total of 258 association rules
Date Created: 24th May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Python
Processing Steps: Following the spatio-temporal analyses and the derivation of "high-high" cluster fishnet grid cells from a cluster and outlier analysis, all the road intersection pedestrian crashes resulting in serious injuries and/or fatalities that occurred within the "high-high" cluster fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,20 support metric value. Consequently, commonly occurring crash attributes among at least 20% of the "high-high" cluster road intersection pedestrian crashes resulting in serious injuries and/or fatalities were extracted for inclusion in this dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2019
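Because this dataset is ordered by confidence, a short sketch of that metric may help. It uses the standard definition, confidence(A -> B) = support(A and B) / support(A), and is not taken from the authors' workflow; the function name `rule_confidence`, the records, and the candidate rules are hypothetical.

```python
def rule_confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Computed as the number of records containing both sides divided by
    the number of records containing the antecedent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= set(r))
    n_both = sum(1 for r in records if (antecedent | consequent) <= set(r))
    if n_antecedent == 0:
        return 0.0
    return n_both / n_antecedent


# Hypothetical pedestrian crash records; values are invented for illustration.
pedestrian_crashes = [
    {"light=night", "severity=fatal"},
    {"light=night", "severity=serious"},
    {"light=night", "severity=fatal"},
    {"light=daylight", "severity=serious"},
]
candidate_rules = [
    (("light=night",), ("severity=fatal",)),
    (("light=daylight",), ("severity=serious",)),
]
# Rank the rules by confidence, highest first.
ranked = sorted(
    ((a, c, rule_confidence(pedestrian_crashes, a, c)) for a, c in candidate_rules),
    key=lambda rule: -rule[2],
)
for antecedent, consequent, confidence in ranked:
    print(antecedent, "->", consequent, round(confidence, 3))
```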
Space-time pattern mining for recorded depression prevalence in England from...
eprints.soton.ac.uk
Updated Apr 26, 2024
Cite
Tsimpida, Dalia; Tsakiridi, Anastasia (2024). Space-time pattern mining for recorded depression prevalence in England from 2011 to 2022. An interactive map application [Dataset]. http://doi.org/10.5258/SOTON/D3250
Unique identifier
https://doi.org/10.5258/SOTON/D3250
Dataset updated
Apr 26, 2024
Dataset provided by
University of Southampton
Authors
Tsimpida, Dalia; Tsakiridi, Anastasia
Area covered
England
Description
ACCESS THE INTERACTIVE MAP VIA RELATED URLS AT THE BOTTOM OF THIS RECORD
Space-Time Pattern Mining for recorded depression prevalence from 2011 to 2022, based on the Anselin Local Moran’s I algorithm. The unit of analysis of this geospatial analysis was the Lower Super Output Area (LSOA). There are 32,844 LSOAs across England, with an average population of 1500 people (Office for National Statistics, 2021). In all analyses, we used the LSOA boundaries published by the Office for National Statistics as at March 21, 2021 (Office for National Statistics, 2021). The diagnosed depression prevalence was derived using the data published by NHS Digital. Figures showing the recorded prevalence of depression in England by general practitioner (GP) practice are published annually in the Quality and Outcomes Framework (QOF) administrative dataset, which also reports how the QOF-recorded prevalence has changed since the previous year (NHS Digital, 2020, pp. 2019–2020). For this study, we combined all available data on depression published by NHS Digital and created time-series recorded depression for each LSOA from 2011 to 2022. The annual aggregate data on diagnoses of depression per LSOA has been calculated based on the weighted averages of the number of patients diagnosed with depression per LSOA divided by the total number of registered patients in each LSOA. In terms of coverage, the data for Quality and Outcomes Framework (QOF) have been collected annually at an aggregate level for each of the 6470 (97.5%) GP practices in England, with approximately 61 million registered patients aged 18 years and above; thus, the dataset offers nationwide insights. This online app depicts interactively the Cluster and Outlier Analysis, using the Anselin Local Moran’s I algorithm (Anselin, 1995), to identify local indicators of spatial association (LISA) and correct for spatial dependence.
The conceptualisation of spatial relationships parameter value was set as the ‘Contiguity edges corners’, the standardisation option was set as ‘Row’, and the number of permutations was set as 999. The LISA refer to statistically significant spatial clusters of small areas with high values (high/high clusters) and low values (low/low clusters) of depression, as well as high and low spatial outliers in which a high value is surrounded by low values (high/low clusters), and outliers in which a low value is surrounded by high values (low/high clusters).
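The cluster and outlier labels described above can be illustrated with a minimal sketch of the local Moran statistic under row-standardised weights. It is not the software used for the map: the function name `local_morans_i`, the neighbour lists, and the prevalence values are hypothetical, and the 999-permutation significance test is omitted, so every area receives a label here regardless of significance.

```python
def local_morans_i(values, neighbours):
    """Return (statistic, label) for each area.

    Uses I_i = (z_i / m2) * sum_j w_ij * z_j, where z_i is the deviation
    from the mean, m2 is the mean squared deviation, and the weights w_ij
    are row-standardised (each area's neighbours share a total weight of 1).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    if m2 == 0:
        raise ValueError("all values are identical; the statistic is undefined")
    results = []
    for i in range(n):
        nbrs = neighbours[i]
        # Spatial lag: the average deviation of the neighbouring areas.
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        stat = z[i] * lag / m2
        if z[i] >= 0 and lag >= 0:
            label = "high/high"
        elif z[i] < 0 and lag < 0:
            label = "low/low"
        elif z[i] >= 0:
            label = "high/low"
        else:
            label = "low/high"
        results.append((stat, label))
    return results


# Hypothetical prevalence values for five areas arranged in a line.
prevalence = [10.0, 9.0, 1.0, 9.0, 10.0]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
lisa = local_morans_i(prevalence, adjacency)
for area, (stat, label) in enumerate(lisa):
    print(area, round(stat, 3), label)
```

A positive statistic indicates an area that resembles its neighbours (a cluster), while a negative one indicates an area that differs from them (a spatial outlier).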