17 datasets found
  1. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.nasa.gov
    • +1 more
    Updated Dec 7, 2023
    Cite
    Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  2. Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Feb 19, 2025
    + more versions
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
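The nested-loop-with-pruning idea this abstract describes can be sketched in a few lines. This is an illustrative reimplementation under assumed names (`knn_outliers`, `cutoff`), not the authors' code:

```python
import random

def knn_outliers(points, k=3, n_outliers=2, seed=0):
    """Return the n_outliers points whose distance to their k-th nearest
    neighbour is largest, via a randomized nested loop with pruning:
    a candidate is abandoned as soon as it provably cannot beat the
    weakest outlier found so far."""
    pts = list(points)
    random.Random(seed).shuffle(pts)   # random order makes early pruning likely
    top = []                           # current best (score, point) pairs
    cutoff = 0.0                       # score of the weakest point in `top`
    for x in pts:
        nearest = []                   # k smallest distances from x seen so far
        pruned = False
        for y in pts:
            if y is x:
                continue
            d = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
            nearest.append(d)
            nearest.sort()
            del nearest[k:]
            # once k neighbours are closer than the cutoff, x is no outlier
            if len(nearest) == k and nearest[-1] < cutoff:
                pruned = True
                break
        if not pruned:
            top.append((nearest[-1], x))
            top.sort(reverse=True)
            del top[n_outliers:]
            if len(top) == n_outliers:
                cutoff = top[-1][0]
    return [p for _, p in top]
```

Because most points are non-outliers, the inner loop usually terminates after only a few comparisons, which is the source of the near-linear average-case behaviour the abstract reports.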

  3. Controlled Anomalies Time Series (CATS) Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 16, 2023
    + more versions
    Cite
    Patrick Fleith (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646896
    Dataset updated
    Feb 16, 2023
    Authors
    Patrick Fleith
    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies. The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking anomaly detection algorithms in multivariate time series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, comprising: 4 deliberate actuations / control commands sent by a simulated operator/controller (for instance, commands to turn equipment ON/OFF); 3 environmental stimuli / external forces acting on the system and affecting its behaviour (for instance, wind affecting the orientation of a large ground antenna); and 10 telemetry readings representing the observable states of the complex system by means of sensors (for instance, a position, a temperature, a pressure, a voltage, a current, humidity, velocity, acceleration, etc.).
    • 5 million timestamps. Sensor readings are at 1 Hz sampling frequency.
    • 1 million nominal observations (the first 1 million datapoints), suitable for learning the "normal" behaviour.
    • 4 million observations that include both nominal and anomalous segments, suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations/timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies, to understand which anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful for evaluating the ability of algorithms to trace anomalies back to the right root cause channel.
    • Affected channels. In addition to the root cause channel, we provide information on channels possibly affected by the anomaly. This can also be useful for evaluating the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being either on or off is nominal, as is the switch, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli, to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise, at any amplitude, on top of the provided series. This makes it well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    Change Log, Version 2:

    • Metadata: we include a metadata.csv with information about anomaly categories, the root cause channel (the signal in which the anomaly is first visible), and affected channels (signals to which the anomaly might propagate through coupled system dynamics).
    • Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other, resulting in only 190 distinct anomalous segments. Now there are no more anomaly overlaps.
    • Two data files: CSV and Parquet, for convenience.

    [1] Example benchmark of anomaly detection in time series: "Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive ...
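Because the signals ship noise-free, a robustness study reduces to overlaying noise yourself. A minimal sketch with NumPy; the helper name and the amplitude sweep are illustrative, not part of the dataset's tooling:

```python
import numpy as np

def add_gaussian_noise(channel, amplitude, seed=0):
    """Overlay zero-mean Gaussian noise of the given standard deviation
    on a clean telemetry channel, leaving the original array untouched."""
    rng = np.random.default_rng(seed)
    return channel + rng.normal(0.0, amplitude, size=channel.shape)

# Sweep amplitudes to probe how quickly a detector degrades with noise.
clean = np.sin(np.linspace(0.0, 20.0, 5000))   # stand-in for one CATS channel
noisy_versions = {a: add_gaussian_noise(clean, a) for a in (0.01, 0.1, 0.5)}
```

Running the same detector over each entry of `noisy_versions` gives a robustness-versus-amplitude curve against the clean baseline.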

  4. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog-dev.data.gov
    • datasets.ai
    • +2 more
    Updated Feb 22, 2025
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog-dev.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.

  5. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1 more
    Updated Jul 29, 2024
    + more versions
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. https://search.dataone.org/view/sha256%3A08484b821e24ce46dbeb405a81e84d7457a8726456522e23d340739f2ff809ae
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer (NOWAC) study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the NOWAC study on Illumina Whole-Genome Gene Expression BeadChips (HumanHT-12 v4). Please see README.txt for details.

  6. Data from: Statistical context dictates the relationship between...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 21, 2019
    Cite
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
    Available download formats: zip
    Dataset updated
    Aug 21, 2019
    Dataset provided by
    Dryad
    Authors
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
    Time period covered
    2019
    Description

    • 201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
    • 203_Cannon_FILT_altLow_STIM.mat: cleaned EEG data from participant 203
    • 204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
    • 205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
    • 206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
    • 207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
    • 210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
    • 211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
    • 212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
    • 213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
    • 214_Cannon_FILT_altLow_STIM.mat
    • 215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
    • 216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
    • 229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
    • 233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...

  7. Goodness-of-fit filtering in classical metric multidimensional scaling with...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jan Graffelman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, for both Euclidean and non-Euclidean distance matrices. Some examples with data from demographic, genetic and geographic studies are shown.
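The eigenvalue-based overall goodness-of-fit the abstract starts from is standard for classical MDS and is easy to compute. A sketch with NumPy under an assumed function name; the paper's refined point and pairwise statistics are not reproduced here:

```python
import numpy as np

def classical_mds_gof(D, k=2):
    """Overall goodness-of-fit of a k-dimensional classical MDS map:
    the share of positive eigenvalue mass of the double-centred Gram
    matrix that is captured by the k leading axes."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix from squared distances
    ev = np.sort(np.linalg.eigvalsh(B))[::-1]    # eigenvalues, descending
    return float(ev[:k].sum() / ev[ev > 0].sum())
```

For points that truly lie on a line, a one-dimensional map already recovers essentially all of the distance information, so the statistic is close to 1.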

  8. PointDenoisingBenchmark Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 3, 2019
    Cite
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov (2019). PointDenoisingBenchmark Dataset [Dataset]. https://paperswithcode.com/dataset/pointcleannet
    Dataset updated
    Jan 3, 2019
    Authors
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov
    Description

    The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.

    • PointDenoisingBenchmark for denoising: noisy point clouds with different levels of Gaussian noise, plus the corresponding clean ground truths.
    • PointDenoisingBenchmark for outlier removal: point clouds with different levels of noise and densities of outliers, plus the corresponding clean ground truths.

  9. Data from: flic

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). flic [Dataset]. https://www.tensorflow.org/datasets/catalog/flic
    Dataset updated
    Jun 1, 2024
    Description

    From the paper: We collected a 5003-image dataset automatically from popular Hollywood movies. The images were obtained by running a state-of-the-art person detector on every tenth frame of 30 movies. People detected with high confidence (roughly 20K candidates) were then sent to the crowdsourcing marketplace Amazon Mechanical Turk to obtain ground truth labeling. Each image was annotated by five Turkers, for $0.01 each, to label 10 upper-body joints. The median-of-five labeling was taken in each image to be robust to outlier annotation. Finally, images were rejected manually by us if the person was occluded or severely non-frontal. We set aside 20% (1016 images) of the data for testing.
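The median-of-five aggregation described above is simple to reproduce. A sketch with a hypothetical helper, not part of the FLIC tooling:

```python
import statistics

def robust_joint(annotations):
    """Aggregate several annotators' (x, y) clicks for one joint by the
    coordinate-wise median, so a single wild annotation cannot move it."""
    xs, ys = zip(*annotations)
    return (statistics.median(xs), statistics.median(ys))

# Four consistent Turkers and one outlier; the outlier is ignored.
clicks = [(10, 10), (11, 9), (10, 11), (9, 10), (100, 100)]
```

Here `robust_joint(clicks)` returns (10, 10): the (100, 100) annotation never influences the result, unlike a mean, which it would drag far off the joint.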

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('flic', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/flic-small-2.0.0.png

  10. Data from: The search for loci under selection: trends, biases and progress

    • figshare.mq.edu.au
    • researchdata.edu.au
    • +3 more
    bin
    Updated Jun 15, 2023
    Cite
    Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec (2023). Data from: The search for loci under selection: trends, biases and progress [Dataset]. http://doi.org/10.5061/dryad.jq5g627
    Available download formats: bin
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Macquarie University
    Authors
    Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    FST outlier analysis (OA) and environmental association analyses (EAA) are popular approaches for detecting genetic variants under selection, providing insight into the genetic basis of local adaptation. Despite the frequent use of OA and EAA approaches and their increasing attractiveness for detecting signatures of selection, their application to field-based empirical data has not been synthesized. Here, we review 66 empirical studies that use Single Nucleotide Polymorphisms (SNPs) in OA and EAA. We report trends and biases across biological systems, sequencing methods, approaches, parameters and environmental variables, and their influence on detecting signatures of selection. We found striking variability in both the use and reporting of environmental data and statistical parameters. For example, linkage disequilibrium among SNPs and the numbers of unique SNP associations identified with EAA were rarely reported. The proportion of putatively adaptive SNPs detected varied widely among studies, and decreased with the number of SNPs analyzed. We found that genomic sampling effort had a greater impact than biological sampling effort on the proportion of identified SNPs under selection. OA identified a higher proportion of outliers when more individuals were sampled, but this was not the case for EAA. To facilitate repeatability, interpretation and synthesis of studies detecting selection, we recommend that future studies consistently report geographic coordinates, environmental data, model parameters, linkage disequilibrium, and measures of genetic structure. Identifying standards for how OA and EAA studies are designed and reported will aid future transparency and comparability of SNP-based selection studies and help to progress landscape and evolutionary genomics.

    Usage Notes. Table S1: full data set (Table S1_data.xlsx). Data were collected by reading papers associated with environmental association analyses, and include location, species, methods used, genetic parameters of the data sets reviewed, and analytical parameters of the analyses. R code for mixed-effects linear models: the R code used to create the figures and estimate regressions of the data set (Ahrens et al 2018_MolEcol_review.R).

  11. Data for: "Model-free estimation of completeness, uncertainties, and...

    • zenodo.org
    application/gzip
    Updated Mar 14, 2025
    Cite
    Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi (2025). Data for: "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory" [Dataset]. http://doi.org/10.5281/zenodo.15025644
    Available download formats: application/gzip
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 14, 2025
    Description
    # Data for: Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory
    
    This dataset contains the raw data to reproduce the paper:
    
    D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, V. Lordi. "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory". arXiv:2404.12367 (2024). DOI: [10.48550/arXiv.2404.12367](https://doi.org/10.48550/arXiv.2404.12367)
    
    The raw data in `2025-quests-data.tar.gz` contains all the raw data to reproduce the paper.
    The tarfile is sorted by section of the paper (01 through 05) and supplementary information (A01 through A11). Its structure is the following:
    ```
    data/
    ├── 02-Aluminum
    ├── 02-GAP20
    ├── 02-rMD17
    ├── 04-TM23
    ├── 05-Cu
    ├── 05-Ta
    ├── A08-Denoiser
    ├── A11-Cu
    ├── A11-QTB
    └── A11-Sn
    ```
    The tarfile contains files of the following formats:

    - CSV files containing tables with the data for the analysis
    - JSON files containing structured data for the analysis
    - logfiles from LAMMPS simulations
    - Extended XYZ files containing the results of MD trajectories or materials structure data
    
    ### Citing
    
    If you use QUESTS or its data/examples in a publication, please cite the following paper:
    
    ```bibtex
    @article{schwalbekoda2024information,
        title = {Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory},
        author = {Schwalbe-Koda, Daniel and Hamel, Sebastien and Sadigh, Babak and Zhou, Fei and Lordi, Vincenzo},
        year = {2024},
        journal = {arXiv:2404.12367},
        url = {https://arxiv.org/abs/2404.12367},
        doi = {10.48550/arXiv.2404.12367},
    }
    ```
  12. Clinical Examples of the Various Categories of Each Characteristic of...

    • plos.figshare.com
    xls
    Updated May 22, 2024
    Cite
    Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker (2024). Clinical Examples of the Various Categories of Each Characteristic of Outlier. [Dataset]. http://doi.org/10.1371/journal.pdig.0000515.t001
    Available download formats: xls
    Dataset updated
    May 22, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical Examples of the Various Categories of Each Characteristic of Outlier.

  13. Outlier.corrected.winter.txt

    • doi.ipk-gatersleben.de
    Updated Jun 19, 2019
    + more versions
    Cite
    Norman Philipp; Stephan Weise; Markus Oppermann; Andreas Börner; Andreas Graner; Jens Keilwagen; Benjamin Kilian; Daniel Arend; Yusheng Zhao; Jochen Reif; Albert Wilhelm Schulthess (2019). Outlier.corrected.winter.txt [Dataset]. https://doi.ipk-gatersleben.de/DOI/d54cbb0c-ea39-453a-992f-a2d9e2f34553/31805ab5-8348-4053-8022-63d809fdb783/1
    Dataset updated
    Jun 19, 2019
    Dataset provided by
    e!DAL - Plant Genomics and Phenomics Research Data Repository (PGP), IPK Gatersleben, Seeland OT Gatersleben, Corrensstraße 3, 06466, Germany
    Authors
    Norman Philipp; Stephan Weise; Markus Oppermann; Andreas Börner; Andreas Graner; Jens Keilwagen; Benjamin Kilian; Daniel Arend; Yusheng Zhao; Jochen Reif; Albert Wilhelm Schulthess
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides historical phenotypic observations of 12,754 spring and winter wheat accessions (Triticum aestivum L.) gathered during 70 years of seed regeneration in the field at the Federal ex situ Genebank of Agricultural and Horticultural Crops hosted by the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) in Gatersleben (Germany). The data are characterized by a highly non-orthogonal structure. The following traits were recorded: (i) flowering time (FT), which corresponds to days after the 1st of January of each year for winter wheat, and days after the sowing date for spring wheat; (ii) plant height (PH), expressed in cm; and (iii) thousand grain weight (TGW), evaluated in g. The dataset also provides information about accession numbers, accession identifiers, sowing date, harvest year and origin country, as well as monthly weather records for 63 regeneration years. The dataset and metadata are formatted using the ISA-Tab format (see subfolder /Original_data_ISATab). A previously described quality assessment pipeline has been used to derive outlier-corrected data, which serve to compute the Best Linear Unbiased Estimates (BLUEs), allowing for the direct comparison of accessions across regeneration years (see subfolder /Processed_data). Example R scripts for outlier correction and computation of BLUEs are included (see subfolder /R_scripts).
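The shipped R scripts implement the actual pipeline. As a rough illustration of what a per-year outlier-correction step can look like, a pandas sketch with an interquartile-range fence; all names here are hypothetical and this is not the IPK pipeline:

```python
import pandas as pd

def drop_trait_outliers(df, trait, group="harvest_year", k=3.0):
    """Keep only observations within k IQRs of the quartiles, with the
    quartiles computed separately for each regeneration year."""
    def within_fences(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return s.between(q1 - k * iqr, q3 + k * iqr)
    return df[df.groupby(group)[trait].transform(within_fences)]
```

Grouping by year matters because growing conditions shift between regeneration years; a value that is extreme overall may be perfectly ordinary within its own year, and vice versa.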

  14. ML-CNPM2.5

    • scidb.cn
    Updated Jun 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yulong Fan; Lin Sun; Xirong Liu (2024). ML-CNPM2.5 [Dataset]. http://doi.org/10.57760/sciencedb.08635
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Yulong Fan; Lin Sun; Xirong Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The features possibly affecting ground-based PM2.5 from 2014 to 2023 in China were collected to make up the first version of ML-CNPM2.5. Thanks to our filling and calibrating methods, over 5 million samples (5,076,608) have been obtained, far more PM2.5 samples than have been covered in previous studies, to our knowledge. To train and assess models of both primary and higher accuracy, a dataset including unfilled AOD, with 1,790,210 records, is also issued, since filled AOD always shows lower accuracy than unfilled. To distinguish the two datasets, the filled-AOD dataset is named ML-CNPM2.5-A and the unfilled one ML-CNPM2.5-B. Twenty-four features are contained in ML-CNPM2.5-A, and twenty-three in ML-CNPM2.5-B. Most of the features directly or indirectly affect ground-based PM2.5 estimation using remote sensing and ML technology, and are therefore widely used as inputs of ML-based models. The distribution of each feature in ML-CNPM2.5-A (ML-CNPM2.5-B) is revealed in Fig. 1 (Fig. 2). The figures intuitively demonstrate each feature's range of values, including median, quartiles, and outliers. For example, the distribution of Terra MAIAC AOD changes plainly after calibration, i.e., from the range 0-8 to the range 0-3, which is more realistic. The discrete features, including year, month, day, Doy and LUC, show even distribution across their ranges of values, indicating the equilibrium and comprehensiveness of our sample dataset. Detailed information about these features is listed in Table 2 (Table S1) for ML-CNPM2.5-A (ML-CNPM2.5-B). Overall, our sample dataset includes features widely used in estimating PM2.5, with high-volume and comprehensive records; as big data, it ensures the training and validation of different models.

  15. Robust logistic regression to narrow down the winner's curse for rare and...

    • b2find.dkrz.de
    Updated Oct 23, 2023
    + more versions
    Cite
    (2023). Robust logistic regression to narrow down the winner's curse for rare and recessive susceptibility variants [Source Code] - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/556bfd62-9c1f-5617-93b8-07ace2dceb08
    Dataset updated
    Oct 23, 2023
    Description

    Logistic regression is the most common technique used for genetic case-control association studies. A disadvantage of standard maximum likelihood estimators of the genotype relative risk (GRR) is their strong dependence on outlier subjects, for example, patients diagnosed at unusually young age. Robust methods are available to constrain outlier influence, but they are scarcely used in genetic studies. This article provides a non-intimidating introduction to robust logistic regression, and investigates its benefits and limitations in genetic association studies. We applied the bounded Huber and extended the R package ‘robustbase’ with the re-descending Hampel functions to down-weight outlier influence. Computer simulations were carried out to assess the type I error rate, mean squared error (MSE) and statistical power according to major characteristics of the genetic study and investigated markers. Simulations were complemented with the analysis of real data. Both standard and robust estimation controlled type I error rates. Standard logistic regression showed the highest power but standard GRR estimates also showed the largest bias and MSE, in particular for associated rare and recessive variants. For illustration, a recessive variant with a true GRR=6.32 and a minor allele frequency=0.05 investigated in a 1000 case/1000 control study by standard logistic regression resulted in power=0.60 and MSE=16.5. The corresponding figures for Huber-based estimation were power=0.51 and MSE=0.53. Overall, Hampel- and Huber-based GRR estimates did not differ much. Robust logistic regression may represent a valuable alternative to standard maximum likelihood estimation when the focus lies on risk prediction rather than identification of susceptibility variants.
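The Huber down-weighting the abstract relies on has a simple closed form. A sketch of the weight function alone (the full robust fit lives in the 'robustbase' R package and is not reproduced here):

```python
def huber_weight(r, c=1.345):
    """Huber weight for a standardized residual r: full weight inside
    |r| <= c, decaying as c/|r| outside, which bounds the influence any
    single outlier subject can exert on the fit."""
    return 1.0 if abs(r) <= c else c / abs(r)
```

For instance, a subject with residual r = 2.69 receives weight 0.5, half that of a well-fitted subject, instead of dominating the maximum likelihood estimate.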

  16.

    Data from: Genetic architecture in a marine hybrid zone: comparing outlier...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Mar 16, 2012
    Cite
    Genetic architecture in a marine hybrid zone: comparing outlier detection and genomic clines analysis in the bivalve Macoma balthica [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.70np2513
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 16, 2012
    Dataset provided by
    Dryad
    Authors
    Pieternella C. Luttikhuizen; Jan Drent; Katja T. C. A. Peijnenburg; Henk W. van der Veer; Kerstin Johannesson
    Time period covered
    2012
    Area covered
    Scandinavia, Europe
    Description

    Luttikhuizen_et_al_MolEcol_2012_datadryad: AFLP data for field-collected marine bivalves from shallow intertidal locations in NW Europe. The species is Macoma balthica, the Baltic clam. Please refer to the original publication for further information such as exact locations and local habitat characteristics. The file contains data on 644 individuals from 21 different locations scored (presence/absence) for 90 AFLP markers.
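    The outlier-detection setting named in the title can be illustrated with a crude per-marker Fst scan on a synthetic presence/absence matrix of the same shape (644 individuals, 21 locations, 90 markers). The random data, the simple Fst formula, and the 95th-percentile cutoff are all simplifying assumptions, not the method of the original publication.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-in for the Dryad file: 644 individuals from 21
    # locations scored 0/1 at 90 AFLP markers (shape only; not real data).
    n_ind, n_loc, n_mark = 644, 21, 90
    pops = rng.integers(0, n_loc, n_ind)
    geno = rng.integers(0, 2, (n_ind, n_mark))

    # Per-population band frequencies: one row per location.
    freq = np.array([geno[pops == k].mean(axis=0) for k in range(n_loc)])

    # Crude per-marker Fst: among-population variance over p_bar(1 - p_bar).
    p_bar = freq.mean(axis=0)
    fst = freq.var(axis=0) / (p_bar * (1 - p_bar) + 1e-12)

    # Flag candidate outlier loci: Fst beyond the 95th percentile.
    outliers = np.where(fst > np.quantile(fst, 0.95))[0]
    ```

    Real outlier scans (e.g. the simulation-based methods compared in the paper) model the neutral Fst distribution rather than taking an empirical percentile, but the scan structure is the same.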

  17.

    City of Darwin Average Park Water Usage

    • esriaustraliahub.com.au
    • open-darwin.opendata.arcgis.com
    • +2more
    Updated Aug 28, 2018
    Cite
    jsilburn (2018). City of Darwin Average Park Water Usage [Dataset]. https://www.esriaustraliahub.com.au/maps/1a6091b606d94365998d157686971413
    Explore at:
    Dataset updated
    Aug 28, 2018
    Dataset authored and provided by
    jsilburn
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Many of Darwin's parks are connected to an automated irrigation system that can report water usage and other attributes. Note that parks have been added to and removed from the system over the years, and sensors have been damaged and repaired; some parks therefore show little or no usage. Outliers may also exist (for example, sensors reporting incorrect usage). The attached datasets also contain other month-to-month water usage data. Update frequency: TBA
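    The sensor-fault caveat above suggests screening readings before analysis. A minimal sketch using Tukey's IQR fences on hypothetical monthly readings (the values and units are invented for illustration, not taken from the Darwin dataset):

    ```python
    import numpy as np

    # Hypothetical monthly usage for one park (kL); zeros stand in for
    # offline sensors, the spike for a faulty reading.
    usage = np.array([120., 135., 0., 128., 140., 2500., 122., 0., 131.,
                      138., 126., 133.])

    # Tukey fences: flag readings outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR];
    # both dropouts (zeros) and spikes get caught.
    q1, q3 = np.percentile(usage, [25, 75])
    iqr = q3 - q1
    mask = (usage < q1 - 1.5 * iqr) | (usage > q3 + 1.5 * iqr)
    flagged = usage[mask]
    ```

    For this series the two zero readings and the 2500 kL spike are flagged, leaving the plausible monthly values for analysis.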


