17 datasets found
  1. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.nasa.gov
    • +1 more
    Updated Dec 7, 2023
    Cite
    Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  2. Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Feb 19, 2025
    + more versions
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
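The nested-loop-with-pruning idea this abstract describes can be sketched in a few lines. This is an illustrative reimplementation under assumed names (`knn_outliers`, `cutoff`), not the authors' code:

```python
import random

def knn_outliers(points, k=3, n_outliers=2, seed=0):
    """Return the n_outliers points whose distance to their k-th nearest
    neighbour is largest, via a randomized nested loop with pruning:
    a candidate is abandoned as soon as it provably cannot beat the
    weakest outlier found so far."""
    pts = list(points)
    random.Random(seed).shuffle(pts)   # random order makes early pruning likely
    top = []                           # current best (score, point) pairs
    cutoff = 0.0                       # score of the weakest point in `top`
    for x in pts:
        nearest = []                   # k smallest distances from x seen so far
        pruned = False
        for y in pts:
            if y is x:
                continue
            d = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
            nearest.append(d)
            nearest.sort()
            del nearest[k:]
            # once k neighbours are closer than the cutoff, x is no outlier
            if len(nearest) == k and nearest[-1] < cutoff:
                pruned = True
                break
        if not pruned:
            top.append((nearest[-1], x))
            top.sort(reverse=True)
            del top[n_outliers:]
            if len(top) == n_outliers:
                cutoff = top[-1][0]
    return [p for _, p in top]
```

Because most points are non-outliers, the inner loop usually terminates after only a few comparisons, which is the source of the near-linear average-case behaviour the abstract reports.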

  3. Controlled Anomalies Time Series (CATS) Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 16, 2023
    + more versions
    Cite
    Patrick Fleith (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646896
    Dataset updated
    Feb 16, 2023
    Authors
    Patrick Fleith
    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies. The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking anomaly detection algorithms in multivariate time series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, comprising: 4 deliberate actuations / control commands sent by a simulated operator/controller (for instance, commands to turn equipment ON/OFF); 3 environmental stimuli / external forces acting on the system and affecting its behaviour (for instance, wind affecting the orientation of a large ground antenna); and 10 telemetry readings representing the observable states of the complex system by means of sensors (for instance, a position, a temperature, a pressure, a voltage, a current, humidity, velocity, acceleration, etc.).
    • 5 million timestamps. Sensor readings are at 1 Hz sampling frequency.
    • 1 million nominal observations (the first 1 million datapoints), suitable for learning the "normal" behaviour.
    • 4 million observations that include both nominal and anomalous segments, suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations/timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies, to understand which anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful for evaluating the ability of algorithms to trace anomalies back to the right root cause channel.
    • Affected channels. In addition to the root cause channel, we provide information on channels possibly affected by the anomaly. This can also be useful for evaluating the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being either on or off is nominal, as is the switch, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli, to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise, at any amplitude, on top of the provided series. This makes it well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    Change Log, Version 2:

    • Metadata: we include a metadata.csv with information about anomaly categories, the root cause channel (the signal in which the anomaly is first visible), and affected channels (signals to which the anomaly might propagate through coupled system dynamics).
    • Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other, resulting in only 190 distinct anomalous segments. Now there are no more anomaly overlaps.
    • Two data files: CSV and Parquet, for convenience.

    [1] Example benchmark of anomaly detection in time series: "Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive ...
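Because the signals ship noise-free, a robustness study reduces to overlaying noise yourself. A minimal sketch with NumPy; the helper name and the amplitude sweep are illustrative, not part of the dataset's tooling:

```python
import numpy as np

def add_gaussian_noise(channel, amplitude, seed=0):
    """Overlay zero-mean Gaussian noise of the given standard deviation
    on a clean telemetry channel, leaving the original array untouched."""
    rng = np.random.default_rng(seed)
    return channel + rng.normal(0.0, amplitude, size=channel.shape)

# Sweep amplitudes to probe how quickly a detector degrades with noise.
clean = np.sin(np.linspace(0.0, 20.0, 5000))   # stand-in for one CATS channel
noisy_versions = {a: add_gaussian_noise(clean, a) for a in (0.01, 0.1, 0.5)}
```

Running the same detector over each entry of `noisy_versions` gives a robustness-versus-amplitude curve against the clean baseline.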

  4. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog-dev.data.gov
    • datasets.ai
    • +2 more
    Updated Feb 22, 2025
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog-dev.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.

  5. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1 more
    Updated Jul 29, 2024
    + more versions
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. https://search.dataone.org/view/sha256%3A08484b821e24ce46dbeb405a81e84d7457a8726456522e23d340739f2ff809ae
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer (NOWAC) study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the NOWAC study on Illumina Whole-Genome Gene Expression BeadChips (HumanHT-12 v4). Please see README.txt for details.

  6. Data from: Statistical context dictates the relationship between...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 21, 2019
    Cite
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
    Available download formats: zip
    Dataset updated
    Aug 21, 2019
    Dataset provided by
    Dryad
    Authors
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
    Time period covered
    2019
    Description

    • 201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
    • 203_Cannon_FILT_altLow_STIM.mat: cleaned EEG data from participant 203
    • 204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
    • 205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
    • 206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
    • 207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
    • 210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
    • 211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
    • 212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
    • 213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
    • 214_Cannon_FILT_altLow_STIM.mat
    • 215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
    • 216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
    • 229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
    • 233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...

  7. Goodness-of-fit filtering in classical metric multidimensional scaling with...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jan Graffelman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, for both Euclidean and non-Euclidean distance matrices. Some examples with data from demographic, genetic and geographic studies are shown.
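The eigenvalue-based overall goodness-of-fit the abstract starts from is standard for classical MDS and is easy to compute. A sketch with NumPy under an assumed function name; the paper's refined point and pairwise statistics are not reproduced here:

```python
import numpy as np

def classical_mds_gof(D, k=2):
    """Overall goodness-of-fit of a k-dimensional classical MDS map:
    the share of positive eigenvalue mass of the double-centred Gram
    matrix that is captured by the k leading axes."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix from squared distances
    ev = np.sort(np.linalg.eigvalsh(B))[::-1]    # eigenvalues, descending
    return float(ev[:k].sum() / ev[ev > 0].sum())
```

For points that truly lie on a line, a one-dimensional map already recovers essentially all of the distance information, so the statistic is close to 1.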

  8. PointDenoisingBenchmark Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 3, 2019
    Cite
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov (2019). PointDenoisingBenchmark Dataset [Dataset]. https://paperswithcode.com/dataset/pointcleannet
    Dataset updated
    Jan 3, 2019
    Authors
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov
    Description

    The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.

    • PointDenoisingBenchmark for denoising: noisy point clouds with different levels of Gaussian noise, plus the corresponding clean ground truths.
    • PointDenoisingBenchmark for outlier removal: point clouds with different levels of noise and densities of outliers, plus the corresponding clean ground truths.

  9. Data from: flic

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). flic [Dataset]. https://www.tensorflow.org/datasets/catalog/flic
    Dataset updated
    Jun 1, 2024
    Description

    From the paper: We collected a 5003-image dataset automatically from popular Hollywood movies. The images were obtained by running a state-of-the-art person detector on every tenth frame of 30 movies. People detected with high confidence (roughly 20K candidates) were then sent to the crowdsourcing marketplace Amazon Mechanical Turk to obtain ground truth labeling. Each image was annotated by five Turkers, for $0.01 each, to label 10 upper-body joints. The median-of-five labeling was taken in each image to be robust to outlier annotation. Finally, images were rejected manually by us if the person was occluded or severely non-frontal. We set aside 20% (1016 images) of the data for testing.
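The median-of-five aggregation described above is simple to reproduce. A sketch with a hypothetical helper, not part of the FLIC tooling:

```python
import statistics

def robust_joint(annotations):
    """Aggregate several annotators' (x, y) clicks for one joint by the
    coordinate-wise median, so a single wild annotation cannot move it."""
    xs, ys = zip(*annotations)
    return (statistics.median(xs), statistics.median(ys))

# Four consistent Turkers and one outlier; the outlier is ignored.
clicks = [(10, 10), (11, 9), (10, 11), (9, 10), (100, 100)]
```

Here `robust_joint(clicks)` returns (10, 10): the (100, 100) annotation never influences the result, unlike a mean, which it would drag far off the joint.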

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('flic', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/flic-small-2.0.0.png

  10. Data from: The search for loci under selection: trends, biases and progress

    • figshare.mq.edu.au
    • researchdata.edu.au
    • +3 more
    bin
    Updated Jun 15, 2023
    Cite
    Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec (2023). Data from: The search for loci under selection: trends, biases and progress [Dataset]. http://doi.org/10.5061/dryad.jq5g627
    Available download formats: bin
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Macquarie University
    Authors
    Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    FST outlier analysis (OA) and environmental association analyses (EAA) are popular approaches for detecting genetic variants under selection, providing insight into the genetic basis of local adaptation. Despite the frequent use of OA and EAA approaches and their increasing attractiveness for detecting signatures of selection, their application to field-based empirical data has not been synthesized. Here, we review 66 empirical studies that use Single Nucleotide Polymorphisms (SNPs) in OA and EAA. We report trends and biases across biological systems, sequencing methods, approaches, parameters and environmental variables, and their influence on detecting signatures of selection. We found striking variability in both the use and reporting of environmental data and statistical parameters. For example, linkage disequilibrium among SNPs and the numbers of unique SNP associations identified with EAA were rarely reported. The proportion of putatively adaptive SNPs detected varied widely among studies, and decreased with the number of SNPs analyzed. We found that genomic sampling effort had a greater impact than biological sampling effort on the proportion of identified SNPs under selection. OA identified a higher proportion of outliers when more individuals were sampled, but this was not the case for EAA. To facilitate repeatability, interpretation and synthesis of studies detecting selection, we recommend that future studies consistently report geographic coordinates, environmental data, model parameters, linkage disequilibrium, and measures of genetic structure. Identifying standards for how OA and EAA studies are designed and reported will aid future transparency and comparability of SNP-based selection studies and help to progress landscape and evolutionary genomics.

    Usage Notes. Table S1: full data set (Table S1_data.xlsx). Data were collected by reading papers associated with environmental association analyses, and include location, species, methods used, genetic parameters of the data sets reviewed, and analytical parameters of the analyses. R code for mixed-effects linear models: the R code used to create the figures and estimate regressions of the data set (Ahrens et al 2018_MolEcol_review.R).

  11. Data for: "Model-free estimation of completeness, uncertainties, and...

    • zenodo.org
    application/gzip
    Updated Mar 14, 2025
    Cite
    Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi (2025). Data for: "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory" [Dataset]. http://doi.org/10.5281/zenodo.15025644
    Available download formats: application/gzip
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 14, 2025
    Description
    # Data for: Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory
    
    This dataset contains the raw data to reproduce the paper:
    
    D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, V. Lordi. "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory". arXiv:2404.12367 (2024). DOI: [10.48550/arXiv.2404.12367](https://doi.org/10.48550/arXiv.2404.12367)
    
    The raw data in `2025-quests-data.tar.gz` contains all the raw data to reproduce the paper.
    The tarfile is sorted by section of the paper (01 through 05) and supplementary information (A01 through A11). Its structure is the following:
    ```
    data/
    ├── 02-Aluminum
    ├── 02-GAP20
    ├── 02-rMD17
    ├── 04-TM23
    ├── 05-Cu
    ├── 05-Ta
    ├── A08-Denoiser
    ├── A11-Cu
    ├── A11-QTB
    └── A11-Sn
    ```
    The tarfile contains files of the following formats:

    - CSV files containing tables with the data for the analysis
    - JSON files containing structured data for the analysis
    - logfiles from LAMMPS simulations
    - Extended XYZ files containing the results of MD trajectories or materials structure data
    
    ### Citing
    
    If you use QUESTS or its data/examples in a publication, please cite the following paper:
    
    ```bibtex
    @article{schwalbekoda2024information,
        title = {Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory},
        author = {Schwalbe-Koda, Daniel and Hamel, Sebastien and Sadigh, Babak and Zhou, Fei and Lordi, Vincenzo},
        year = {2024},
        journal = {arXiv:2404.12367},
        url = {https://arxiv.org/abs/2404.12367},
        doi = {10.48550/arXiv.2404.12367},
    }
    ```
  12. Clinical Examples of the Various Categories of Each Characteristic of...

    • plos.figshare.com
    xls
    Updated May 22, 2024
    Cite
    Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker (2024). Clinical Examples of the Various Categories of Each Characteristic of Outlier. [Dataset]. http://doi.org/10.1371/journal.pdig.0000515.t001
    Available download formats: xls
    Dataset updated
    May 22, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical Examples of the Various Categories of Each Characteristic of Outlier.

  13. Outlier.corrected.winter.txt

    • doi.ipk-gatersleben.de
    Updated Jun 19, 2019
    + more versions
    Cite
    Norman Philipp; Stephan Weise; Markus Oppermann; Andreas Börner; Andreas Graner; Jens Keilwagen; Benjamin Kilian; Daniel Arend; Yusheng Zhao; Jochen Reif; Albert Wilhelm Schulthess (2019). Outlier.corrected.winter.txt [Dataset]. https://doi.ipk-gatersleben.de/DOI/d54cbb0c-ea39-453a-992f-a2d9e2f34553/31805ab5-8348-4053-8022-63d809fdb783/1
    Dataset updated
    Jun 19, 2019
    Dataset provided by
    e!DAL - Plant Genomics and Phenomics Research Data Repository (PGP), IPK Gatersleben, Seeland OT Gatersleben, Corrensstraße 3, 06466, Germany
    Authors
    Norman Philipp; Stephan Weise; Markus Oppermann; Andreas Börner; Andreas Graner; Jens Keilwagen; Benjamin Kilian; Daniel Arend; Yusheng Zhao; Jochen Reif; Albert Wilhelm Schulthess
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides historical phenotypic observations of 12,754 spring and winter wheat accessions (Triticum aestivum L.) gathered during 70 years of seed regeneration in the field at the Federal ex situ Genebank of Agricultural and Horticultural Crops hosted by the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) in Gatersleben (Germany). The data are characterized by a highly non-orthogonal structure. The following traits were recorded: (i) flowering time (FT), which corresponds to days after the 1st of January of each year for winter wheat, and days after the sowing date for spring wheat; (ii) plant height (PH), expressed in cm; and (iii) thousand grain weight (TGW), evaluated in g. The dataset also provides information about accession numbers, accession identifiers, sowing date, harvest year and origin country, as well as monthly weather records for 63 regeneration years. The dataset and metadata are formatted using the ISA-Tab format (see subfolder /Original_data_ISATab). A previously described quality assessment pipeline has been used to derive outlier-corrected data, which serve to compute the Best Linear Unbiased Estimates (BLUEs), allowing for the direct comparison of accessions across regeneration years (see subfolder /Processed_data). Example R scripts for outlier correction and computation of BLUEs are included (see subfolder /R_scripts).
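The shipped R scripts implement the actual pipeline. As a rough illustration of what a per-year outlier-correction step can look like, a pandas sketch with an interquartile-range fence; all names here are hypothetical and this is not the IPK pipeline:

```python
import pandas as pd

def drop_trait_outliers(df, trait, group="harvest_year", k=3.0):
    """Keep only observations within k IQRs of the quartiles, with the
    quartiles computed separately for each regeneration year."""
    def within_fences(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return s.between(q1 - k * iqr, q3 + k * iqr)
    return df[df.groupby(group)[trait].transform(within_fences)]
```

Grouping by year matters because growing conditions shift between regeneration years; a value that is extreme overall may be perfectly ordinary within its own year, and vice versa.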

  14. ML-CNPM2.5

    • scidb.cn
    Updated Jun 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yulong Fan; Lin Sun; Xirong Liu (2024). ML-CNPM2.5 [Dataset]. http://doi.org/10.57760/sciencedb.08635
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Yulong Fan; Lin Sun; Xirong Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The features possibly affecting ground-based PM2.5 from 2014 to 2023 in China were collected to make up the first version of ML-CNPM2.5. Thanks to our filling and calibrating methods, over 5 million samples (5,076,608) have been obtained, far more PM2.5 samples than have been covered in previous studies, to our knowledge. To train and assess models of both primary and higher accuracy, a dataset including unfilled AOD, with 1,790,210 records, is also issued, since filled AOD always shows lower accuracy than unfilled. To distinguish the two datasets, the filled-AOD dataset is named ML-CNPM2.5-A and the unfilled one ML-CNPM2.5-B. Twenty-four features are contained in ML-CNPM2.5-A, and twenty-three in ML-CNPM2.5-B. Most of the features directly or indirectly affect ground-based PM2.5 estimation using remote sensing and ML technology, and are therefore widely used as inputs of ML-based models. The distribution of each feature in ML-CNPM2.5-A (ML-CNPM2.5-B) is revealed in Fig. 1 (Fig. 2). The figures intuitively demonstrate each feature's range of values, including median, quartiles, and outliers. For example, the distribution of Terra MAIAC AOD changes plainly after calibration, i.e., from the range 0-8 to the range 0-3, which is more realistic. The discrete features, including year, month, day, Doy and LUC, show even distribution across their ranges of values, indicating the equilibrium and comprehensiveness of our sample dataset. Detailed information about these features is listed in Table 2 (Table S1) for ML-CNPM2.5-A (ML-CNPM2.5-B). Overall, our sample dataset includes features widely used in estimating PM2.5, with high-volume and comprehensive records; as big data, it ensures the training and validation of different models.

  15. Robust logistic regression to narrow down the winner's curse for rare and...

    • b2find.dkrz.de
    Updated Oct 23, 2023
    + more versions
    Cite
    (2023). Robust logistic regression to narrow down the winner's curse for rare and recessive susceptibility variants [Source Code] - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/556bfd62-9c1f-5617-93b8-07ace2dceb08
    Dataset updated
    Oct 23, 2023
    Description

    Logistic regression is the most common technique used for genetic case-control association studies. A disadvantage of standard maximum likelihood estimators of the genotype relative risk (GRR) is their strong dependence on outlier subjects, for example, patients diagnosed at unusually young age. Robust methods are available to constrain outlier influence, but they are scarcely used in genetic studies. This article provides a non-intimidating introduction to robust logistic regression, and investigates its benefits and limitations in genetic association studies. We applied the bounded Huber and extended the R package ‘robustbase’ with the re-descending Hampel functions to down-weight outlier influence. Computer simulations were carried out to assess the type I error rate, mean squared error (MSE) and statistical power according to major characteristics of the genetic study and investigated markers. Simulations were complemented with the analysis of real data. Both standard and robust estimation controlled type I error rates. Standard logistic regression showed the highest power but standard GRR estimates also showed the largest bias and MSE, in particular for associated rare and recessive variants. For illustration, a recessive variant with a true GRR=6.32 and a minor allele frequency=0.05 investigated in a 1000 case/1000 control study by standard logistic regression resulted in power=0.60 and MSE=16.5. The corresponding figures for Huber-based estimation were power=0.51 and MSE=0.53. Overall, Hampel- and Huber-based GRR estimates did not differ much. Robust logistic regression may represent a valuable alternative to standard maximum likelihood estimation when the focus lies on risk prediction rather than identification of susceptibility variants.
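The Huber down-weighting the abstract relies on has a simple closed form. A sketch of the weight function alone (the full robust fit lives in the 'robustbase' R package and is not reproduced here):

```python
def huber_weight(r, c=1.345):
    """Huber weight for a standardized residual r: full weight inside
    |r| <= c, decaying as c/|r| outside, which bounds the influence any
    single outlier subject can exert on the fit."""
    return 1.0 if abs(r) <= c else c / abs(r)
```

For instance, a subject with residual r = 2.69 receives weight 0.5, half that of a well-fitted subject, instead of dominating the maximum likelihood estimate.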

  16.

    Data from: Genetic architecture in a marine hybrid zone: comparing outlier...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Mar 16, 2012
    Cite
    Genetic architecture in a marine hybrid zone: comparing outlier detection and genomic clines analysis in the bivalve Macoma balthica [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.70np2513
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 16, 2012
    Dataset provided by
    Dryad
    Authors
    Pieternella C. Luttikhuizen; Jan Drent; Katja T. C. A. Peijnenburg; Henk W. van der Veer; Kerstin Johannesson
    Time period covered
    2012
    Area covered
    Scandinavia, Europe
    Description

    Luttikhuizen_et_al_MolEcol_2012_datadryad: AFLP data for field-collected marine bivalves from shallow intertidal locations in NW Europe. The species is Macoma balthica, the Baltic clam. Please refer to the original publication for further information such as exact locations and local habitat characteristics. The file contains data on 644 individuals from 21 different locations scored (presence/absence) for 90 AFLP markers.
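    The outlier-detection setting named in the title can be illustrated with a crude per-marker Fst scan on a synthetic presence/absence matrix of the same shape (644 individuals, 21 locations, 90 markers). The random data, the simple Fst formula, and the 95th-percentile cutoff are all simplifying assumptions, not the method of the original publication.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-in for the Dryad file: 644 individuals from 21
    # locations scored 0/1 at 90 AFLP markers (shape only; not real data).
    n_ind, n_loc, n_mark = 644, 21, 90
    pops = rng.integers(0, n_loc, n_ind)
    geno = rng.integers(0, 2, (n_ind, n_mark))

    # Per-population band frequencies: one row per location.
    freq = np.array([geno[pops == k].mean(axis=0) for k in range(n_loc)])

    # Crude per-marker Fst: among-population variance over p_bar(1 - p_bar).
    p_bar = freq.mean(axis=0)
    fst = freq.var(axis=0) / (p_bar * (1 - p_bar) + 1e-12)

    # Flag candidate outlier loci: Fst beyond the 95th percentile.
    outliers = np.where(fst > np.quantile(fst, 0.95))[0]
    ```

    Real outlier scans (e.g. the simulation-based methods compared in the paper) model the neutral Fst distribution rather than taking an empirical percentile, but the scan structure is the same.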

  17.

    City of Darwin Average Park Water Usage

    • esriaustraliahub.com.au
    • open-darwin.opendata.arcgis.com
    • +2more
    Updated Aug 28, 2018
    Cite
    jsilburn (2018). City of Darwin Average Park Water Usage [Dataset]. https://www.esriaustraliahub.com.au/maps/1a6091b606d94365998d157686971413
    Explore at:
    Dataset updated
    Aug 28, 2018
    Dataset authored and provided by
    jsilburn
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Many of Darwin's parks are connected to an automated irrigation system that can report water usage and other attributes. Note that parks have been added to and removed from the system over the years, and sensors have been damaged and repaired; some parks therefore show little or no usage. Outliers may also exist (for example, sensors reporting incorrect usage). The attached datasets also contain other month-to-month water usage data. Update frequency: TBA
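    The sensor-fault caveat above suggests screening readings before analysis. A minimal sketch using Tukey's IQR fences on hypothetical monthly readings (the values and units are invented for illustration, not taken from the Darwin dataset):

    ```python
    import numpy as np

    # Hypothetical monthly usage for one park (kL); zeros stand in for
    # offline sensors, the spike for a faulty reading.
    usage = np.array([120., 135., 0., 128., 140., 2500., 122., 0., 131.,
                      138., 126., 133.])

    # Tukey fences: flag readings outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR];
    # both dropouts (zeros) and spikes get caught.
    q1, q3 = np.percentile(usage, [25, 75])
    iqr = q3 - q1
    mask = (usage < q1 - 1.5 * iqr) | (usage > q3 + 1.5 * iqr)
    flagged = usage[mask]
    ```

    For this series the two zero readings and the 2500 kL spike are flagged, leaving the plausible monthly values for analysis.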


