24 datasets found
  1. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 17, 2024
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present MNIST4OD, a dataset that is large in both the number of dimensions and the number of instances and is suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

    We build MNIST4OD in the following way. To distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

    Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

    Statistics of each dataset:

    Name | Instances | Dimensions | Number of Outliers in %
    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
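
The generation procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' script: the function name `build_outlier_dataset`, the rounding of the 10% outlier count, and the position of the original-class column (second to last) are all hypothetical choices; the description only fixes the last column as the yes/no outlier label.

```python
import random


def build_outlier_dataset(images, labels, inlier_digit, outlier_ratio=0.10, seed=0):
    """Build one MNIST4OD-style table for a chosen inlier digit.

    images: sequence of 28 x 28 nested lists; labels: sequence of digits 0-9.
    Returns rows of [784 pixel values..., original class, "yes"/"no"].
    """
    rng = random.Random(seed)
    inlier_idx = [i for i, y in enumerate(labels) if y == inlier_digit]
    other_idx = [i for i, y in enumerate(labels) if y != inlier_digit]

    # Assumption: the outlier count is 10% of the inlier count, rounded.
    n_outliers = round(outlier_ratio * len(inlier_idx))
    # Uniform sampling without replacement from all remaining images.
    outlier_idx = rng.sample(other_idx, n_outliers)

    rows = []
    for i in inlier_idx + outlier_idx:
        # Flatten the 28 x 28 image into a 784-element vector.
        pixels = [value for image_row in images[i] for value in image_row]
        is_outlier = "no" if labels[i] == inlier_digit else "yes"
        # Assumed column order: pixels, original class, outlier label (last).
        rows.append(pixels + [labels[i], is_outlier])
    return rows
```

Calling the function once per digit (0 to 9) would yield ten tables analogous to the MNIST_x files listed above.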

  2. Algorithms for Speeding up Distance-Based Outlier Detection

    • s.cnmilf.com
    • data.nasa.gov
    • +1more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Algorithms for Speeding up Distance-Based Outlier Detection [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/algorithms-for-speeding-up-distance-based-outlier-detection
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order-of-magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed methods.
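
The two ideas named in the description, a cutoff threshold for pruning and an index built by sorting points by distance from a fixed reference point, can be illustrated with a small single-machine sketch. This is not the iOrca, DOoR, or iDOoR algorithm; it is a simplified nested-loop detector under assumptions of our own: the outlier score is the distance to the k-th nearest neighbour, the reference point is the centroid, and candidates are visited from the far end of the sorted index so that the cutoff rises early. The name `top_outliers` is hypothetical.

```python
import heapq
import math


def top_outliers(points, k, n_outliers):
    """Return [(score, index)] for the n_outliers points with the largest
    distance to their k-th nearest neighbour (requires k < len(points))."""
    dim = len(points[0])
    # Assumption: the centroid serves as the fixed reference point.
    reference = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Index: point ids sorted by increasing distance from the reference.
    index = sorted(range(len(points)), key=lambda i: math.dist(points[i], reference))

    cutoff = 0.0  # score of the weakest outlier kept so far
    top = []      # min-heap of (score, index)
    # Assumption: visit candidates starting from the far end of the index.
    for i in reversed(index):
        nearest = []  # negated distances: max-heap of the k smallest so far
        pruned = False
        for j in range(len(points)):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The k-th neighbour distance can only shrink; once it drops
            # below the cutoff, this point cannot be a top outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

Pruning does not change the result relative to the full quadratic scan; it only skips distance computations for points that are already known to fall below the cutoff.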

  3. Model Access Outlier Detection Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Model Access Outlier Detection Market Research Report 2033 [Dataset]. https://dataintelo.com/report/model-access-outlier-detection-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Model Access Outlier Detection Market Outlook



    According to our latest research, the global Model Access Outlier Detection market size reached USD 1.32 billion in 2024, driven by the increasing need for advanced anomaly detection in digital infrastructure. The market is projected to grow at a CAGR of 14.8% from 2025 to 2033, reaching an estimated USD 4.15 billion by 2033. This robust growth is fueled by the rising adoption of AI-based security solutions, the proliferation of complex data environments, and the urgent demand for real-time threat detection across critical industries.
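
Growth figures like these can be cross-checked with the standard compound annual growth rate (CAGR) formula. The sketch below is a generic calculation, not the report's methodology; the helper names `cagr` and `project` are hypothetical, and treating 2024 to 2033 as nine compounding periods is an assumption, since the report does not state how it counts periods.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value, rate, periods):
    """Value reached after compounding start_value at rate for periods years."""
    return start_value * (1 + rate) ** periods


# Figures quoted in the paragraph above (USD billion); 9 periods is assumed.
implied_rate = cagr(1.32, 4.15, 9)
projected_size = project(1.32, 0.148, 9)
print(f"implied CAGR: {implied_rate:.1%}")
print(f"size projected at 14.8% per year: USD {projected_size:.2f} billion")
```

Comparing the implied rate with the quoted rate shows how sensitive such projections are to the assumed base year and number of periods.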




    The primary growth factor for the Model Access Outlier Detection market is the exponential increase in cyber threats and sophisticated attacks targeting enterprise data and networks. As organizations digitize operations, they generate vast volumes of data, making traditional rule-based security approaches inadequate. Outlier detection solutions leverage machine learning and artificial intelligence to identify unusual patterns and potential threats in real time, significantly reducing response times and minimizing the risk of data breaches. The integration of these technologies into existing security frameworks is becoming a necessity, especially in highly regulated sectors such as banking, healthcare, and government, where data integrity and privacy are paramount.




    Another significant driver propelling the market is the rapid adoption of cloud computing and the proliferation of IoT devices. As businesses migrate workloads to the cloud and deploy interconnected devices, the attack surface expands, necessitating advanced outlier detection mechanisms. Cloud-based solutions offer scalability, flexibility, and centralized monitoring, making them particularly attractive for organizations with distributed operations. Furthermore, the shift towards remote work and digital collaboration has increased the demand for real-time monitoring and anomaly detection to safeguard sensitive data and ensure business continuity. The continuous evolution of AI algorithms and the availability of big data analytics further enhance the accuracy and efficiency of outlier detection systems, contributing to sustained market growth.




    The growing emphasis on regulatory compliance and data protection standards worldwide is also catalyzing the adoption of Model Access Outlier Detection solutions. Stringent regulations such as GDPR, HIPAA, and PCI DSS require organizations to implement robust security measures and continuously monitor access to critical systems. Outlier detection tools play a vital role in meeting these compliance requirements by providing automated alerts, detailed audit trails, and actionable insights into suspicious activities. As regulatory landscapes become more complex, organizations are investing in advanced detection technologies not only to avoid penalties but also to build trust with customers and stakeholders.




    From a regional perspective, North America currently dominates the Model Access Outlier Detection market, accounting for the largest revenue share in 2024, followed by Europe and Asia Pacific. The presence of leading technology vendors, high cybersecurity awareness, and significant investments in digital infrastructure contribute to North America’s leadership. Europe is experiencing steady growth due to stringent data protection regulations and the increasing adoption of cloud-based security solutions. Meanwhile, the Asia Pacific region is poised for the fastest growth, driven by rapid digital transformation, expanding IT ecosystems, and rising incidences of cyber threats in emerging economies. The market’s global expansion is further supported by ongoing technological advancements and the increasing integration of AI and machine learning in security operations.



    Component Analysis



    The Component segment of the Model Access Outlier Detection market is broadly categorized into Software and Services. Software solutions are at the core of this market, comprising advanced analytics platforms, AI-driven detection engines, and customizable dashboards. These software offerings are designed to seamlessly integrate with existing IT infrastructure, providing organizations with the capability to monitor access patterns, identify anomalies, and generate real-time alerts. The sophistication of these tools lies in their ability to adapt to evolving threat landscapes, utilizing machine learning algorithms to

  4. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
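
The communication-saving idea, centralizing only a small sample of a vertically partitioned dataset, can be sketched as follows. The description does not say how the paper selects its sample or how the 1-class SVM is trained on it, so everything here is an assumption for illustration: uniform sampling of row ids with a shared seed, the dict-based site layout, and the name `centralize_sample`.

```python
import random


def centralize_sample(sites, sample_size, seed=0):
    """Join a small random sample of rows from vertically partitioned data.

    sites: list of dicts, one per location, mapping row id -> that
    location's feature values. Every site holds the same row ids but
    different columns.
    """
    rng = random.Random(seed)
    row_ids = sorted(sites[0])
    chosen = sorted(rng.sample(row_ids, sample_size))
    # Each site ships only its own columns for the chosen rows; the
    # central node concatenates them into full feature vectors.
    return {rid: [value for site in sites for value in site[rid]] for rid in chosen}


# Toy usage: 1,000 rows whose features live at two hypothetical sites.
site_a = {rid: [rid * 0.1, rid * 0.2] for rid in range(1000)}
site_b = {rid: [rid * 0.3] for rid in range(1000)}
sample = centralize_sample([site_a, site_b], sample_size=20)
```

In this toy setup only 20 of 1,000 rows cross the network. A one-class model would then be fit on the joined sample at the central node.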

  5. Metrology Outlier Detection AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Metrology Outlier Detection AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/metrology-outlier-detection-ai-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Metrology Outlier Detection AI Market Outlook



    According to our latest research, the global Metrology Outlier Detection AI market size reached USD 1.18 billion in 2024, reflecting rapid adoption across high-precision industries. The market is expanding at a robust CAGR of 18.4% and is projected to attain a value of USD 5.53 billion by 2033. This impressive growth is primarily driven by the increasing demand for automated quality assurance and defect detection across manufacturing and high-tech sectors, as organizations strive to optimize processes and reduce costs while maintaining stringent accuracy standards.




    One of the primary growth factors propelling the Metrology Outlier Detection AI market is the surge in demand for advanced quality control solutions in semiconductor manufacturing and electronics industries. As these sectors face mounting pressure to deliver flawless products with microscopic tolerances, traditional metrology tools are often insufficient for detecting subtle anomalies. The integration of AI-based outlier detection into metrology systems enables real-time identification of defects and process deviations, significantly improving yield rates and reducing waste. Furthermore, the proliferation of smart factories and Industry 4.0 initiatives is compelling manufacturers to adopt intelligent metrology solutions that leverage machine learning algorithms, computer vision, and big data analytics to drive continuous process improvements and predictive maintenance.




    Another crucial driver is the increasing complexity of products in automotive, aerospace, and healthcare sectors. Modern vehicles, aircraft, and medical devices involve intricate assemblies and rely on components manufactured to exacting specifications. Even minor deviations can result in significant safety, performance, or regulatory issues. AI-powered metrology outlier detection systems provide a scalable and adaptive approach to monitoring production quality, detecting anomalies that might escape conventional inspection techniques. This capability not only ensures compliance with international standards but also enhances brand reputation and customer trust. The rising adoption of digital twins and simulation-driven design further amplifies the need for robust AI-driven metrology, as organizations seek to bridge the gap between virtual models and physical outcomes.




    The market is also benefiting from advancements in sensor technologies, edge computing, and cloud-based analytics platforms. These innovations enable seamless integration of AI-driven outlier detection into existing manufacturing and quality control workflows, facilitating real-time data acquisition, processing, and visualization. The availability of scalable cloud infrastructure allows enterprises of all sizes to leverage sophisticated AI models without incurring prohibitive upfront costs. Additionally, partnerships between AI solution providers and metrology equipment manufacturers are accelerating the development of turnkey systems tailored to specific industry requirements. As a result, the barrier to entry for implementing AI in metrology is rapidly diminishing, fueling widespread adoption across both established players and emerging entrants in the market.




    From a regional perspective, Asia Pacific remains the dominant force in the Metrology Outlier Detection AI market, accounting for the largest share in 2024. This is attributed to the region's strong presence in semiconductor manufacturing, electronics, and automotive industries, particularly in countries such as China, Japan, South Korea, and Taiwan. North America and Europe are also witnessing significant growth, driven by technological advancements, robust R&D ecosystems, and stringent quality regulations in aerospace and healthcare. Meanwhile, the Middle East & Africa and Latin America are gradually emerging as promising markets, supported by increasing investments in industrial automation and quality infrastructure. The interplay of regional dynamics, industry-specific challenges, and evolving regulatory landscapes will continue to shape the trajectory of the global market over the coming years.



    Component Analysis



    The Metrology Outlier Detection AI market by component is segmented into Software, Hardware, and Services, each playing a vital role in the overall ecosystem. The software segment dominates the market, accounting for the largest share in 2024. This is primarily due to the rapid advancemen

  6. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 12, 2025
    Cite
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Anomaly Detection Market Size 2025-2029

    The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.

    Major Market Trends & Insights

    North America dominated the market and accounted for a 43% growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
    By Component - Solution segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 173.26 million
    Market Future Opportunities: USD 4441.70 million
    CAGR from 2024 to 2029 : 14.4%
    

    Market Summary

    Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity frauds necessitates the need for robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
    According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
    

    What will be the Size of the Anomaly Detection Market during the forecast period?


    How is the Anomaly Detection Market Segmented?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
      Cloud
      On-premises
    Component
      Solution
      Services
    End-user
      BFSI
      IT and telecom
      Retail and e-commerce
      Manufacturing
      Others
    Technology
      Big data analytics
      AI and ML
      Data mining and business intelligence
    Geography
      North America
        US
        Canada
        Mexico
      Europe
        France
        Germany
        Spain
        UK
      APAC
        China
        India
        Japan
      Rest of World (ROW)

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

    This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered with flexible payment options such as monthly subscriptions and pay-as-you-go models. Companies such as Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

    Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.

  7. Water quality test data

    • scidb.cn
    Updated Oct 26, 2022
    Cite
    HuiyunFeng; JingangJiang (2022). Water quality test data [Dataset]. http://doi.org/10.57760/sciencedb.05375
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 26, 2022
    Dataset provided by
    Science Data Bank
    Authors
    HuiyunFeng; JingangJiang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outliers are often present in large datasets of water quality monitoring time series data. A method of combining the sliding window technique with Dixon detection criterion for the automatic detection of outliers in time series data is limited by the empirical determination of sliding window sizes. The scientific determination of the optimal sliding window size is very meaningful research work. This paper presents a new Monte Carlo Search Method (MCSM) based on random sampling to optimize the size of the sliding window, which fully takes advantage of computers and statistics. The MCSM was applied in a case study to automatic monitoring data of water quality factors in order to test its validity and usefulness. The results of comparing the accuracy and efficiency of the MCSM show that the new method in this paper is scientific and effective. The experimental results show that, at different sample sizes, the average accuracy is between 58.70% and 75.75%, and the average computation time increase is between 17.09% and 45.53%. In the era of big data in environmental monitoring, the proposed new methods can meet the required accuracy of outlier detection and improve the efficiency of calculation.
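
The baseline that the paper optimizes, a sliding window combined with the Dixon criterion, can be sketched as below. This is an illustration under assumptions, not the authors' MCSM code: it uses the simple Dixon Q ratio (gap to the nearest neighbour divided by the range), the commonly tabulated two-sided 95% critical values for windows of 3 to 10 points, and a fixed window size where the paper would search for an optimal one. The function names are hypothetical.

```python
# Commonly tabulated Dixon Q critical values at 95% confidence (assumed here).
Q95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
       7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}


def dixon_outlier_index(window):
    """Return the position in window rejected by Dixon's Q test, or None."""
    n = len(window)
    order = sorted(range(n), key=lambda i: window[i])
    low, high = window[order[0]], window[order[-1]]
    spread = high - low
    if spread == 0:
        return None
    # Gap between each extreme and its nearest neighbour, relative to range.
    q_low = (window[order[1]] - low) / spread
    q_high = (high - window[order[-2]]) / spread
    if max(q_low, q_high) <= Q95[n]:
        return None
    return order[0] if q_low > q_high else order[-1]


def sliding_window_outliers(series, window_size=7):
    """Flag indices rejected by Dixon's test in any window of the series."""
    flagged = set()
    for start in range(len(series) - window_size + 1):
        position = dixon_outlier_index(series[start:start + window_size])
        if position is not None:
            flagged.add(start + position)
    return sorted(flagged)
```

Because the result depends on `window_size`, choosing that value empirically is exactly the limitation the description points out; a search procedure such as the MCSM would wrap a routine like this one and score candidate window sizes.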

  8. Anolis carolinensis character displacement SNP

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jan 27, 2023
    Cite
    Douglas Crawford (2023). Anolis carolinensis character displacement SNP [Dataset]. http://doi.org/10.5061/dryad.qbzkh18ks
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 27, 2023
    Dataset provided by
    University of Miami
    Authors
    Douglas Crawford
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Here are six files that provide details for all 44,120 identified single nucleotide polymorphisms (SNPs) or the 215 outlier SNPs associated with the evolution of rapid character displacement among replicate islands with (2Spp) and without (1Spp) competition between two Anolis species. On 2Spp islands, A. carolinensis occurs higher in trees and has evolved larger toe pads. Among 1Spp and 2Spp island populations, we identify 44,120 SNPs, including 215 outlier SNPs with improbably large FST values, low nucleotide variation, and greater linkage than expected; these SNPs are enriched for animal walking behavior. Thus, we conclude that these 215 outliers are evolving by natural selection in response to the phenotypic convergent evolution of character displacement.

    There are two non-mutually exclusive perspectives on these nucleotide variants. First, character displacement is convergent: all 215 outlier SNPs are shared among 3 out of 5 2Spp islands, and 24% of outlier SNPs are shared among all five 2Spp islands. Second, character displacement is genetically redundant because the allele frequencies in one or more 2Spp islands are similar to those of 1Spp islands: among one or more 2Spp islands, 33% of outlier SNPs are within the range of 1Spp MiAF, and 76% of outliers are more similar to 1Spp islands than to the mean MiAF of 2Spp islands. Focusing on convergent SNPs is scientifically more robust, yet it distracts from the perspective of multiple genetic solutions that enhance the rate and stability of adaptive change.

    The six files include a description of eight islands, details of 94 individuals, and four files on SNPs. The four SNP files include the VCF files for 94 individuals with 44K SNPs and two files (Excel sheet/tab-delimited file) with FST, p-values and outlier status for all 44,120 identified single nucleotide polymorphisms (SNPs) associated with the evolution of rapid character displacement. The sixth file is a detailed file on the 215 outlier SNPs.
    Complete sequence data is available at Bioproject PRJNA833453, which includes samples not included in this study. The 94 individuals used in this study are described in “Supplemental_Sample_description.txt”.

    Methods

    Anoles and genomic DNA: Tissue or DNA for 160 Anolis carolinensis and 20 A. sagrei samples were provided by the Museum of Comparative Zoology at Harvard University (Table S2). Samples were previously used to examine evolution of character displacement in native A. carolinensis following invasion by A. sagrei onto man-made spoil islands in Mosquito Lagoon, Florida (Stuart et al. 2014). One hundred samples were genomic DNAs, and 80 samples were tissues (terminal tail clip, Table S2). Genomic DNA was isolated from 80 of 160 A. carolinensis individuals (MCZ, Table S2) using a custom SPRI magnetic bead protocol (Psifidi et al. 2015). Briefly, after removing ethanol, tissues were placed in 200 ul of GH buffer (25 mM Tris-HCl pH 7.5, 25 mM EDTA, 2 M GuHCl (guanidine hydrochloride, G3272 SIGMA), 5 mM CaCl2, 0.5% v/v Triton X-100, 1% N-Lauroyl-Sarcosine) with 5% per volume of 20 mg/ml proteinase K (10 ul/200 ul GH) and digested at 55 °C for at least 2 hours. After proteinase K digestion, 100 ul of 0.1% carboxyl-modified Sera-Mag Magnetic beads (Fisher Scientific) resuspended in 2.5 M NaCl, 20% PEG were added and allowed to bind the DNA. Beads were subsequently magnetized and washed twice with 200 ul 70% EtOH, and then DNA was eluted in 100 ul 0.1x TE (10 mM Tris, 0.1 mM EDTA). All DNA samples were gel electrophoresed to ensure high molecular mass and quantified by spectrophotometry and fluorescence using Biotium AccuBlue™ High Sensitivity dsDNA Quantitative Solution according to the manufacturer’s instructions.

    Genotyping-by-sequencing (GBS) libraries were prepared using a modified protocol after Elshire et al. (Elshire et al. 2011). Briefly, high-molecular-weight genomic DNA was aliquoted and digested using ApeKI restriction enzyme.
    Digests from each individual sample were uniquely barcoded, pooled, and size selected to yield insert sizes between 300-700 bp (Borgstrom et al. 2011). Pooled libraries were PCR amplified (15 cycles) using custom primers that extend into the genomic DNA insert by 3 bases (CTG). Adding 3 extra base pairs systematically reduces the number of sequenced GBS tags, ensuring sufficient sequencing depth. The final library had a mean size of 424 bp, ranging from 188 to 700 bp.

    Anolis SNPs: Pooled libraries were sequenced on one lane on the Illumina HiSeq 4000 in 2x150 bp paired-end configuration, yielding approximately 459 million paired-end reads (~138 Gb). The medium Q-Score was 42 with the lower 10% Q-Scores exceeding 32 for all 150 bp. The initial library contained 180 individuals with 8,561,493 polymorphic sites. Twenty individuals were Anolis sagrei, and two individuals (Yan 1610 & Yin 1411) clustered with A. sagrei and were not used to define A. carolinensis’ SNPs. Anolis carolinensis reads were aligned to the Anolis carolinensis genome (NCBI RefSeq accession number: GCF_000090745.1_AnoCar2.0). Single nucleotide polymorphisms (SNPs) for A. carolinensis were called using the GBeaSy analysis pipeline (Wickland et al. 2017) with the following filter settings: minimum read length of 100 bp after barcode and adapter trimming, minimum phred-scaled variant quality of 30 and minimum read depth of 5. SNPs were further filtered by requiring SNPs to occur in > 50% of individuals, and 66 individuals were removed because they had less than 70% of called SNPs. These filtering steps resulted in 51,155 SNPs among 94 individuals. Final filtering among 94 individuals required all sites to be polymorphic (with fewer individuals, some sites were no longer polymorphic) with a maximum of 2 alleles (all are bi-allelic), minimal allele frequency 0.05, and He that does not exceed HWE (FDR <0.01). SNPs with large He were removed (2,280 SNPs).
    These SNPs with large significant heterozygosity may result from aligning paralogues (different loci), and thus may not represent polymorphisms. No SNPs were removed with low He (due to possible demography or other exceptions to HWE). After filtering, 94 individuals yielded 44,120 SNPs. Thus, the final filtered SNP data set was 44K SNPs from 94 individuals.

    Statistical Analyses: Eight A. carolinensis populations were analyzed: three populations from islands with native species only (1Spp islands) and 5 populations from islands where A. carolinensis co-exists with A. sagrei (2Spp islands, Table 1, Table S1). Most analyses pooled the three 1Spp islands and contrasted these with the pooled five 2Spp islands. Two approaches were used to define SNPs with unusually large allele frequency differences between 1Spp and 2Spp islands: 1) comparison of FST values to random permutations and 2) a modified FDIST approach to identify outlier SNPs with large and statistically unlikely FST values.

    Random Permutations: FST values were calculated in VCFTools (version 4.2, (Danecek et al. 2011)), where the p-value per SNP was defined by comparing FST values to 1,000 random permutations using a custom script (below). Basically, individuals and all their SNPs were randomly assigned to one of eight islands or to 1Spp versus 2Spp groups. The sample sizes (55 for 2Spp and 39 for 1Spp islands) were maintained. FST values were re-calculated for each of the 1,000 randomizations using VCFTools.

    Modified FDIST: To identify outlier SNPs with statistically large FST values, a modified FDIST (Beaumont and Nichols 1996) was implemented in Arlequin (Excoffier et al. 2005). This modified approach applies 50,000 coalescent simulations using hierarchical population structure, in which demes are arranged into k groups of d demes and in which migration rates between demes are different within and between groups.
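
The random-permutation p-values described above can be illustrated with a short sketch. It is not the authors' custom script and does not call VCFTools: as a labeled stand-in for FST it uses the absolute difference in alternate-allele frequency between the 1Spp and 2Spp groups, and the add-one p-value correction is a common convention assumed here. The function name and input layout are hypothetical.

```python
import random


def permutation_p_value(genotypes, group_labels, n_permutations=1000, seed=0):
    """Permutation p-value for one SNP.

    genotypes: alternate-allele counts (0, 1 or 2) per diploid individual.
    group_labels: "1Spp" or "2Spp" per individual, in the same order.
    """
    def frequency_difference(labels):
        one = [g for g, lab in zip(genotypes, labels) if lab == "1Spp"]
        two = [g for g, lab in zip(genotypes, labels) if lab == "2Spp"]
        # Stand-in statistic (assumption): allele-frequency gap, not FST.
        return abs(sum(one) / (2 * len(one)) - sum(two) / (2 * len(two)))

    rng = random.Random(seed)
    observed = frequency_difference(group_labels)
    shuffled = list(group_labels)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling labels reassigns individuals while keeping group sizes.
        rng.shuffle(shuffled)
        if frequency_difference(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

Shuffling the label list, rather than drawing labels independently, is what maintains the 39 versus 55 sample sizes mentioned in the text.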
Unlike the finite island models, which have led to large frequencies of false positive because populations share different histories (Lotterhos and Whitlock 2014), the hierarchical island model avoids these false positives by avoiding the assumption of similar ancestry (Excoffier et al. 2009). References Beaumont, M. A. and R. A. Nichols. 1996. Evaluating loci for use in the genetic analysis of population structure. P Roy Soc B-Biol Sci 263:1619-1626. Borgstrom, E., S. Lundin, and J. Lundeberg. 2011. Large scale library generation for high throughput sequencing. PLoS One 6:e19119. Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635. Cingolani, P., A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6:80-92. Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and G. Genomes Project Analysis. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158. Earl, D. A. and B. M. vonHoldt. 2011. Structure Harvester: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359-361. Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379. Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620. Excoffier, L., T. Hofer, and M. Foll. 2009. 
Detecting loci under selection in a hierarchically structured population. Heredity 103:285-298. Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis.
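
The random-permutation step in the description above can be illustrated with a minimal sketch. It is not the authors' custom script and does not call VCFTools: the FST estimator here is a simple heterozygosity-based one chosen for brevity, the genotype matrix layout (individuals x SNPs, coded 0/1/2) is an assumption, and the function names `fst_two_groups` and `permutation_p_values` are hypothetical. The add-one correction in the p-value is also a choice of this sketch.

```python
import numpy as np

def fst_two_groups(geno, is_group1):
    """Per-SNP FST between two groups.

    geno: (individuals x SNPs) array of alternate-allele counts (0, 1, 2).
    is_group1: boolean mask over individuals (e.g. True = 1Spp island).
    Uses FST = (H_T - H_S) / H_T, a simple estimator assumed for illustration.
    """
    n1, n2 = is_group1.sum(), (~is_group1).sum()
    p1 = geno[is_group1].mean(axis=0) / 2.0
    p2 = geno[~is_group1].mean(axis=0) / 2.0
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = (n1 * 2.0 * p1 * (1.0 - p1) + n2 * 2.0 * p2 * (1.0 - p2)) / (n1 + n2)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Monomorphic sites (h_total == 0) get FST = 0.
        return np.where(h_total > 0, (h_total - h_within) / h_total, 0.0)

def permutation_p_values(geno, is_group1, n_perm=1000, seed=0):
    """Compare observed per-SNP FST to FST under shuffled group labels."""
    rng = np.random.default_rng(seed)
    observed = fst_two_groups(geno, is_group1)
    exceed = np.zeros(geno.shape[1], dtype=int)
    for _ in range(n_perm):
        # Shuffling the mask reassigns whole individuals (all their SNPs)
        # while keeping the two group sizes fixed.
        shuffled = rng.permutation(is_group1)
        exceed += fst_two_groups(geno, shuffled) >= observed
    return observed, (exceed + 1) / (n_perm + 1)
```

With the sample sizes given above, `is_group1` would hold 39 `True` entries (1Spp) and 55 `False` entries (2Spp); shuffling the mask preserves those counts in every permutation.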

  9. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • dataverse.no
    • dataverse.azure.uit.no
    • +1 more
    Updated May 31, 2017
    Cite
    Einar Holsbø (2017). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS
    Explore at:
    Available download formats: tsv(309098854), txt(3680), tsv(43988), tsv(633), tsv(8212), tsv(271314861), application/x-rlang-transport(269), tsv(6583989), type/x-r-syntax(3194), tsv(198012971), tsv(40), application/x-rlang-transport(955932860)
    Dataset updated
    May 31, 2017
    Dataset provided by
    DataverseNO
    Authors
    Einar Holsbø
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets." (In submission) The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details

  10. Deep one-class learning: a deep learning approach to anomaly detection

    • resodate.org
    Updated Oct 8, 2021
    Cite
    Lukas Ruff (2021). Deep one-class learning: a deep learning approach to anomaly detection [Dataset]. http://doi.org/10.14279/depositonce-12250
    Explore at:
    Dataset updated
    Oct 8, 2021
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Lukas Ruff
    Description

    Anomaly detection is the problem of identifying unusual patterns in data. This problem is relevant for a wide variety of applications in various domains such as fault and damage detection in manufacturing, fraud detection in finance and insurance, intrusion detection in cybersecurity, disease detection in medical diagnosis, or scientific discovery. Many of these applications involve increasingly complex data at large scale, for instance, large collections of images or text. The lack of effective solutions in such settings has sparked an interest in developing anomaly detection methods based on deep learning, which has enabled breakthroughs in other machine learning problems that involve large amounts of complex data. This thesis proposes Deep One-Class Learning, a deep learning approach to anomaly detection that is based on the one-class classification paradigm. One-class classification views anomaly detection from a classification perspective, aiming to learn a discriminative decision boundary that separates the normal from the anomalous data. In contrast to previous methods that rely on fixed (usually manually engineered) features, deep one-class learning expands the one-class classification approach with methods that learn (or transfer) data representations via suitable one-class learning objectives. The key idea underlying deep one-class learning is to learn a transformation (e.g., a deep neural network) in such a way that the normal data points are concentrated in feature space, causing anomalies to deviate from the concentrated region, thereby making them detectable. We introduce several deep one-class learning methods in this thesis that follow the above idea while integrating different assumptions about the data or a specific domain. These include semi-supervised variants that can incorporate labeled anomalies, for example, or specific methods for images and text that enable model interpretability and an explanation of anomalies. 
Moreover, we present a unifying view of anomaly detection methods that, in addition to one-class classification, also covers reconstruction methods as well as methods based on density estimation and generative modeling. For each of these main approaches, we identify connections between respective deep and "shallow" methods based on common underlying principles. Through multiple experiments and analyses, we demonstrate that deep one-class learning is useful for anomaly detection, especially on semantic detection tasks. Finally, we conclude this thesis by discussing limits of the proposed approach and outlining specific paths for future research.
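
The key idea in the description above (normal points concentrate in feature space, anomalies deviate from that region) can be sketched as a distance-to-center score. This is only an illustration under stated assumptions: the feature map below is a fixed, hand-written matrix `W` standing in for a trained network, and `features`, `fit_center`, and `anomaly_score` are hypothetical names. No training of the transformation is shown.

```python
import numpy as np

# Hypothetical fixed feature map; in the thesis setting this would be a
# learned transformation such as a deep neural network.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0],
              [0.5, 0.5]])

def features(x):
    """Map inputs of shape (n, 5) into a 2-dimensional feature space."""
    return np.tanh(x @ W)

def fit_center(normal_x):
    """Center of the region where normal data concentrate (mean feature)."""
    return features(normal_x).mean(axis=0)

def anomaly_score(x, center):
    """Squared distance to the center; larger means more anomalous."""
    return ((features(x) - center) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
normal_x = rng.normal(scale=0.01, size=(200, 5))   # tightly clustered normal data
center = fit_center(normal_x)
outlier = np.full((1, 5), 3.0)                     # a point far from the cluster
print(anomaly_score(normal_x, center).max(), anomaly_score(outlier, center)[0])
```

A threshold on the score (for example a high quantile of the scores on normal data) would then separate normal from anomalous points.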

  11. A Bayesian Outlier Criterion to Detect SNPs under Selection in Large Data...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley (2023). A Bayesian Outlier Criterion to Detect SNPs under Selection in Large Data Sets [Dataset]. http://doi.org/10.1371/journal.pone.0011913
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged. Methodology/Principal Findings: The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first one was initially interpreted in the context of populations evolving respectively under pure genetic drift from a common ancestral population while the second one relies on populations under migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNP under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided. Conclusions/Significance: The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient especially to detect loci under positive selection.

  12. Number of statistics, number of errors, number of large errors, and number...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    + more versions
    Cite
    Marjan Bakker; Jelte M. Wicherts (2023). Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers. [Dataset]. http://doi.org/10.1371/journal.pone.0103360.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marjan Bakker; Jelte M. Wicherts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers.

  13. Data from: Localizing FST outliers on a QTL map reveals evidence for large...

    • datadryad.org
    • data.niaid.nih.gov
    • +2 more
    zip
    Updated Aug 17, 2012
    Cite
    Sara Via; Gina Conte; Casey Mason-Foley; Kelly Mills (2012). Localizing FST outliers on a QTL map reveals evidence for large genomic regions of reduced gene exchange during speciation-with-gene-flow [Dataset]. http://doi.org/10.5061/dryad.9cf75
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 17, 2012
    Dataset provided by
    Dryad
    Authors
    Sara Via; Gina Conte; Casey Mason-Foley; Kelly Mills
    Time period covered
    Jul 18, 2012
    Area covered
    North America, New York
    Description

    Populations that maintain phenotypic divergence in sympatry typically show a mosaic pattern of genomic divergence, requiring a corresponding mosaic of genomic isolation (reduced gene flow). However, mechanisms that could produce the genomic isolation required for divergence-with-gene-flow have barely been explored, apart from the traditional localized effects of selection and reduced recombination near centromeres or inversions. By localizing FST outliers from a genome scan of wild pea aphid host races on a Quantitative Trait Locus (QTL) map of key traits, we test the hypothesis that between-population recombination and gene exchange are reduced over large ‘divergence hitchhiking’ (DH) regions. As expected under divergence hitchhiking, our map confirms that QTL and divergent markers cluster together in multiple large genomic regions. Under divergence hitchhiking, the nonoutlier markers within these regions should show signs of reduced gene exchange relative to nonoutlier markers in geno...

  14. Data from: Expected total thyroxine (TT4) concentrations and outlier values...

    • zenodo.org
    • datadryad.org
    Updated May 31, 2022
    Cite
    Maya Lottati; David Bruyette; David Aucoin (2022). Data from: Expected total thyroxine (TT4) concentrations and outlier values in 531,765 cats in the United States (2014-2015) [Dataset]. http://doi.org/10.5061/dryad.m6f721d
    Explore at:
    Dataset updated
    May 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maya Lottati; David Bruyette; David Aucoin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Background: Levels exceeding the standard reference interval (RI) for total thyroxine (TT4) concentrations are diagnostic for hyperthyroidism; however, some hyperthyroid cats have TT4 values within the RI. Determining outlier TT4 concentrations should aid practitioners in identification of hyperthyroidism. The objective of this study was to determine the expected distribution of TT4 concentration using a large population of cats (531,765) of unknown health status to identify unexpected TT4 concentrations (outliers), and determine whether this concentration changes with age. Methodology/Principal Findings: This study is a population-based, retrospective study evaluating an electronic database of laboratory results to identify unique TT4 measurements between January 2014 and July 2015. An expected distribution of TT4 concentrations was determined using a large population of cats (531,765) of unknown health status, and this in turn was used to identify unexpected TT4 concentrations (outliers) and determine whether this concentration changes with age. All cats between the ages of 1 and 9 years (n=141,294) had the same expected distribution of TT4 concentration (0.5-3.5 ug/dL), and cats with a TT4 value >3.5 ug/dL were determined to be unexpected outliers. There was a steep and progressive rise in both the total number and percentage of statistical outliers in the feline population as a function of age. The greatest acceleration in the percentage of outliers occurred between the ages of 7 and 14 years, which was up to 4.6 times the rate seen between the ages of 3 and 7 years. Conclusions: TT4 concentrations >3.5 ug/dL represent outliers from the expected distribution of TT4 concentration. Furthermore, age has a strong influence on the proportion of cats. These findings suggest that patients with TT4 concentrations >3.5 ug/dL should be more closely evaluated for hyperthyroidism, particularly between the ages of 7 and 14 years. 
This finding may aid clinicians in earlier identification of hyperthyroidism in at-risk patients.

  15. Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture...

    • tandf.figshare.com
    zip
    Updated Nov 26, 2024
    Cite
    Yijia Zhou; Kyle A. Gallivan; Adrian Barbu (2024). Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.27226247.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Yijia Zhou; Kyle A. Gallivan; Adrian Barbu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This article introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for k-means clustering. Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm when clustering a large number of clusters, and a k-means algorithm initialized by the algorithm outperforms many of the classic clustering methods in both speed and accuracy, while scaling well to large datasets such as ImageNet. Supplementary materials for this article are available online.

  16. Comparison of the power and robustness of models 2 and 3 on simulated data...

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley (2023). Comparison of the power and robustness of models 2 and 3 on simulated data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0011913.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the power and robustness of models 2 and 3 on simulated data sets.

  17. Parameters of the distribution of the PPP-values obtained with model 1 and...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley (2023). Parameters of the distribution of the PPP-values obtained with model 1 and model 2 under the null hypothesis. [Dataset]. http://doi.org/10.1371/journal.pone.0011913.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mathieu Gautier; Toby Dylan Hocking; Jean-Louis Foulley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Parameters of the distribution of the PPP-values obtained with model 1 and model 2 under the null hypothesis.

  18. Data from: Robust Bayesian Modeling of Counts with Zero Inflation and...

    • tandf.figshare.com
    bin
    Updated Jan 3, 2025
    Cite
    Yasuyuki Hamura; Kaoru Irie; Shonosuke Sugasawa (2025). Robust Bayesian Modeling of Counts with Zero Inflation and Outliers: Theoretical Robustness and Efficient Computation [Dataset]. http://doi.org/10.6084/m9.figshare.28131658.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Yasuyuki Hamura; Kaoru Irie; Shonosuke Sugasawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Count data with zero inflation and large outliers are ubiquitous in many scientific applications. However, posterior analysis under a standard statistical model, such as Poisson or negative binomial distribution, is sensitive to such contamination. This study introduces a novel framework for Bayesian modeling of counts that is robust to both zero inflation and large outliers. In doing so, we introduce rescaled beta distribution and adopt it to absorb undesirable effects from zero and outlying counts. The proposed approach has two appealing features: the efficiency of the posterior computation via a custom Gibbs sampling algorithm and a theoretically guaranteed posterior robustness, where extreme outliers are automatically removed from the posterior distribution. We demonstrate the usefulness of the proposed method by applying it to trend filtering and spatial modeling using predictive Gaussian processes.

  19. Goodness-of-fit filtering in classical metric multidimensional scaling with...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jan Graffelman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
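
    The eigenvalue-based overall goodness-of-fit mentioned above can be sketched as follows. This is a minimal illustration of classical metric MDS, not the paper's refined point and pairwise statistics; the function name `classical_mds` is hypothetical, and dividing by the sum of absolute eigenvalues is one common convention assumed here so that non-Euclidean distance matrices (which produce negative eigenvalues) are handled.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical metric MDS on an (n x n) distance matrix.

    Returns the k-dimensional coordinates and the overall goodness-of-fit,
    taken here as (sum of the k largest eigenvalues) / (sum of |eigenvalues|).
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix of squared distances.
    b = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = eigvecs[:, :k] * np.sqrt(np.clip(eigvals[:k], 0.0, None))
    gof = eigvals[:k].sum() / np.abs(eigvals).sum()
    return coords, gof
```

    For points that truly lie in a plane with Euclidean distances, a two-dimensional solution reproduces the distances and the goodness-of-fit is essentially 1; lower values indicate that the map omits structure.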

  20. Case number of outliers by district for 5 STD from the median.

    • plos.figshare.com
    xls
    Updated Sep 17, 2025
    Cite
    Lynn Muhimbuura Atuyambe; Justine N. Bukenya; Samuel Etajak; Jesca Nsungwa-Sabiiti; Richard Mugahi; Paul Mbaka; Onikepe Owolabi; Sharon Kim-Gibbons; Kristy Friesen; Arthur Bagonza (2025). Case number of outliers by district for 5 STD from the median. [Dataset]. http://doi.org/10.1371/journal.pone.0329842.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lynn Muhimbuura Atuyambe; Justine N. Bukenya; Samuel Etajak; Jesca Nsungwa-Sabiiti; Richard Mugahi; Paul Mbaka; Onikepe Owolabi; Sharon Kim-Gibbons; Kristy Friesen; Arthur Bagonza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Case number of outliers by district for 5 STD from the median.

Cite
Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2

MNIST dataset for Outliers Detection - [ MNIST4OD ]

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: application/gzip
Dataset updated
May 17, 2024
Dataset provided by
figshare
Figshare (http://figshare.com/)
Authors
Giovanni Stilo; Bardh Prenkaj
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the Outliers Detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way: To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such that their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

See the following numbers for a complete list of the statistics of each dataset:

Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
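
A file laid out as described above could be read with a sketch like the following. Several details are assumptions rather than facts from the description: that the files have no header row, that the 784 pixel columns come first, and that the original-class column sits immediately before the outlier label (the description only states that the label is the last column). The function name `load_mnist4od` is hypothetical.

```python
import pandas as pd

def load_mnist4od(path_or_buffer, n_pixels=784):
    """Read one MNIST_x.csv.gz file into pixels, original class, and label.

    Assumed layout (not confirmed by the description): no header row,
    n_pixels pixel columns first, then the original image class,
    then the outlier label as the last column.
    """
    table = pd.read_csv(path_or_buffer, header=None, compression="gzip")
    pixels = table.iloc[:, :n_pixels].to_numpy()
    original_class = table.iloc[:, -2].to_numpy()
    outlier_label = table.iloc[:, -1].to_numpy()
    return pixels, original_class, outlier_label
```

If the real files carry a header row or a different column order, the `header` argument and the column indices would need to be adjusted accordingly.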
