84 datasets found
  1. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
Available download formats: application/gzip
    Dataset updated
    May 17, 2024
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample with uniform probability from the remaining images as outliers, such that their number equals 10% of the number of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 × 28) into vectors. Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point; another column indicates the original image class (0-9). The statistics of each dataset are listed below:

Name | Instances | Dimensions | Outliers (%)
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
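As a rough illustration of the construction described above, here is a minimal sketch in Python (the `images_by_digit` structure and function name are hypothetical; the published artifacts are the MNIST_x.csv.gz files):

```python
import random

def build_outlier_dataset(images_by_digit, inlier_digit, outlier_frac=0.10, seed=0):
    """Sketch of the MNIST4OD construction: one digit's images are the
    inliers; outliers are drawn uniformly from all other digits until they
    amount to `outlier_frac` of the inlier count. `images_by_digit` is a
    hypothetical dict {digit: [flattened 784-vector, ...]}."""
    rng = random.Random(seed)
    inliers = [(vec, "no", inlier_digit) for vec in images_by_digit[inlier_digit]]
    pool = [(vec, "yes", d) for d, vecs in images_by_digit.items()
            if d != inlier_digit for vec in vecs]
    n_out = round(len(inliers) * outlier_frac)
    rows = inliers + rng.sample(pool, n_out)
    rng.shuffle(rows)
    return rows  # each row: (vector, outlier label yes/no, original class)
```

Repeating this for each `inlier_digit` in 0..9 reproduces the ten-file layout listed above.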

2. Find Outliers GRM

    • hub.arcgis.com
    Updated Aug 8, 2020
    Cite
    Tippecanoe County Assessor Hub Community (2020). Find Outliers GRM [Dataset]. https://hub.arcgis.com/maps/tippecanoehub::find-outliers-grm
    Explore at:
    Dataset updated
    Aug 8, 2020
    Dataset authored and provided by
    Tippecanoe County Assessor Hub Community
    Area covered
    Description

The following report outlines the workflow used to optimize your Find Outliers result:

Initial Data Assessment. There were 721 valid input features. GRM properties: Min 0.0000, Max 157.0200, Mean 9.1692, Std. Dev. 8.4220. There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis. The optimal fixed distance band selected was based on peak clustering found at 1894.5039 meters.

Outlier Analysis. Creating the random reference distribution with 499 permutations. There are 248 output features statistically significant based on an FDR correction for multiple testing and spatial dependence: 30 statistically significant high outlier features, 7 statistically significant low outlier features, 202 features part of statistically significant low clusters, and 9 features part of statistically significant high clusters.

Output. Pink output features are part of a cluster of high GRM values. Light blue output features are part of a cluster of low GRM values. Red output features represent high outliers within a cluster of low GRM values. Blue output features represent low outliers within a cluster of high GRM values.
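The report describes an Anselin Local Moran's I analysis. As a toy illustration of the permutation step and the cluster/outlier labels (not the ArcGIS implementation; the neighbor structure and the 0.05 significance threshold are assumptions, and no FDR correction is applied here), one might sketch:

```python
import random

def local_moran(values, neighbors, permutations=499, seed=0):
    """Toy sketch of a Local Moran's I workflow: compute each feature's
    local statistic against its neighbors, then build a random reference
    distribution by conditionally permuting the other values.
    `neighbors[i]` lists the indices within the fixed distance band of
    feature i (a hypothetical precomputed structure)."""
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    var = sum(d * d for d in dev) / n
    results = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            results.append((0.0, 1.0, "not significant"))
            continue
        lag = sum(dev[j] for j in nbrs) / len(nbrs)
        I = dev[i] * lag / var
        others = [dev[j] for j in range(n) if j != i]
        hits = 0
        for _ in range(permutations):
            sample = rng.sample(others, len(nbrs))
            I_perm = dev[i] * (sum(sample) / len(nbrs)) / var
            if abs(I_perm) >= abs(I):
                hits += 1
        p = (hits + 1) / (permutations + 1)  # pseudo p-value
        if p > 0.05:
            label = "not significant"
        elif I > 0:  # value agrees with its neighborhood: cluster
            label = "high cluster" if dev[i] > 0 else "low cluster"
        else:        # value disagrees with its neighborhood: outlier
            label = "high outlier" if dev[i] > 0 else "low outlier"
        results.append((I, p, label))
    return results
```

The four labels map onto the report's pink/light-blue (clusters) and red/blue (outliers) output classes.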

3. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  4. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Explore at:
Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Shuxiao Chen; Jacob Bien
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
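For concreteness, the “detect-and-forget” pipeline the abstract critiques can be sketched for simple regression as follows (the residual-threshold rule is a hypothetical stand-in for any removal procedure; the paper's point is that intervals computed after such a refit are invalid unless corrected):

```python
import statistics

def ols(x, y):
    """Closed-form OLS intercept and slope for simple regression."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def detect_and_forget(x, y, k=3.0):
    """The 'detect-and-forget' pipeline: fit OLS, drop points whose
    residuals exceed k standardized units, then refit on the survivors
    as if they were the original sample."""
    a, b = ols(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = statistics.stdev(resid)
    keep = [(xi, yi) for xi, yi, r in zip(x, y, resid) if abs(r) <= k * s]
    return ols([p[0] for p in keep], [p[1] for p in keep])
```

The selective-inference correction the article develops conditions on which points were removed; the naive refit above ignores that selection event.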

  5. Outlier classification using autoencoders: application for fluctuation...

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  6. Cost of living(Treat Outliers)

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    Bharat Gokhale (2023). Cost of living(Treat Outliers) [Dataset]. https://www.kaggle.com/datasets/bharatgokhale/cost-of-livingtreat-outliers
    Explore at:
Available download formats: zip (14244 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    Bharat Gokhale
    Description

    Dataset

    This dataset was created by Bharat Gokhale

    Contents

  7. Modified ZScore to detect outliers - Practice

    • kaggle.com
    zip
    Updated Nov 12, 2023
    Cite
    Panagiotis Prassas (2023). Modified ZScore to detect outliers - Practice [Dataset]. https://www.kaggle.com/datasets/panagiotisprassas/modified-zscore-to-detect-outliers-practice
    Explore at:
Available download formats: zip (17836 bytes)
    Dataset updated
    Nov 12, 2023
    Authors
    Panagiotis Prassas
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Panagiotis Prassas

    Released under Apache 2.0

    Contents
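Although this entry carries no description, the modified Z-score it practices is a standard robust rule (due to Iglewicz and Hoaglin): replace mean and standard deviation with median and median absolute deviation (MAD), and flag values whose score exceeds 3.5. A minimal sketch:

```python
import statistics

def modified_zscores(xs):
    """Modified Z-score: uses the median and the median absolute
    deviation instead of mean/stdev, so a few extreme values cannot
    inflate the scale and mask themselves. Assumes MAD > 0."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return [0.6745 * (x - med) / mad for x in xs]

def flag_outliers(xs, cutoff=3.5):
    """Conventional cutoff: |M| > 3.5 marks a value as an outlier."""
    return [abs(m) > cutoff for m in modified_zscores(xs)]
```

The 0.6745 factor makes the score comparable to an ordinary Z-score for normal data.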

8. Data from: Methodology to filter out outliers in high spatial density data...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
Available download formats: jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, an anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
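One plausible reading of the local step described in the abstract (an interpretation for illustration, not the authors' published code) is a median/MAD test of each point against its neighbors within a radius:

```python
import math
import statistics

def filter_local_outliers(points, radius, tol=3.0):
    """Loose sketch of a local median filter: a point is excluded when
    its value deviates from the median of its neighbors within `radius`
    by more than `tol` times the neighbors' median absolute deviation.
    `points` is a list of (x, y, value) tuples; `tol` is an assumed
    tuning parameter, not one from the article."""
    kept = []
    for i, (xi, yi, vi) in enumerate(points):
        nbr = [v for j, (x, y, v) in enumerate(points)
               if j != i and math.hypot(x - xi, y - yi) <= radius]
        if not nbr:
            kept.append((xi, yi, vi))  # no neighborhood evidence: keep
            continue
        med = statistics.median(nbr)
        mad = statistics.median(abs(v - med) for v in nbr)
        if abs(vi - med) <= tol * max(mad, 1e-9):
            kept.append((xi, yi, vi))
    return kept
```

Removing such local spikes before interpolation is what reduces the nugget effect the abstract reports.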

9. Find Outliers Percent of households with income below the Federal Poverty...

    • uscssi.hub.arcgis.com
    Updated Dec 5, 2021
    Cite
    Spatial Sciences Institute (2021). Find Outliers Percent of households with income below the Federal Poverty Level [Dataset]. https://uscssi.hub.arcgis.com/maps/USCSSI::find-outliers-percent-of-households-with-income-below-the-federal-poverty-level
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset authored and provided by
    Spatial Sciences Institute
    Area covered
    Description

The following report outlines the workflow used to optimize your Find Outliers result:

Initial Data Assessment. There were 1684 valid input features. POVERTY properties: Min 0.0000, Max 91.8000, Mean 18.9902, Std. Dev. 12.7152. There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis. The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 meters.

Outlier Analysis. Creating the random reference distribution with 499 permutations. There are 1155 output features statistically significant based on an FDR correction for multiple testing and spatial dependence: 68 statistically significant high outlier features, 84 statistically significant low outlier features, 557 features part of statistically significant low clusters, and 446 features part of statistically significant high clusters.

Output. Pink output features are part of a cluster of high POVERTY values. Light blue output features are part of a cluster of low POVERTY values. Red output features represent high outliers within a cluster of low POVERTY values. Blue output features represent low outliers within a cluster of high POVERTY values.

10. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
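The randomized nested-loop idea with pruning can be sketched as follows (a toy version of the scheme the abstract describes; the parameter names and top-n bookkeeping are my own):

```python
import random

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def distance_based_outliers(data, k=3, n_outliers=2, seed=0):
    """Scan examples in random order, scoring each by the distance to
    its k-th nearest neighbor, and keep the n highest-scoring examples.
    Pruning rule: a point's running k-th-NN distance only shrinks as
    more neighbors are seen, so once it drops below the current cutoff
    (the n-th best score so far) the point can be abandoned early."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)          # random order drives the average case
    cutoff = 0.0
    top = []                   # (score, point) pairs, best first
    for p in data:
        nearest = []           # k smallest distances seen so far for p
        pruned = False
        for q in data:
            if q is p:
                continue
            nearest.append(dist(p, q))
            nearest.sort()
            del nearest[k:]
            if len(nearest) == k and nearest[-1] < cutoff:
                pruned = True  # cannot enter the top-n list
                break
        if not pruned:
            top.append((nearest[-1], p))
            top.sort(reverse=True)
            del top[n_outliers:]
            if len(top) == n_outliers:
                cutoff = top[-1][0]
    return [p for _, p in top]
```

Non-outliers sitting in dense regions hit the pruning condition after a handful of comparisons, which is the source of the near-linear average-case behavior the abstract reports.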

  11. Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Mining Distance-Based Outliers in Near Linear Time - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
NASA (http://nasa.gov/)
    Description

Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.

  12. Addressing COVID-19 Outliers in BVARs with Stochastic Volatility

    • clevelandfed.org
    Updated Sep 8, 2021
    Cite
    Federal Reserve Bank of Cleveland (2021). Addressing COVID-19 Outliers in BVARs with Stochastic Volatility [Dataset]. https://www.clevelandfed.org/publications/working-paper/2021/wp-2102r-covid19-outliers-in-bvars-with-stochastic-volatility
    Explore at:
    Dataset updated
    Sep 8, 2021
    Dataset authored and provided by
Federal Reserve Bank of Cleveland (https://www.clevelandfed.org/)
    Description

The COVID-19 pandemic has led to enormous movements in economic data that strongly affect parameters and forecasts obtained from standard VARs. One way to address these issues is to model extreme observations as random shifts in the stochastic volatility (SV) of VAR residuals. Specifically, we propose VAR models with outlier-augmented SV that combine transitory and persistent changes in volatility. The resulting density forecasts for the COVID-19 period are much less sensitive to outliers in the data than standard VARs. Evaluating forecast performance over the last few decades, we find that outlier-augmented SV schemes do at least as well as a conventional SV model. Predictive Bayes factors indicate that our outlier-augmented SV model provides the best data fit for the period since the pandemic’s outbreak, as well as for earlier subsamples of relatively high volatility. This version has been accepted for publication in The Review of Economics and Statistics.

  13. Product Cost Analysis for Out/inler Detection

    • kaggle.com
    zip
    Updated May 10, 2024
    Cite
    Botir (2024). Product Cost Analysis for Out/inler Detection [Dataset]. https://www.kaggle.com/datasets/botir2/product-cost-analysis-for-outinler-detection
    Explore at:
Available download formats: zip (712974 bytes)
    Dataset updated
    May 10, 2024
    Authors
    Botir
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset contains product reports from different companies. We need a reliable way to detect outliers and inliers in the data each company reports regarding their product costs; this will help in identifying any discrepancies in reported prices. We have to find an algorithm that can effectively detect outliers and inliers in these datasets.

1. org_id: A numerical identifier for an organization.
2. year: The year when the data was recorded.
3. month: The month when the data was recorded.
4. product_code: A code that identifies a product.
5. sub_product_code: A sub-code that further identifies specifics of the product.
6. value: A numerical value associated with the product, which could represent quantities, monetary value, or another metric depending on the context.
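As a hypothetical baseline for the stated task (not part of the dataset description), one could flag a reported value as an outlier when it falls outside Tukey's fences for its product group:

```python
import statistics

def iqr_flags(records, k=1.5):
    """Group reported values by product_code and flag a record when its
    value falls outside [Q1 - k*IQR, Q3 + k*IQR] for its group.
    `records` are dicts with at least 'product_code' and 'value' keys
    (mirroring columns 4 and 6 above); k=1.5 is the conventional fence."""
    groups = {}
    for r in records:
        groups.setdefault(r["product_code"], []).append(r["value"])
    bounds = {}
    for code, vals in groups.items():
        q1, _, q3 = statistics.quantiles(vals, n=4)
        iqr = q3 - q1
        bounds[code] = (q1 - k * iqr, q3 + k * iqr)
    return [not (bounds[r["product_code"]][0] <= r["value"]
                 <= bounds[r["product_code"]][1]) for r in records]
```

Grouping by (product_code, sub_product_code) or by month would be natural refinements for this data layout.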

14. cifar10-outlier

    • huggingface.co
    Updated Jul 3, 2023
    + more versions
    Cite
    Renumics (2023). cifar10-outlier [Dataset]. https://huggingface.co/datasets/renumics/cifar10-outlier
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2023
    Dataset authored and provided by
    Renumics
    License

Unknown (https://choosealicense.com/licenses/unknown/)

    Description

    Dataset Card for "cifar10-outlier"

    📚 This dataset is an enriched version of the CIFAR-10 Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.

      Explore the Dataset
    

The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here:

    Full Version (High hardware requirement)… See the full description on the dataset page: https://huggingface.co/datasets/renumics/cifar10-outlier.

15. mnist-outlier

    • huggingface.co
    Updated Jun 16, 2023
    Cite
    Renumics (2023). mnist-outlier [Dataset]. https://huggingface.co/datasets/renumics/mnist-outlier
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 16, 2023
    Dataset authored and provided by
    Renumics
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "mnist-outlier"

    📚 This dataset is an enriched version of the MNIST Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.

      Explore the Dataset
    

The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here: https://huggingface.co/spaces/renumics/mnist-outlier.

Or you can explore it locally:… See the full description on the dataset page: https://huggingface.co/datasets/renumics/mnist-outlier.

  16. DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Oct 13, 2023
    Cite
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
    Explore at:
Available download formats: pdf
    Dataset updated
    Oct 13, 2023
    Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.

  17. Get a Room: ML Hackathon (hackerearth)

    • kaggle.com
    Updated Sep 8, 2022
    Cite
    Jaisingh Chauhan (2022). Get a Room: ML Hackathon (hackerearth) [Dataset]. https://www.kaggle.com/datasets/jaisinghchauhan/get-a-room-ml-hackathon-hackerearth
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2022
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Jaisingh Chauhan
    Description

    ABOUT CHALLENGE

    Problem Statement

    Finding the correct property to live in is a crucial task while moving to a new city/location. An inappropriate property can make our life miserable. Can AI help us find better places?

You are given a relevant dataset about various properties in the USA. Your task is to identify the habitability score of each property.

    Dataset description

    The dataset contains the following files:

train.csv: 39496 x 15
test.csv: 10500 x 14
sample_submission.csv: 5 x 2

The columns provided in the dataset are as follows:

Column | Description
Property_ID | Represents a unique identification of a property
Property_Type | Represents the type of the property (Apartment, Bungalow, etc.)
Property_Area | Represents the area of the property in square feet
Number_of_Windows | Represents the number of windows available in the property
Number_of_Doors | Represents the number of doors available in the property
Furnishing | Represents the furnishing type (Fully Furnished, Semi Furnished, or Unfurnished)
Frequency_of_Powercuts | Represents the average number of power cuts per week
Power_Backup | Represents the availability of power backup
Water_Supply | Represents the availability of water supply (All time; Once in a day - Morning; Once in a day - Evening; Once in two days)
Traffic_Density_Score | Represents the density of traffic on a scale of 1 to 10
Crime_Rate | Represents the crime rate in the neighborhood (Well below average, Slightly below average, Slightly above average, and Well above average)
Dust_and_Noise | Represents the quantity of dust and noise in the neighborhood (High, Medium, Low)
Air_Quality_Index | Represents the Air Quality Index of the neighborhood
Neighborhood_Review | Represents the average ratings given to the neighborhood by the people
Habitability_score | Represents the habitability score of the property

Your task is to build a model that successfully predicts the habitability score of a property.

    Credits:

HackerEarth (Challenge & Datasets): https://www.hackerearth.com/challenges/competitive/get-a-room-ml-hackathon/
Image credits: Image by Zahid Hasan from Pixabay

  18. Change of Primary Care Physician(PCP)

    • kaggle.com
    zip
    Updated Mar 15, 2021
    Cite
    Harsha MS (2021). Change of Primary Care Physician(PCP) [Dataset]. https://www.kaggle.com/harshams07/change-of-primary-care-physician
    Explore at:
Available download formats: zip (38782 bytes)
    Dataset updated
    Mar 15, 2021
    Authors
    Harsha MS
    Description

    Context

An insurance provider (US based) offers health insurance to customers. The provider assigns a PCP (primary care physician) to each customer. The PCP addresses most health concerns of the customers assigned to them. For various reasons, customers want a change of PCP, and each change involves significant effort for the provider. You will find a subset of the insurance provider's data along with PCP changes. The provider would like to understand why members are likely to leave the recommended provider, and, further, to recommend a provider that they are less likely to leave.

    Content

The dataset consists of the following fields:

Id: Column identification field.
OUTCOME: Whether the member changed to his/her preferred primary care provider instead of the auto-assigned one. 0: member keeps the auto-assigned provider. 1: member changed to this provider by calling customer service.
Distance: Distance between member and provider in miles.
Visit_count: Number of claims between member and provider.
Claims_days_away: Days between the member being changed to / assigned to the provider and the latest claim between member and provider.
Tier: Provider tier from service; value 1, 2, 3, or 4. Tier 1 is the highest benefit level and most cost-effective level.
Fqhc: Value 0 or 1 (1: provider is a certified Federally Qualified Health Center).
Pcp_lookback: Value 0 or 1 (1: the provider was the member's primary care provider before).
Family_Assignment: Value 0 or 1 (1: the provider is the PCP of a member in the same family).
Kid: Value 0 or 1 (1: member is a kid, i.e. under 18 for the state of New York).
Is_Ped: Value 0 or 1 (1: provider is a pediatrician).
Same_gender: Value 0 or 1 (1: provider and member are the same gender).
Same_language: Value 0 or 1 (1: provider and member speak the same language).
Same_address: Value 0 or 1 (1: the re-assigned provider has the same address as the provider pre-assigned).

19. Data from: specleanr: An R package for automated flagging of environmental...

    • datadryad.org
    zip
    Updated Nov 4, 2025
    Cite
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt (2025). specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows [Dataset]. http://doi.org/10.5061/dryad.6m905qgd7
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 4, 2025
    Dataset provided by
    Dryad
    Authors
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt
    Time period covered
    Sep 24, 2025
    Description

    specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

    Dataset DOI: 10.5061/dryad.6m905qgd7

    Description of the data and file structure

    1. The files include species occurrences from the Global Biodiversity Information Facility (GBIF). Refer to the data links file to access the original data.
    2. Environmental data were retrieved from CHELSA and Hydrography90m. These files included bio1 to bio19 for CHELSA and cti, order_strahler, slopecurv_dw_cel, accumulation, spi, sti, and subcatchment for Hydrography90m. The data links file has the URLs to connect to the original datasets.
    3. Model outputs are the data outputs packaged after model implementation, including modeloutput and modeloutput2.
    4. The sdm function is implemented in the sdm_function file.
    5. The sdmodeling file processes all files.
    6. Species predictions are archived in the species model prediction output.
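    specleanr's own API is not reproduced here, but the kind of per-species environmental-outlier check such a workflow automates can be sketched with a simple univariate IQR flag. The function, data, and variable name below are illustrative assumptions, not the package's interface:

```python
# Illustrative only: a univariate IQR flag of the kind an environmental-
# outlier workflow applies per species and per variable.
import pandas as pd

def flag_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Synthetic occurrence records: a temperature-like variable extracted at
# occurrence points ("bio1" here is a placeholder name).
occ = pd.DataFrame({
    "species": ["A"] * 6 + ["B"] * 6,
    "bio1":    [10, 11, 10.5, 9.8, 10.2, 30.0,   # 30.0 is an obvious outlier
                5, 5.2, 4.9, 5.1, 5.0, -20.0],   # -20.0 likewise
})
# Flag within each species separately, since species occupy different niches.
occ["flagged"] = occ.groupby("species")["bio1"].transform(flag_iqr)
print(occ[occ["flagged"]])
```

    Flagged records would then be reviewed or dropped before the SDM fitting step rather than discarded blindly.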

    Files and variables

    ...

  20.

    Data from: Pacman profiling: a simple procedure to identify stratigraphic...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jul 8, 2011
    David Lazarus; Manuel Weinkauf; Patrick Diver (2011). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 8, 2011
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Global, Marine
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
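The trimming step described above can be sketched as follows. The 5% fraction and the column names are placeholders: the paper calibrates the fraction against marker species with independently documented ranges.

```python
# Sketch of Pacman-style trimming: for each species, flag a fixed fraction
# of its youngest and oldest occurrences as stratigraphic outliers.
import pandas as pd

def pacman_flag(occ: pd.DataFrame, age_col: str = "age_ma",
                species_col: str = "species", frac: float = 0.05) -> pd.Series:
    """Boolean mask: True where an occurrence falls in a trimmed age tail."""
    grouped = occ.groupby(species_col)[age_col]
    young_cut = grouped.transform(lambda s: s.quantile(frac))
    old_cut = grouped.transform(lambda s: s.quantile(1 - frac))
    return (occ[age_col] < young_cut) | (occ[age_col] > old_cut)

# Synthetic occurrences: one reworked specimen at 35 Ma extends the range.
occ = pd.DataFrame({
    "species": ["X"] * 12,
    "sample": [f"s{i}" for i in range(12)],
    "age_ma": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 35],
})
occ["outlier"] = pacman_flag(occ)
# Outlier frequency per sample is the "profile" that points at problematic
# samples or sections for re-study.
profile = occ.groupby("sample")["outlier"].mean()
print(occ.loc[occ["outlier"], ["sample", "age_ma"]])
```

Samples with unusually high values in `profile` are the candidates for taxonomic, age-model, or reworking problems described in the abstract.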

