17 datasets found
  1. Integrated Building Health Management

    • catalog.data.gov
    • data.nasa.gov
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building’s systems can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to detect and diagnose building health problems throughout the day using data-driven methods. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors, provide feedback on the overall system’s health, and localize a problem to one, or possibly two, components. With this level of feedback, the hope is to identify problems quickly and provide appropriate maintenance while reducing the number of complaints and service calls.

    Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed Building 232 and was therefore chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The analysis covered July 1 through August 10, 2009.

    Methodology: The two algorithms used for analysis were Orca and IMS. Both look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within it. This tactic was applied to each day. After each time sample in a given day was scored, the day's Orca score profile was compared with those of all other days by computing correlations. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a set of normal data to build a model, which can then be applied to test data to assess how anomalous that data is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.

    Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system:

    • 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    • 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT), Calculated Average Return Air Temperature

    IMS identified significantly higher temperatures on these dates than on other days in July and August.

    Conclusion: The proposed algorithms, Orca and IMS, were able to pick up significant anomalies in the building system and to diagnose each anomaly by identifying the sensor values that were anomalous. In the future, these methods can be applied to live streaming data to produce a real-time anomaly score that helps building maintenance detect and diagnose problems.
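
The two-stage workflow in the methodology can be sketched in a few lines. This is only an illustration under stated assumptions: Orca and IMS themselves are not reproduced here. A generic k-nearest-neighbor distance score stands in for both, the data are synthetic, and the names `knn_scores`, `rank_days`, and `make_days` are hypothetical. Taking "typical" days to be those at or above the median correlation is likewise an assumption, not something the description specifies.

```python
import math
import random
from statistics import mean, median

def knn_scores(samples, reference, k, skip_self=False):
    """Mean Euclidean distance from each sample to its k nearest reference rows."""
    scores = []
    for i, s in enumerate(samples):
        dists = sorted(
            math.dist(s, r) for j, r in enumerate(reference)
            if not (skip_self and i == j)
        )
        scores.append(mean(dists[:k]))
    return scores

def pearson(x, y):
    """Pearson correlation of two equally long score profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_days(days, k=5):
    n = len(days)
    # Stage 1 (stand-in for Orca): score each sample within its own day,
    # then correlate each day's score profile with every other day's.
    profiles = [knn_scores(d, d, k, skip_self=True) for d in days]
    mean_corr = [
        mean(pearson(profiles[i], profiles[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    # Assumption: days at or above the median correlation count as typical.
    cutoff = median(mean_corr)
    typical = [i for i in range(n) if mean_corr[i] >= cutoff]
    # Stage 2 (stand-in for IMS): distance of each test sample to the nearest
    # sample from the typical days, averaged into one summary score per day.
    reference = [row for i in typical for row in days[i]]
    summary = {
        i: mean(knn_scores(days[i], reference, 1))
        for i in range(n) if i not in typical
    }
    ranked = sorted(summary, key=summary.get, reverse=True)
    return mean_corr, ranked

def make_days(n_days=8, n_samples=48, bad_day=5, seed=0):
    """Synthetic 4-sensor days; one day gets a short temperature-like excursion."""
    rng = random.Random(seed)
    weights = [1.0, 2.0, 3.0, 4.0]
    days = []
    for d in range(n_days):
        day = []
        for t in range(n_samples):
            level = (t / (n_samples - 1)) ** 2
            shift = 5.0 if (d == bad_day and 20 <= t < 23) else 0.0
            day.append([level * w + shift + rng.gauss(0, 0.01) for w in weights])
        days.append(day)
    return days

mean_corr, ranked = rank_days(make_days())
print("least correlated day:", min(range(len(mean_corr)), key=mean_corr.__getitem__))
print("test days ranked by summary score:", ranked)
```

In this toy setup the day with the injected excursion has the lowest mean correlation in stage 1 and the highest summary score in stage 2, which mirrors the ranking step described above.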

  2. Controlled Anomalies Time Series (CATS) Dataset

    • explore.openaire.eu
    Updated Feb 16, 2023
    Cite
    Patrick Fleith (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646896
    Dataset updated
    Feb 16, 2023
    Authors
    Patrick Fleith
    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking anomaly detection algorithms in multivariate time series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
        • 4 deliberate actuations / control commands sent by a simulated operator or controller, for instance, an operator's commands to turn some equipment ON or OFF.
        • 3 environmental stimuli / external forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
        • 10 telemetry readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
        • 1 million nominal observations (the first 1 million data points). This is suitable for learning the "normal" behaviour.
        • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies, to understand which anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are unable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair. The light being either on or off is nominal, and the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose whether or not to use the available context and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise. While this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise on top of the provided series and choose its amplitude. This makes the dataset well suited to testing how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602

    About Solenix: Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
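
The layout described for CATS (a nominal prefix for training, a labelled remainder for evaluation, and noise added by the user) might be handled along the following lines. This is a scaled-down sketch on a toy series, not the dataset's own tooling: the function names are hypothetical, the half-open `(start, end)` segment convention is an assumption, and the real files and column names are not shown here.

```python
import random

def segments_to_labels(n, segments):
    """Pointwise 0/1 labels from (start, end) index pairs; end is exclusive (assumed)."""
    labels = [0] * n
    for start, end in segments:
        for t in range(start, end):
            labels[t] = 1
    return labels

def split_nominal(series, labels, n_nominal):
    """Use the leading nominal block for training and the rest for evaluation."""
    return series[:n_nominal], series[n_nominal:], labels[n_nominal:]

def add_noise(series, amplitude, seed=0):
    """Add zero-mean Gaussian noise with a chosen standard deviation."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, amplitude) for x in series]

# Toy stand-in: 50 samples instead of 5 million, 10 nominal instead of 1 million.
series = [float(t % 5) for t in range(50)]
labels = segments_to_labels(len(series), [(20, 23), (40, 42)])
train, test, test_labels = split_nominal(series, labels, n_nominal=10)
noisy_test = add_noise(test, amplitude=0.1)
print(len(train), len(test), sum(test_labels))
```

Repeating the evaluation with several `amplitude` values gives the kind of robustness-to-noise comparison the description suggests.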

  3. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • zenodo.org
    • elki-project.github.io
    • +1more
    application/gzip
    Updated May 2, 2024
    Cite
    Erich Schubert; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Available download formats: application/gzip
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    2022
    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
    Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
    In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek
    Evaluation of Multiple Clustering Solutions
    In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
    On Evaluation of Outlier Rankings and Outlier Scores
    In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    | Feature type | Description | Files |
    |---|---|---|
    | Object number | Sparse 1000 dimensional vectors that give the true object assignment | objs.arff.gz |
    | RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz |
    | HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz |
    | Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
    | Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
    | Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz |
    | Basic light | Vectors indicating basic light situations | light.arff.gz |
    | Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz |

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    | Feature type | Description | Files |
    |---|---|---|
    | RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz |
    | RGB Histograms | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz |
    | RGB Histograms | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz |
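
The outlier-subset filenames above appear to encode the histogram dimensionality, the number of objects, and the total number of outliers. A small parser, sketched here as an illustration, recovers those numbers and the implied outlier rate. The pattern is inferred from the listed names only; `parse_outlier_filename` is a hypothetical helper, and the meaning of the `max` field is not stated in the listing, so it is returned without interpretation.

```python
import re

# Pattern inferred from names such as aloi-27d-100000-max10-tot553.csv.gz
PATTERN = re.compile(
    r"^aloi-(?P<dims>\d+)d-(?P<objects>\d+)-max(?P<max>\d+)-tot(?P<outliers>\d+)\.csv\.gz$"
)

def parse_outlier_filename(name):
    """Split an outlier-subset filename into its numeric fields."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an outlier-subset filename: {name}")
    return {key: int(value) for key, value in match.groupdict().items()}

info = parse_outlier_filename("aloi-27d-100000-max10-tot553.csv.gz")
outlier_rate = info["outliers"] / info["objects"]
print(info, f"{outlier_rate:.3%}")
```
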
  4. Integrated Building Health Management - Dataset - NASA Open Data Portal

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Feb 18, 2025
    Cite
    data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Integrated Building Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/integrated-building-health-management
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)

  5. Synthetic temporal dataset for temporal trend analysis and retrieval

    • search-dev-2.test.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated May 10, 2024
    Cite
    Jing Ao; Kara Schatz; Rada Chirkova (2024). Synthetic temporal dataset for temporal trend analysis and retrieval [Dataset]. http://doi.org/10.5061/dryad.q573n5trf
    Dataset updated
    May 10, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jing Ao; Kara Schatz; Rada Chirkova
    Time period covered
    May 7, 2024
    Description

    This repository contains a synthetic temporal data set that the authors generated by sampling values from a Gaussian distribution. The data set contains eight nontemporal dimensions, a temporal dimension, and a numerical measure attribute. It was generated according to the scheme and procedure detailed in this source paper: Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham.

    # Synthetic temporal dataset for temporal trend analysis and retrieval

    https://doi.org/10.5061/dryad.q573n5trf

    The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set, e.g., retrieving outlier temporal trends using an outlier detector.
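
As a rough illustration of how a temporal trend is obtained from such a table, the sketch below fixes values for some nontemporal dimensions and collects the matching (time, measure) pairs. The column names `d1`, `d2`, `t`, and `measure` and the helper `extract_trend` are assumptions for the example; the actual schema of the data set is not given in this description.

```python
def extract_trend(rows, selection, time_key="t", measure_key="measure"):
    """Return time-ordered (time, measure) pairs for rows matching every selected dimension value."""
    matching = [
        row for row in rows
        if all(row[dim] == value for dim, value in selection.items())
    ]
    return sorted((row[time_key], row[measure_key]) for row in matching)

# Toy rows with two nontemporal dimensions (the data set described above has eight).
rows = [
    {"d1": "a", "d2": "x", "t": 2, "measure": 1.5},
    {"d1": "a", "d2": "x", "t": 1, "measure": 1.0},
    {"d1": "a", "d2": "y", "t": 1, "measure": 9.0},
    {"d1": "b", "d2": "x", "t": 1, "measure": 4.0},
]
trend = extract_trend(rows, {"d1": "a", "d2": "x"})
print(trend)
```

Each distinct combination of selected values yields one such trend, and a collection of trends could then be handed to an outlier detector of the reader's choice.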

    We generated the synthetic temporal data set [1], which contains up to 8 nontemporal dimensions, one temporal dimension, and a nume...

  6. TreeShrink: fast and accurate detection of outlier long branches in...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 26, 2023
    Cite
    Siavash Mirarab; Uyen Mai (2023). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees [Dataset]. http://doi.org/10.6076/D1HC71
    Available download formats: zip
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    University of California, San Diego
    Authors
    Siavash Mirarab; Uyen Mai
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.

    Methods: All the raw data are obtained from other publications as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.

    | Dataset | Species | Genes | Download |
    |---|---|---|---|
    | Plants | 104 | 852 | DOI 10.1186/2047-217X-3-17 |
    | Mammals | 37 | 424 | DOI 10.13012/C5BG2KWG |
    | Insects | 144 | 1478 | http://esayyari.github.io/InsectsData |
    | Cannon | 78 | 213 | DOI 10.5061/dryad.493b7 |
    | Rouse | 26 | 393 | DOI 10.5061/dryad.79dq1 |
    | Frogs | 164 | 95 | DOI 10.5061/dryad.12546.2 |
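
The "k-shrink" problem in the description above (remove k leaves so that the tree diameter drops as much as possible) can be illustrated with a brute-force search on a tiny tree. This is explicitly not the authors' polynomial-time algorithm, only an exhaustive sketch for intuition. The edge-list format and the function names are assumptions, and the diameter is taken to be the largest leaf-to-leaf path length.

```python
from collections import defaultdict
from itertools import combinations

def leaf_distances(edges):
    """Path lengths between all pairs of leaves of a weighted tree given as (u, v, length)."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = sorted(node for node in adj if len(adj[node]) == 1)
    dist = {}
    for src in leaves:
        stack = [(src, None, 0.0)]  # depth-first walk away from the source leaf
        while stack:
            node, parent, d = stack.pop()
            if node != src and len(adj[node]) == 1:
                dist[(src, node)] = d
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, d + w))
    return leaves, dist

def diameter(leaves, dist):
    """Largest leaf-to-leaf distance among the given leaves."""
    return max((dist[(a, b)] for a, b in combinations(leaves, 2)), default=0.0)

def best_removal(edges, k):
    """Try every set of k leaves and keep the one whose removal minimizes the diameter."""
    leaves, dist = leaf_distances(edges)
    def remaining_diameter(removed):
        return diameter([leaf for leaf in leaves if leaf not in removed], dist)
    best = min(combinations(leaves, k), key=remaining_diameter)
    return set(best), remaining_diameter(best)

# Toy tree: leaf D sits on a long branch and inflates the diameter.
edges = [("r", "A", 1), ("r", "B", 1), ("r", "x", 1), ("x", "C", 1), ("x", "D", 10)]
removed, new_diameter = best_removal(edges, k=1)
print(removed, new_diameter)
```

Pruning leaves does not change the path length between the leaves that remain, so the search only needs the pairwise leaf distances computed once. The exhaustive loop over leaf subsets is what makes this sketch impractical beyond small trees.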

  7. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • dataverse.no
    • search.dataone.org
    Updated May 31, 2017
    Cite
    Einar Holsbø (2017). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS
    Available download formats: tsv(309098854), txt(3680), tsv(43988), tsv(633), tsv(8212), tsv(271314861), application/x-rlang-transport(269), tsv(6583989), type/x-r-syntax(3194), tsv(198012971), tsv(40), application/x-rlang-transport(955932860)
    Dataset updated
    May 31, 2017
    Dataset provided by
    DataverseNO
    Authors
    Einar Holsbø
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets." (In submission) The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details

  8. ‘Young People Survey’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 27, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Young People Survey’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-young-people-survey-40db/latest
    Dataset updated
    Aug 27, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Young People Survey’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/miroslavsabo/young-people-survey on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Introduction

    In 2013, students of the Statistics class at FSEV UK (https://fses.uniba.sk/en/) were asked to invite their friends to participate in this survey.

    • The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
    • For convenience, the original variable names were shortened in the data file. See the columns.csv file if you want to match the data with the original names.
    • The data contain missing values.
    • The survey was presented to participants in both electronic and written form.
    • The original questionnaire was in the Slovak language and was later translated into English.
    • All participants were of Slovakian nationality, aged 15 to 30.

    The variables can be split into the following groups:

    • Music preferences (19 items)
    • Movie preferences (12 items)
    • Hobbies & interests (32 items)
    • Phobias (10 items)
    • Health habits (3 items)
    • Personality traits, views on life, & opinions (57 items)
    • Spending habits (7 items)
    • Demographics (10 items)

    Research questions

    Many different techniques can be used to answer many questions, e.g.

    • Clustering: Given the music preferences, do people make up any clusters of similar behavior?
    • Hypothesis testing: Do women fear certain phenomena significantly more than men? Do left-handed people have different interests than right-handed people?
    • Predictive modeling: Can we predict spending habits of a person from his/her interests and movie or music preferences?
    • Dimension reduction: Can we describe a large number of human interests by a smaller number of latent concepts?
    • Correlation analysis: Are there any connections between music and movie preferences?
    • Visualization: How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
    • (Multivariate) Outlier detection: A small number of participants often cheat and answer the questions randomly. Can you identify them? Hint: Local outlier factor may help.
    • Missing values analysis: Are there any patterns in missing responses? What is the optimal way of imputing the values in surveys?
    • Recommendations: If some of a user's interests are known, can we predict the others? Or, if we know what a person listens to, can we predict which kinds of movies he/she might like?
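
The outlier-detection hint in the list above refers to the local outlier factor (LOF), which compares each point's local density with that of its nearest neighbors. Below is a minimal pure-Python sketch under simplifying assumptions: each point uses exactly k neighbors (ties broken by index), all points are distinct (duplicate answer vectors would need extra handling), and the data are a toy set of four-item answer vectors on a 1 to 5 scale, not the survey's responses.csv.

```python
import math

def lof_scores(points, k):
    """Local outlier factor per point, using exactly k nearest neighbors each."""
    # For every point: its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(ranked[:k])
    k_distance = [nb[-1][0] for nb in neighbors]

    def local_reachability_density(i):
        # Reachability distance to neighbor j is max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(len(points))]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i])
        for i in range(len(points))
    ]

# Toy data: 25 "consistent" respondents answer paired items identically
# (a, a, b, b); one respondent contradicts themselves on both pairs.
points = [(a, a, b, b) for a in range(1, 6) for b in range(1, 6)]
points.append((1, 5, 5, 1))
scores = lof_scores(points, k=4)
suspect = max(range(len(scores)), key=scores.__getitem__)
print(suspect, round(scores[suspect], 2))
```

Scores near 1 indicate a point about as dense as its neighborhood, while clearly larger values flag candidates for closer inspection. On real survey data, the choice of k and the handling of missing values and repeated answer patterns would need separate thought.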

    Past research

    • (in Slovak) Sleziak, P. - Sabo, M.: Gender differences in the prevalence of specific phobias. Forum Statisticum Slovacum. 2014, Vol. 10, No. 6. [Differences (gender + whether people lived in village/town) in the prevalence of phobias.]

    • Sabo, Miroslav. Multivariate Statistical Methods with Applications. Diss. Slovak University of Technology in Bratislava, 2014. [Clustering of variables (music preferences, movie preferences, phobias) + Clustering of people w.r.t. their interests.]

    Questionnaire

    MUSIC PREFERENCES

    1. I enjoy listening to music.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    2. I prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)
    3. Dance, Disco, Funk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    4. Folk music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    5. Country: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    6. Classical: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    7. Musicals: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    8. Pop: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    9. Rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    10. Metal, Hard rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    11. Punk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    12. Hip hop, Rap: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    13. Reggae, Ska: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    14. Swing, Jazz: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    15. Rock n Roll: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    16. Alternative music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    17. Latin: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    18. Techno, Trance: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    19. Opera: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

    MOVIE PREFERENCES

    1. I really enjoy watching movies.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    2. Horror movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    3. Thriller movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    4. Comedies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    5. Romantic movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    6. Sci-fi movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    7. War movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    8. Tales: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    9. Cartoons: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    10. Documentaries: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    11. Western movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    12. Action movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

    HOBBIES & INTERESTS

    1. History: Not interested 1-2-3-4-5 Very interested (integer)
    2. Psychology: Not interested 1-2-3-4-5 Very interested (integer)
    3. Politics: Not interested 1-2-3-4-5 Very interested (integer)
    4. Mathematics: Not interested 1-2-3-4-5 Very interested (integer)
    5. Physics: Not interested 1-2-3-4-5 Very interested (integer)
    6. Internet: Not interested 1-2-3-4-5 Very interested (integer)
    7. PC Software, Hardware: Not interested 1-2-3-4-5 Very interested (integer)
    8. Economy, Management: Not interested 1-2-3-4-5 Very interested (integer)
    9. Biology: Not interested 1-2-3-4-5 Very interested (integer)
    10. Chemistry: Not interested 1-2-3-4-5 Very interested (integer)
    11. Poetry reading: Not interested 1-2-3-4-5 Very interested (integer)
    12. Geography: Not interested 1-2-3-4-5 Very interested (integer)
    13. Foreign languages: Not interested 1-2-3-4-5 Very interested (integer)
    14. Medicine: Not interested 1-2-3-4-5 Very interested (integer)
    15. Law: Not interested 1-2-3-4-5 Very interested (integer)
    16. Cars: Not interested 1-2-3-4-5 Very interested (integer)
    17. Art: Not interested 1-2-3-4-5 Very interested (integer)
    18. Religion: Not interested 1-2-3-4-5 Very interested (integer)
    19. Outdoor activities: Not interested 1-2-3-4-5 Very interested (integer)
    20. Dancing: Not interested 1-2-3-4-5 Very interested (integer)
    21. Playing musical instruments: Not interested 1-2-3-4-5 Very interested (integer)
    22. Poetry writing: Not interested 1-2-3-4-5 Very interested (integer)
    23. Sport and leisure activities: Not interested 1-2-3-4-5 Very interested (integer)
    24. Sport at competitive level: Not interested 1-2-3-4-5 Very interested (integer)
    25. Gardening: Not interested 1-2-3-4-5 Very interested (integer)
    26. Celebrity lifestyle: Not interested 1-2-3-4-5 Very interested (integer)
    27. Shopping: Not interested 1-2-3-4-5 Very interested (integer)
    28. Science and technology: Not interested 1-2-3-4-5 Very interested (integer)
    29. Theatre: Not interested 1-2-3-4-5 Very interested (integer)
    30. Socializing: Not interested 1-2-3-4-5 Very interested (integer)
    31. Adrenaline sports: Not interested 1-2-3-4-5 Very interested (integer)
    32. Pets: Not interested 1-2-3-4-5 Very interested (integer)

    PHOBIAS

    1. Flying: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    2. Thunder, lightning: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    3. Darkness: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    4. Heights: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    5. Spiders: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    6. Snakes: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    7. Rats, mice: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    8. Ageing: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    9. Dangerous dogs: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
    10. Public speaking: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

    HEALTH HABITS

    1. Smoking habits: Never smoked - Tried smoking - Former smoker - Current smoker (categorical)
    2. Drinking: Never - Social drinker - Drink a lot (categorical)
    3. I live a very healthy lifestyle.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

    PERSONALITY TRAITS, VIEWS ON LIFE & OPINIONS

    1. I take notice of what goes on around me.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    2. I try to do tasks as soon as possible and not leave them until last minute.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    3. I always make a list so I don't forget anything.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    4. I often study or work even in my spare time.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    5. I look at things from all different angles before I go ahead.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    6. I believe that bad people will suffer one day and good people will be rewarded.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    7. I am reliable at work and always complete all tasks given to me.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    8. I always keep my promises.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    9. **I can fall for someone very quickly and then
  9. A

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

    • data.amerigeoss.org
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    pdf
    Updated Jul 19, 2018
    United States (2018). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://data.amerigeoss.org/nl/dataset/df6438ff-cfea-4de8-84ff-c915974a8dfd
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 19, 2018
    Dataset provided by
    United States
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

    KANISHKA BHADURI*, KAMALIKA DAS**, AND PETR VOTAVA***

    Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.

  10. Linear Performance Pricing (LPP) Pricing Dataset

    • kaggle.com
    Updated May 1, 2025
    Shahriar Kabir (2025). Linear Performance Pricing (LPP) Pricing Dataset [Dataset]. https://www.kaggle.com/datasets/shahriarkabir/linear-performance-pricing-lpp-pricing-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shahriar Kabir
    Description

    Linear Performance Pricing (LPP) Pipe Pricing Dataset

    This synthetic dataset simulates supplier quotes for iron pipes of varying lengths. It is designed to demonstrate Linear Performance Pricing (LPP), a procurement analytics technique used to identify cost-saving opportunities by correlating product prices with performance parameters (e.g., pipe length). The dataset includes:

    • Real-world variations: Noise, outliers, and multiple suppliers.
    • Target prices: Calculated using market trends (target_price_market) and best-practice benchmarks (target_price_benchmark).

    Inspired by the example from "Data-Driven Spend Management" (Chapter 3).

    Suggested Analysis Tasks

    1. Regression Analysis: Replicate the market line (P^M = 1.465 + 1.076L) using linear regression.
    2. Outlier Detection: Identify overpriced quotes using Z-scores or IQR.
    3. Savings Calculation: Compute total savings if prices are negotiated down to the benchmark.
    4. Supplier Comparison: Analyze pricing strategies of S1 vs S2 vs S3 vs S4.
    5. Visualization: Plot price vs. length with market/benchmark lines.
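
    Tasks 1 to 3 can be sketched roughly as follows. This is only an illustration: the quotes are invented (noise-free points on the market line quoted in task 1, plus one injected overpriced quote), and the function names `fit_market_line`, `iqr_outliers` and `savings_to_benchmark` are hypothetical helpers, not part of the dataset or of any library.

```python
import numpy as np

def fit_market_line(length, price):
    """Ordinary least-squares fit of price = intercept + slope * length."""
    slope, intercept = np.polyfit(length, price, deg=1)
    return float(intercept), float(slope)

def iqr_outliers(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def savings_to_benchmark(price, benchmark_price):
    """Total savings if every quote above its benchmark is negotiated down to it."""
    gap = np.asarray(price, dtype=float) - np.asarray(benchmark_price, dtype=float)
    return float(np.clip(gap, 0, None).sum())

# Invented quotes (assumption): points on the cited market line, one overpriced.
length = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
price = 1.465 + 1.076 * length
price[3] += 5.0  # inject an overpriced quote at length 4

intercept, slope = fit_market_line(length, price)
residuals = price - (intercept + slope * length)
flagged = iqr_outliers(residuals)  # True where a quote sits far above the fitted line
```

    Flagging outliers on the regression residuals rather than on raw prices is a design choice made here so that long (and therefore legitimately expensive) pipes are not mistaken for overpriced ones; a Z-score rule could be substituted for the IQR rule in the same place.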

    Support This Dataset 🚀

    Help this dataset reach more learners and practitioners in procurement analytics! If you find this dataset useful, consider:

    **Upvoting** this dataset on Kaggle – it boosts visibility and helps others discover it.

    Your support keeps datasets like this free and open-source for the community! 🌟

  11. d

    Data from: NPP Multi-Biome: Global Primary Production Data Initiative...

    • datasets.ai
    • data.nasa.gov
    • +5more
    21, 33, 34
    Updated Sep 9, 2024
    National Aeronautics and Space Administration (2024). NPP Multi-Biome: Global Primary Production Data Initiative Products, R2 [Dataset]. https://datasets.ai/datasets/npp-multi-biome-global-primary-production-data-initiative-products-r2-ddf60
    Explore at:
    Available download formats: 21, 33, 34
    Dataset updated
    Sep 9, 2024
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    Net primary productivity (NPP) estimates were compiled by the Global Primary Production Data Initiative (GPPDI). The database covers 2,523 individual sites and 5,164 half-degree grid cells and underwent extensive review under the Ecosystem Model-Data Intercomparison (EMDI) process. The GPPDI database includes NPP measurements that were collected over a long time period by many investigators using a variety of methods. The measurements are categorized as either Class A, from intensively studied sites; Class B, from extensive sites; or reported as Class C, 0.5 latitude-longitude grid cells. The data set contains six comma-separated files (.csv format). There are two files for each class. One file for each class contains site locations, elevation, NPP estimates, climate data, biome and dominant species information, and references. The other file for each class contains model validation outlier flags derived from site-specific reviews. This document and a companion file (Olson et al., 2001) describe the compilation of NPP estimates under the GPPDI. The results of the EMDI review and outlier analysis produced a refined set of NPP estimates and model driver data (the EMDI database; Olson et al., 2001; 2013). Another ORNL DAAC data set (Zheng et al., 2013) contributed to the compilation of GPPDI. Revision Notes: This data set has been revised to correct previously reported ANPP, BNPP, and TNPP estimates for three OTTER Transect sites, USA, in the Class A NPP data file and BNPP, and TNPP estimates for Vindhyan, India, in the Class B NPP data file. Please see the Data Set Revisions section of this document for detailed information.

  12. a

    Hypertension (in persons of all ages): England

    • hub.arcgis.com
    • data.catchmentbasedapproach.org
    Updated Apr 7, 2021
    The Rivers Trust (2021). Hypertension (in persons of all ages): England [Dataset]. https://hub.arcgis.com/maps/theriverstrust::hypertension-in-persons-of-all-ages-england
    Explore at:
    Dataset updated
    Apr 7, 2021
    Dataset authored and provided by
    The Rivers Trust
    Description

    SUMMARY

    This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of hypertension (in persons of all ages). Please read the below information to gain a full understanding of what the data shows and how it should be interpreted.

    ANALYSIS METHODOLOGY

    The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to hypertension (in persons of all ages).

    This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.

    The percentage of each MSOA’s population (all ages) with hypertension was estimated. This was achieved by calculating a weighted average based on:

    • The percentage of the MSOA area that was covered by each GP practice’s catchment area
    • Of the GPs that covered part of that MSOA: the percentage of registered patients that have that illness

    The estimated percentage of each MSOA’s population with hypertension was then combined with Office for National Statistics Mid-Year Population Estimates (2019) data for MSOAs, to estimate the number of people in each MSOA with hypertension, within the relevant age range.

    Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:

    A) the PERCENTAGE of the population within that MSOA who are estimated to have hypertension
    B) the NUMBER of people within that MSOA who are estimated to have hypertension

    An average of scores A & B was taken, and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer to 1 the score, the greater both the number and percentage of the population in the MSOA that are estimated to have hypertension, compared to other MSOAs. In other words, those are areas where it’s estimated a large number of people suffer from hypertension, and where those people make up a large percentage of the population, indicating there is a real issue with hypertension within the population and the investment of resources to address that issue could have the greatest benefits.

    LIMITATIONS

    1. GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset, to determine areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).

    2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.

    3. It was not feasible to incorporate ultra-fine-scale geographic distribution of populations that are registered with each GP practice or who live within each MSOA. Populations might be concentrated in certain areas of a GP practice’s catchment area or MSOA and relatively sparse in other areas. Therefore, the dataset should be used to identify general areas where there are high levels of hypertension, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of hypertension.

    TO BE VIEWED IN COMBINATION WITH:

    This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:

    • Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
    • Levels of obesity, inactivity and associated illnesses (England): Missing data

    DOWNLOADING THIS DATA

    To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    DATA SOURCES

    This dataset was produced using:

    • Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    • GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.

    COPYRIGHT NOTICE

    The reproduction of this data must be accompanied by the following statement:

    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.

    CaBA HEALTH & WELLBEING EVIDENCE BASE

    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
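
    The weighted average and the relative scoring described in the methodology for this dataset could be sketched as below. The description does not say how values were converted to the 0 to 1 range, so min-max rescaling is an assumption here, and the function names (`msoa_prevalence`, `relative_score`, `combined_score`) are hypothetical.

```python
import numpy as np

def msoa_prevalence(coverage_share, gp_prevalence):
    """Coverage-weighted average of GP-level prevalence (%) for one MSOA.

    coverage_share: share of the MSOA area covered by each GP catchment.
    gp_prevalence:  % of each GP's registered patients with the illness.
    """
    w = np.asarray(coverage_share, dtype=float)
    p = np.asarray(gp_prevalence, dtype=float)
    return float((w * p).sum() / w.sum())

def relative_score(values):
    """Min-max rescale to [0, 1] (assumed method): 1 = highest (worst), 0 = lowest (best)."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros_like(v)
    return (v - v.min()) / span

def combined_score(percentages, counts):
    """Average score A (percentage) and score B (number), then rescale once more."""
    a = relative_score(percentages)
    b = relative_score(counts)
    return relative_score((a + b) / 2)
```

    For example, an MSOA covered 60% by a practice with 10% prevalence and 40% by a practice with 20% prevalence would be estimated at 14% under this weighting.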

  13. n

    Data from: Subtle limits to connectivity revealed by outlier loci within two...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Feb 28, 2022
    Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme (2022). Subtle limits to connectivity revealed by outlier loci within two divergent metapopulations of the deep-sea hydrothermal gastropod Ifremeria nautilei [Dataset]. http://doi.org/10.5061/dryad.ffbg79cwq
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Institute of Evolutionary Science of Montpellier
    Ifremer
    Sorbonne Université
    Genoscope
    Authors
    Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Hydrothermal vents form archipelagos of ephemeral deep-sea habitats that raise interesting questions about the evolution and dynamics of the associated endemic fauna, constantly subject to extinction-recolonization processes. These metal-rich environments are coveted for the mineral resources they harbor, thus raising recent conservation concerns. The evolutionary fate and demographic resilience of hydrothermal species strongly depend on the degree of connectivity among and within their fragmented metapopulations. In the deep sea, however, assessing connectivity is difficult and usually requires indirect genetic approaches. Improved detection of fine-scale genetic connectivity is now possible based on genome-wide screening for genetic differentiation. Here, we explored population connectivity in the hydrothermal vent snail Ifremeria nautilei across its species range encompassing five distinct back-arc basins in the Southwest Pacific. The global analysis, based on 10 570 single nucleotide polymorphism (SNP) markers derived from double digest restriction-site associated DNA sequencing (ddRAD-seq), depicted two semi-isolated and homogeneous genetic clusters. Demo-genetic modeling suggests that these two groups began to diverge about 70 000 generations ago, but continue to exhibit weak and slightly asymmetrical gene flow. Furthermore, a careful analysis of outlier loci showed subtle limitations to connectivity between neighboring basins within both groups. This finding indicates that migration is not strong enough to totally counterbalance drift or local selection, hence questioning the potential for demographic resilience at this latter geographical scale. These results illustrate the potential of large genomic datasets to understand fine-scale connectivity patterns in hydrothermal vents and the deep sea. 
    Methods: VCF datasets were generated “de novo” with Stacks V.2.52 from reads produced by the protocols used and provided in the manuscript. Sample-associated metadata were collected during field sampling.

  14. f

    Cut-off points for NW under different concentration parameters and sample...

    • plos.figshare.com
    xls
    Updated Jun 12, 2023
    Sümeyra Sert; Filiz Kardiyen (2023). Cut-off points for NW under different concentration parameters and sample sizes, q = 0.95. [Dataset]. http://doi.org/10.1371/journal.pone.0286448.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Sümeyra Sert; Filiz Kardiyen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cut-off points for NW under different concentration parameters and sample sizes, q = 0.95.

  15. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    Updated May 7, 2025
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
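
    The archive's own scripts (for example ESM_6.py) are not reproduced here. As a rough, independent illustration of the Mahalanobis distance step, the sketch below measures how far each row of a standardized ratio matrix lies from the column means. The function name `mahalanobis_distances` and the use of a pseudo-inverse are assumptions, not the authors' code.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the vector of column means.

    X: 2-D array, one row per observation (e.g. a firm-year), one column per ratio.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))  # sample covariance of the columns
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(squared, 0.0))  # guard against tiny negative round-off
```

    Rows with unusually large distances would be the candidates for outlier review; the cut-off used in the manuscript is not stated in this description, so none is applied here.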

  16. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module of to select the households for official interview of the VHFPS survey and the small module households as reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections

    • Section 2. Behavior
    • Section 3. Health
    • Section 5. Employment (main respondent)
    • Section 6. Coping
    • Section 7. Safety Nets
    • Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process include available interviewers’ note following each question item, interviewers’ note at the end of the tablet form as well as supervisors’ note during monitoring. The data cleaning process was conducted in following steps:

    • Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables which were automatically calculated by SurveyCTO.
    • Remove household duplicates in the dataset where the same form is submitted more than once.
    • Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
    • Format variables as their object type (string, integer, decimal, etc.).
    • Read through interviewers’ note and make adjustment accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down respondents’ answer in detail so that the survey management team will justify and make a decision which code is best suitable for such answer.
    • Correct data based on supervisors’ note where enumerators entered wrong code.
    • Recode answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write texts to specify the answer. The data cleaning team checked thoroughly this type of answers to decide whether each answer needed recoding into one of the available categories or just keep the answer originally recorded. In some cases, that answer could be assigned a completely new code if it appeared many times in the survey dataset.
    • Examine data accuracy of outlier values, defined as values that lie outside both 5th and 95th percentiles, by listening to interview recordings.
    • Final check on matching main dataset with different sections, where information is asked on individual level, are kept in separate data files and in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
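
    The percentile rule in the cleaning steps can be illustrated with a short sketch. It reads "outside both 5th and 95th percentiles" as "below the 5th or above the 95th percentile", which is an interpretation; the function name `percentile_outliers` is hypothetical and the survey team's actual tooling is not described.

```python
import numpy as np

def percentile_outliers(values, lower=5, upper=95):
    """Boolean mask of values below the lower or above the upper percentile.

    Missing values (NaN) are ignored when computing the cut-offs and are never flagged.
    """
    v = np.asarray(values, dtype=float)
    lo, hi = np.nanpercentile(v, [lower, upper])
    return (v < lo) | (v > hi)
```

    In the workflow described above, flagged values are not dropped automatically: they mark the interviews whose recordings should be listened to.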

  17. d

    NZ Height Conversion Index - Dataset - data.govt.nz - discover and use data

    • catalogue.data.govt.nz
    Updated Sep 30, 2020
    + more versions
    (2020). NZ Height Conversion Index - Dataset - data.govt.nz - discover and use data [Dataset]. https://catalogue.data.govt.nz/dataset/nz-height-conversion-index1
    Explore at:
    Dataset updated
    Sep 30, 2020
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New Zealand
    Description

    This index enables users to identify the extent of the relationship grids provided on LDS, which are used to convert heights provided in terms of one of 13 historic local vertical datums to NZVD2016. The polygons comprising the index show the extent of the conversion grids. Users can view the following polygon attributes:

    • Shape_VDR: Vertical Datum Relationship grid area
    • LVD: Local Vertical Datum
    • Control: Number of control marks used to compute the relationship grid
    • Mean: Mean vertical datum relationship value at control points
    • Std: Standard deviation of vertical datum relationship value at control points
    • Min: Minimum vertical datum relationship value at control points
    • Max: Maximum vertical datum relationship value at control points
    • Range: Range of vertical datum relationship value at control points
    • Ref: Reference control mark for the local vertical datum
    • Ref_value: Vertical datum relationship value at the reference mark
    • Grid: Formal grid id

    Users should note that the values represented in this dataset have been calculated with the outliers excluded. These same outliers were excluded during the computation of the relationship grids, but were included when calculating the 95% confidence intervals.

    More information on converting heights between vertical datums can be found on the LINZ website.


Dashlink (2023). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management

Integrated Building Health Management

Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Dashlink
Description

Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building’s system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building’s health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system’s health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.

Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating & cooling temperatures for the air and water systems as well as other various subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was focused from July 1st through August 10th 2009.

Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to assess how anomalous the particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were then ranked by the algorithm.

Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.

- 7/03/09 (Figure 2 & 3): AHU-1 Return Air Temperature (RAT); Calculated Average Return Air Temperature
- 7/19/09 (Figure 3 & 4): AHU-2 Return Air Temperature (RAT); Calculated Average Return Air Temperature

IMS identified significantly higher temperatures compared to other days during the months of July and August.

Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real-time anomaly score to help building maintenance with detection and diagnosis of problems.
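
The day-by-day comparison in the methodology can be illustrated with a small sketch. This is not the Orca or IMS implementation: a plain k-nearest-neighbour distance stands in for the distance-based score, and the function names `knn_scores` and `mean_correlation_with_other_days` are hypothetical. It assumes every day has the same number of time samples so that score profiles can be correlated.

```python
import numpy as np

def knn_scores(day, k=3):
    """Score each time sample by its mean distance to its k nearest neighbours in the same day.

    day: 2-D array, one row per time sample, one column per sensor.
    """
    day = np.asarray(day, dtype=float)
    diff = day[:, None, :] - day[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)

def mean_correlation_with_other_days(profiles):
    """Average Pearson correlation of each day's score profile with every other day.

    profiles: 2-D array, one row per day, one column per time sample.
    Low values mark days whose score profile looks unlike the rest.
    """
    corr = np.corrcoef(np.asarray(profiles, dtype=float))
    n = corr.shape[0]
    return (corr.sum(axis=1) - np.diag(corr)) / (n - 1)
```

Days with a high mean correlation would serve as the "typical" reference set, and the remaining days would then be scored against a model built from that set, as the text describes for IMS.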
