29 datasets found
  1. Data from: A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models

    • tandf.figshare.com
    • figshare.com
    txt
    Updated Feb 9, 2024
    Cite
    Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.

  2. Data_Sheet_1_The hazards of dealing with response time outliers.pdf

    • frontiersin.figshare.com
    pdf
    Updated Aug 24, 2023
    Cite
    Ivan I. Vankov (2023). Data_Sheet_1_The hazards of dealing with response time outliers.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2023.1220281.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    Frontiers
    Authors
    Ivan I. Vankov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The presence of outliers in response times can affect statistical analyses and lead to incorrect interpretation of the outcome of a study. Therefore, it is a widely accepted practice to try to minimize the effect of outliers by preprocessing the raw data. There exist numerous methods for handling outliers and researchers are free to choose among them. In this article, we use computer simulations to show that serious problems arise from this flexibility. Choosing between alternative ways for handling outliers can result in the inflation of p-values and the distortion of confidence intervals and measures of effect size. Using Bayesian parameter estimation and probability distributions with heavier tails eliminates the need to deal with response time outliers, but at the expense of opening another source of flexibility.

  3. Data from: Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas

    • dataverse.harvard.edu
    • osti.gov
    Updated Jun 2, 2021
    Cite
    Kube, R.; Bianchi, F.M.; Brunner, D.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kube, R.; Bianchi, F.M.; Brunner, D.; LaBombard, B.
    License

    Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/SKEHRJ

    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
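    A minimal PyTorch sketch of the approach described: train an autoencoder to reconstruct valid samples only, then classify ambiguous samples in the learned latent space with a standard vectorial classifier. The layer sizes and the k-nearest-neighbours classifier are illustrative choices, not the authors' exact setup.

```python
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

class AE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_ae(ae, valid_x, epochs=200, lr=1e-3):
    # Reconstruction training on valid samples only, so the latent
    # coordinates capture the features that characterize valid data.
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(valid_x), valid_x)
        loss.backward()
        opt.step()
    return ae

def classify_ambiguous(ae, labelled_x, labels, ambiguous_x):
    # Embed labelled and ambiguous samples, then classify in latent space.
    with torch.no_grad():
        z_train = ae.encoder(labelled_x).numpy()
        z_query = ae.encoder(ambiguous_x).numpy()
    clf = KNeighborsClassifier(n_neighbors=5).fit(z_train, labels)
    return clf.predict(z_query)
```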

  4. Data from: Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Apr 2, 2025
    Cite
    David Lazarus; Manuel Weinkauf; Patrick Diver (2025). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Explore at:
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    Time period covered
    Jan 1, 2011
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compil...

  5. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for official interview in the VHFPS survey, with the small-module households held in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 5. Employment (main respondent)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers' notes following each question item, interviewers' notes at the end of the tablet form, and supervisors' notes taken during monitoring. The data cleaning process was conducted in the following steps:
    • Append households interviewed in ethnic minority languages to the main dataset of interviews conducted in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations of households that were not supposed to be interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers' notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent's answer in detail, so that the survey management team could decide which code best fit the answer.
    • Correct data based on supervisors' notes where enumerators entered a wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line allowing enumerators to record the answer verbatim. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or should be kept as originally recorded. In some cases an answer was assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values that lie outside the 5th-95th percentile range, by listening to interview recordings (see the sketch below).
    • Final check on matching the main dataset with the different sections where information is collected at the individual level; these are kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
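    A minimal pandas sketch of the percentile-based outlier check above: flag values outside the 5th-95th percentile range for manual review (here, by listening to the interview recordings). The dataframe and column names are hypothetical.

```python
import pandas as pd

def flag_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # Values below the 5th or above the 95th percentile are flagged for review.
    p05, p95 = df[col].quantile([0.05, 0.95])
    return df[(df[col] < p05) | (df[col] > p95)]

# Example (hypothetical column): households whose reported value needs a
# recording check.
# to_review = flag_outliers(households, "income")
```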

  6. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection); see the sketch after this list.
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that cannot detect even these obvious anomalies). However, during our initial experiments the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type and amplitude of noise on top of the provided series. This makes it well suited to testing how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
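    As a rough illustration of the structure above, a minimal pandas sketch of the nominal/evaluation split (first 1 million timestamps nominal-only, remaining 4 million mixed). The file name and format here are assumptions; see the Zenodo record for the actual download.

```python
import pandas as pd

# Hypothetical file name; the dataset is distributed in a binary format on Zenodo.
df = pd.read_parquet("cats.parquet")     # 5M rows: 17 variables (+ labels)

train = df.iloc[:1_000_000]              # nominal-only segment ("normal" behaviour)
test = df.iloc[1_000_000:]               # nominal + 200 anomalous segments

# Semi-supervised (novelty detection): fit on `train`, score `test`.
# Unsupervised (outlier detection): fit directly on `test`.
```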

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  7. Euler number calculation with spots

    • scidb.cn
    Updated May 19, 2025
    Cite
    Yu Zhang (2025). Euler number calculation with spots [Dataset]. http://doi.org/10.57760/sciencedb.25091
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yu Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after noise removal, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal can be performed in ImageJ via Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.
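    For readers reproducing this step outside ImageJ, below is a minimal NumPy/SciPy sketch of a median-based remove-outliers filter in the spirit of Process > Noise > Remove Outliers. It replaces any pixel deviating from its local median by more than the threshold; note that ImageJ applies the replacement to bright or dark outliers separately, while this symmetric variant handles both at once.

```python
import numpy as np
from scipy import ndimage

def remove_outliers(img, radius=5, threshold=0.50):
    # Circular neighbourhood of the given radius, as in ImageJ.
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    footprint = x**2 + y**2 <= radius**2
    # Local median around each pixel.
    med = ndimage.median_filter(img, footprint=footprint)
    # Replace pixels that deviate from the local median by more than the threshold.
    out = img.copy()
    mask = np.abs(img.astype(float) - med.astype(float)) > threshold
    out[mask] = med[mask]
    return out
```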

  8. Local redundancy (ri), standard deviation of the least-squares (LS)-estimated outlier and the maximum absolute correlation for each scenario of hard constraint

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    + more versions
    Cite
    Vinicius Francisco Rofatto; Marcelo Tomio Matsuoka; Ivandro Klein; Maurício Roberto Veronez; Luiz Gonzaga da Silveira Junior (2023). Local redundancy (ri), standard deviation of the least-squares (LS)-estimated outlier and the maximum absolute correlation for each scenario of hard constraint. [Dataset]. http://doi.org/10.1371/journal.pone.0238145.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Vinicius Francisco Rofatto; Marcelo Tomio Matsuoka; Ivandro Klein; Maurício Roberto Veronez; Luiz Gonzaga da Silveira Junior
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Local redundancy (ri), standard deviation of the least-squares (LS)-estimated outlier and the maximum absolute correlation for each scenario of hard constraint.

  9. Vehicle insurance data

    • kaggle.com
    Updated Jun 17, 2020
    Cite
    Himanshu Bhatt (2020). Vehicle insurance data [Dataset]. https://www.kaggle.com/junglisher/vehicle-insurance-data/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 17, 2020
    Dataset provided by
    Kaggle
    Authors
    Himanshu Bhatt
    Description

    Vehicle-insurance

    Vehicle Insurance data: This dataset contains multiple features according to the customer’s vehicle and insurance type.

    OBJECTIVE: The business requirement is to increase customer lifetime value (CLV), so CLV is the target variable.

    Data Cleansing:

    This dataset is already fairly clean, though it contains a few outliers, which should be removed.

    Why remove outliers? Outliers are unusual values in a dataset; they can distort statistical analyses and violate their assumptions.

    Feature selection:

    This step is required to remove unwanted features.

    VIF and Correlation Coefficient can be used to find important features.

    VIF (Variance Inflation Factor): a measure of collinearity among predictor variables in a multiple regression. It is the ratio of the variance of a coefficient estimated in the full model to the variance of that coefficient if the predictor were fit alone; equivalently, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i on the remaining predictors.
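    A minimal statsmodels sketch for computing per-feature VIFs as part of this feature-selection step; the dataframe and its columns are hypothetical.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X: pd.DataFrame) -> pd.Series:
    # Add an intercept so each auxiliary regression is properly specified.
    Xc = add_constant(X)
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Features with VIF well above ~5-10 are common candidates for removal.
```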

    Correlation coefficient: a positive Pearson coefficient means that one variable increases as the other increases, while a negative coefficient means that one variable decreases as the other increases. Coefficients of -1 or +1 mean the relationship is exactly linear.

    Log transformation and Normalisation: Many ML algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.

    Different ML algorithms are applied to the dataset for prediction; their accuracies are reported in the notebook.

    Please see my work; I am open to suggestions.

  10. Dataset of "Consistency of pacing profile according to performance level in three different editions of the Chicago, London, and Tokyo marathons"

    • data.mendeley.com
    Updated Jun 16, 2022
    Cite
    Jose Ignacio Priego-Quesada (2022). Dataset of "Consistency of pacing profile according to performance level in three different editions of the Chicago, London, and Tokyo marathons" [Dataset]. http://doi.org/10.17632/xvfvk2zvhw.1
    Explore at:
    Dataset updated
    Jun 16, 2022
    Authors
    Jose Ignacio Priego-Quesada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tokyo, London
    Description

    Dataset with the data of the manuscript "Consistency of pacing profile according to performance level in three different editions of the Chicago, London, and Tokyo marathons", published in Scientific Reports (DOI: 10.1038/s41598-022-14868-6). The dataset reflects the data after pre-processing (removing outliers, calculating the analysis variables, etc.).

  11. Data from: Robust Global Translations with 1DSfM

    • academictorrents.com
    bittorrent
    Updated Jun 18, 2019
    + more versions
    Cite
    Kyle Wilson and Noah Snavely (2019). Robust Global Translations with 1DSfM [Dataset]. https://academictorrents.com/details/9fba8b6d6323a8eb66b0fee0886f134a16625eef
    Explore at:
    Available download formats: bittorrent
    Dataset updated
    Jun 18, 2019
    Dataset authored and provided by
    Kyle Wilson and Noah Snavely
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    We present a simple, effective method for solving structure from motion problems by averaging epipolar geometries. Based on recent successes in solving for global camera rotations using averaging schemes, we focus on the problem of solving for 3D camera translations given a network of noisy pairwise camera translation directions (or 3D point observations). To do this well, we have two main insights. First, we propose a method for removing outliers from problem instances by solving simpler low-dimensional subproblems, which we refer to as 1DSfM problems. Second, we present a simple, principled averaging scheme. We demonstrate this new method in the wild on Internet photo collections.

  12. Filtered-StarCoder-Dataset-Mini

    • huggingface.co
    Updated May 28, 2025
    Cite
    Jugal Gajjar (2025). Filtered-StarCoder-Dataset-Mini [Dataset]. https://huggingface.co/datasets/jugalgajjar/Filtered-StarCoder-Dataset-Mini
    Explore at:
    Dataset updated
    May 28, 2025
    Authors
    Jugal Gajjar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Filtered StarCoder Dataset Mini

    Dataset Description

    This dataset contains filtered and processed code samples from 10 popular programming languages: C, C++, C#, Go, Java, JavaScript, Python, Ruby, Scala, and TypeScript. The dataset was created by filtering source code based on quality metrics, removing outliers, and standardizing the format for machine learning and code analysis applications.

    Key Features

    Cleaned and Filtered Code: Samples have been processed… See the full description on the dataset page: https://huggingface.co/datasets/jugalgajjar/Filtered-StarCoder-Dataset-Mini.

  13. Chiller Energy Data

    • kaggle.com
    Updated Aug 6, 2021
    Cite
    Chiller_Energy (2021). Chiller Energy Data [Dataset]. https://www.kaggle.com/datasets/chillerenergy/chiller-energy-data/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chiller_Energy
    Description

    Context

    Input features of the data set include Timestamp, Chilled Water Rate (L/sec), Cooling Water Temperature (C), Building Load (RT), Total Energy (kWh), Temperature (F), Dew Point (F), Humidity (%), Wind Speed (mph), Pressure (in), Hour of Day (h) and Day of Week. The training and validation data sets contain data from a commercial building located in Singapore, from 18/08/2019 00:00 to 01/06/2020 13:00, which was refined to 13,561 data samples after removing outliers and missing values.


  14. Machine learning pipeline to train toxicity prediction model of FunTox-Networks

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
    - chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to calculate the statistics used to compute Z-scores for the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # computes the learning data from the activity data, molecular descriptors and Z-statistics

    05_Regression # Regression step: data preparation by removing outliers based on a linear regression model; learning of random forest regression models; validation of the learning process by cross-validation and hyperparameter tuning.

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output
    # early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
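    A minimal Python sketch of the 05_Regression stage (outlier removal via linear-model residuals, then a cross-validated random forest). The actual pipeline uses the R scripts listed above (data_preperation.R, Rforest_CV.R, Rforest.R); the residual cut-off and hyperparameter values here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def fit_with_outlier_removal(X, y, resid_sigma=3.0):
    # Drop points whose linear-regression residual exceeds resid_sigma
    # standard deviations (the outlier-removal idea described above).
    resid = y - LinearRegression().fit(X, y).predict(X)
    keep = np.abs(resid) < resid_sigma * resid.std()

    # Random forest on the cleaned data; n_estimators and max_features stand
    # in for the hyperparameters tuned in Rforest_CV.R (trees, variable split).
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
    scores = cross_val_score(rf, X[keep], y[keep], cv=5, scoring="r2")
    return rf.fit(X[keep], y[keep]), scores
```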

  15. cross-phone calibration dataset

    • ieee-dataport.org
    Updated Jan 5, 2024
    Cite
    Sheng Zeng (2024). cross-phone calibration dataset [Dataset]. https://ieee-dataport.org/documents/cross-phone-calibration-dataset
    Explore at:
    Dataset updated
    Jan 5, 2024
    Authors
    Sheng Zeng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two data sets are provided: one with the raw data after removing outliers, and the other with the preprocessed feature data set. See the Readme file in the folder for details.

  16. Data from: Weight, Temperature and Humidity Sensor Data of Honey Bee Colonies in Germany, 2019 - 2022

    • zenodo.org
    Updated Sep 30, 2023
    + more versions
    Cite
    Diren Senger; Diren Senger; Clemens Gruber; Thorsten Kluss; Carolin Johannsen; Clemens Gruber; Thorsten Kluss; Carolin Johannsen (2023). Weight, Temperature and Humidity Sensor Data of Honey Bee Colonies in Germany, 2019 - 2022 [Dataset]. http://doi.org/10.5281/zenodo.8389138
    Explore at:
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diren Senger; Diren Senger; Clemens Gruber; Thorsten Kluss; Carolin Johannsen; Clemens Gruber; Thorsten Kluss; Carolin Johannsen
    Description

    This dataset accompanies our data paper. We will provide a link to the paper once accepted.

    bob_publication_data.zip
    Data at 1-minute, 1-hour and 1-day intervals, in processed and unprocessed versions, as described in the data paper.

    bob_raw_data.zip
    Raw data with original measurement interval, mostly 5 or 10 seconds.

    bob_code_publication.zip
    Code we used to prepare the data. We anonymised sections with personal keys and passwords.

    We present sensor data from 78 honey bee colonies in Germany collected as part of a citizen science project. Each honey bee hive was equipped with five temperature sensors within the hive, one temperature sensor for outside measurements, a combined sensor for temperature, ambient air pressure and humidity, and a scale to measure the weight. During the data acquisition period, beekeepers used a web app to report their observations and beekeeping activities.
    We provide the raw data with a measurement interval of up to 5 seconds as well as aggregated data, with minutely, hourly or daily average values. Furthermore, we performed several preprocessing steps, removing outliers with a threshold-based approach, excluding changes in weight that were induced by beekeeping activities and combining the sensor data with the most important meta-data from the beekeepers' observations. The data is organised in directories based on the year of recording. Alternatively, we provide subsets of the data structured based on the occurrence or non-occurrence of a swarming event or the death of a colony.
    The data can be analysed using methods from time series analysis, time series classification or other data science approaches to form a better understanding of specifics in the development of honey bee colonies.

  17. Data from: Toward Chemical Accuracy in Predicting Enthalpies of Formation with General-Purpose Data-Driven Methods

    • acs.figshare.com
    xlsx
    Updated Jun 15, 2023
    Cite
    Peikun Zheng; Wudi Yang; Wei Wu; Olexandr Isayev; Pavlo O. Dral (2023). Toward Chemical Accuracy in Predicting Enthalpies of Formation with General-Purpose Data-Driven Methods [Dataset]. http://doi.org/10.1021/acs.jpclett.2c00734.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    ACS Publications
    Authors
    Peikun Zheng; Wudi Yang; Wei Wu; Olexandr Isayev; Pavlo O. Dral
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Enthalpies of formation and reaction are important thermodynamic properties that have a crucial impact on the outcome of chemical transformations. Here we implement the calculation of enthalpies of formation with a general-purpose ANI‑1ccx neural network atomistic potential. We demonstrate on a wide range of benchmark sets that both ANI-1ccx and our other general-purpose data-driven method AIQM1 approach the coveted chemical accuracy of 1 kcal/mol with the speed of semiempirical quantum mechanical methods (AIQM1) or faster (ANI-1ccx). It is remarkably achieved without specifically training the machine learning parts of ANI-1ccx or AIQM1 on formation enthalpies. Importantly, we show that these data-driven methods provide statistical means for uncertainty quantification of their predictions, which we use to detect and eliminate outliers and revise reference experimental data. Uncertainty quantification may also help in the systematic improvement of such data-driven methods.

  18. Near-surface vegetation monitoring in Adventdalen, Svalbard (Rack #9, 2016-2018)

    • adc.met.no
    Updated Feb 9, 2021
    + more versions
    Cite
    Lennart Nilsen (2021). Near-surface vegetation monitoring in Adventdalen, Svalbard (Rack #9, 2016-2018) [Dataset]. https://adc.met.no/dataset/e8f5a34f-a5f6-5ff3-8111-4790fd04160f
    Explore at:
    Dataset updated
    Feb 9, 2021
    Dataset provided by
    Department of Arctic and Marine Biology
    Norwegian Meteorological Institute / Arctic Data Centre
    Authors
    Lennart Nilsen
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    May 15, 2016 - Sep 25, 2018
    Description

    NDVI, GCC, soil temperature and soil water content data from Adventdalen, Svalbard. These data were collected with a time-lapse RGB camera and an NDVI sensor installed on a two-meter-high metal rack to monitor tundra vegetation. The time-lapse photos went through a manual quality check and were automatically adjusted with an algorithm to correct for lateral and rotational movements. A mask was used to calculate the Green Chromatic Channel (GCC) from the photos. The NDVI data was quality controlled by removing outliers more than two standard deviations from the mean value of the growing season, and by removing dates when there was snow on the ground (as indicated by the time-lapse photos). In addition, soil and surface temperature and soil moisture were measured to facilitate interpretation of shifts in the vegetation indices.
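    A minimal pandas sketch of the quality control described above: drop NDVI values more than two standard deviations from the growing-season mean, and drop snow-covered dates. The column names are hypothetical.

```python
import pandas as pd

def clean_ndvi(df: pd.DataFrame) -> pd.DataFrame:
    # Two-standard-deviation filter around the growing-season mean.
    mu, sigma = df["ndvi"].mean(), df["ndvi"].std()
    within_2sd = (df["ndvi"] - mu).abs() <= 2 * sigma
    # Drop dates flagged as snow-covered (from the time-lapse photos).
    no_snow = ~df["snow_on_ground"].astype(bool)
    return df[within_2sd & no_snow]
```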

  19. Data from: PCP-SAFT Parameters of Pure Substances Using Large Experimental Databases

    • figshare.com
    zip
    Updated Sep 6, 2023
    Cite
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross (2023). PCP-SAFT Parameters of Pure Substances Using Large Experimental Databases [Dataset]. http://doi.org/10.1021/acs.iecr.3c02255.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551,172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
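    A minimal SciPy sketch of the outlier-screening idea: fit the Wagner vapor-pressure equation with a Huber loss and flag points with large residuals. The 3,6 form of the Wagner equation is assumed here, and the residual cut-off is illustrative; the paper's exact equation form, scaling, and criteria may differ.

```python
import numpy as np
from scipy.optimize import least_squares

def wagner_ln_pr(T, Tc, a, b, c, d):
    # 3,6 Wagner form: ln(P/Pc) = (a*tau + b*tau**1.5 + c*tau**3 + d*tau**6)/(T/Tc)
    tau = 1.0 - T / Tc
    return (a * tau + b * tau**1.5 + c * tau**3 + d * tau**6) / (T / Tc)

def fit_and_flag(T, ln_pr, Tc, n_sigma=3.0):
    resid = lambda p: wagner_ln_pr(T, Tc, *p) - ln_pr
    # Robust regression with a Huber loss, as described above.
    fit = least_squares(resid, x0=[-7.0, 1.5, -2.0, -3.0],
                        loss="huber", f_scale=0.1)
    r = resid(fit.x)
    outliers = np.abs(r) > n_sigma * np.std(r)   # simple residual cut-off
    return fit.x, outliers
```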

  20. Additional file 2 of Thresher: determining the number of clusters while removing outliers

    • springernature.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Min Wang; Zachary B. Abrams; Steven M. Kornblau; Kevin R. Coombes (2023). Additional file 2 of Thresher: determining the number of clusters while removing outliers [Dataset]. http://doi.org/10.6084/m9.figshare.5768622.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    figshare
    Authors
    Min Wang; Zachary B. Abrams; Steven M. Kornblau; Kevin R. Coombes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R Code for Analyses. This is a zip file containing all of the R code used to perform simulations and to analyze the breast cancer data. (ZIP 407 kb)
