Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The presence of outliers in response times can affect statistical analyses and lead to incorrect interpretation of the outcome of a study. Therefore, it is a widely accepted practice to try to minimize the effect of outliers by preprocessing the raw data. There exist numerous methods for handling outliers and researchers are free to choose among them. In this article, we use computer simulations to show that serious problems arise from this flexibility. Choosing between alternative ways for handling outliers can result in the inflation of p-values and the distortion of confidence intervals and measures of effect size. Using Bayesian parameter estimation and probability distributions with heavier tails eliminates the need to deal with response times outliers, but at the expense of opening another source of flexibility.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/SKEHRJ
Understanding the statistics of fluctuation-driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that best characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forwards and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compil...
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of the total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interview and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, as well as the supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset of households interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and record the respondent’s answer in detail so that the survey management team could decide which code best fit that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked these answers thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie below the 5th percentile or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the individual-level sections, which are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
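The percentile rule above lends itself to a simple automated flag ahead of the manual review. A minimal sketch in Python/pandas, assuming a numeric survey column (the column name is hypothetical):

```python
import pandas as pd

def flag_percentile_outliers(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Boolean mask marking values below the 5th or above the 95th percentile."""
    lo, hi = series.quantile([lower, upper])
    return (series < lo) | (series > hi)

# Hypothetical usage: flag reported amounts for verification against the interview recording
# df["amount_check"] = flag_percentile_outliers(df["amount"])
```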
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779–1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after removing noise, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal operation can be performed in ImageJ as follows: Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.
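ImageJ's Remove Outliers filter replaces a pixel with the local median whenever it deviates from that median by more than the threshold. For readers working outside ImageJ, a rough Python approximation with scipy.ndimage is sketched below; it uses a square rather than circular neighbourhood, and the function name is our own:

```python
import numpy as np
from scipy import ndimage

def remove_outliers(image: np.ndarray, radius: int = 5, threshold: float = 0.5,
                    bright: bool = True) -> np.ndarray:
    """Replace pixels that deviate from the local median by more than `threshold`.

    Approximates ImageJ's Process > Noise > Remove Outliers using a square
    neighbourhood of side 2*radius + 1 instead of a circular one.
    """
    med = ndimage.median_filter(image, size=2 * radius + 1)
    deviation = image - med if bright else med - image
    cleaned = image.copy()
    mask = deviation > threshold
    cleaned[mask] = med[mask]
    return cleaned
```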
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Local redundancy (ri), standard deviation of the least-squares (LS)-estimated outlier, and the maximum absolute correlation for each scenario of hard constraint.
## Vehicle-insurance
Vehicle Insurance data: This dataset contains multiple features describing the customer’s vehicle and insurance type.
OBJECTIVE: The business requirement is to increase CLV (customer lifetime value), which means CLV is the target variable.
Data Cleansing:
This dataset is already fairly clean, but it contains a few outliers. Remove the outliers.
Why remove outliers? Outliers are unusual values in a dataset, and they can distort statistical analyses and violate their assumptions.
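The description does not fix a particular rule, so as one possible approach, here is a minimal IQR-based filter in pandas (the column name is a placeholder):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies more than k*IQR outside the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Hypothetical usage on the target column:
# df = remove_iqr_outliers(df, "Customer Lifetime Value")
```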
Feature selection:
This step is required to remove unwanted features.
VIF and Correlation Coefficient can be used to find important features.
VIF (Variance Inflation Factor): a measure of collinearity among predictor variables in a multiple regression. It is calculated as the ratio of the variance of a coefficient estimated in the full model to the variance of that coefficient estimated in a model containing only that predictor.
Correlation Coefficient: A positive Pearson coefficient means that one variable's value increases as the other increases, and a negative Pearson coefficient means that one variable decreases as the other increases. Correlation coefficients of -1 or +1 mean the relationship is exactly linear.
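A minimal sketch of both diagnostics, assuming X is a DataFrame of numeric predictors (statsmodels and pandas are assumed to be available):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance Inflation Factor for each numeric predictor in X."""
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# Pairwise Pearson correlations between predictors:
# corr_matrix = X.corr(method="pearson")
```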
Log transformation and Normalisation: Many ML algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.
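As one way to do this, a short sketch that log-transforms right-skewed columns and then standardises the numeric features (the column lists are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def log_and_scale(df: pd.DataFrame, skewed_cols: list, numeric_cols: list) -> pd.DataFrame:
    """log1p-transform right-skewed columns, then standardise all numeric columns."""
    out = df.copy()
    out[skewed_cols] = np.log1p(out[skewed_cols])
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out
```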
Different ML algorithms are applied to the dataset for prediction; their accuracies are reported in the notebook.
Please see my work; I am open to suggestions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with the data of the manuscript "Consistency of pacing profile according to performance level in three different editions of the Chicago, London, and Tokyo marathons" published in Scientific Reports (DOI: 10.1038/s41598-022-14868-6). The dataset contains the data after pre-processing (removing outliers, calculating the variables of analysis, etc.).
https://academictorrents.com/nolicensespecified
We present a simple, effective method for solving structure from motion problems by averaging epipolar geometries. Based on recent successes in solving for global camera rotations using averaging schemes, we focus on the problem of solving for 3D camera translations given a network of noisy pairwise camera translation directions (or 3D point observations). To do this well, we have two main insights. First, we propose a method for removing outliers from problem instances by solving simpler low-dimensional subproblems, which we refer to as 1DSfM problems. Second, we present a simple, principled averaging scheme. We demonstrate this new method in the wild on Internet photo collections.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Filtered StarCoder Dataset Mini
Dataset Description
This dataset contains filtered and processed code samples from 10 popular programming languages: C, C++, C#, Go, Java, JavaScript, Python, Ruby, Scala, and TypeScript. The dataset was created by filtering source code based on quality metrics, removing outliers, and standardizing the format for machine learning and code analysis applications.
Key Features
Cleaned and Filtered Code: Samples have been processed… See the full description on the dataset page: https://huggingface.co/datasets/jugalgajjar/Filtered-StarCoder-Dataset-Mini.
Input features of the data set include Timestamp, Chilled Water Rate (L/sec), Cooling Water Temperature (C), Building Load (RT), Total Energy (kWh), Temperature (F), Dew Point (F), Humidity (%), Wind Speed (mph), Pressure (in), Hour of Day (h) and Day of Week. The training and validation data sets contain data related to a commercial building located in Singapore, from 18/08/2019 00:00 to 01/06/2020 13:00, which was refined to 13,561 data samples after removing outliers and missing values.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks
01_DATA # preprocessing and filtering of raw activity data from ChEMBL
- Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
- filt_stats.R # Filtering and preparation of raw data
- Filtered # output data sets from filt_stats.R
- toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity
02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
- datastore # files with all compounds and their calculated molecular descriptors based on SMILES
- scripts
- calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
- chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105
03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
- datastore # output files with statistics calculated by make_Z.R
- scripts
- make_Z.R # script to calculate the statistics used to compute Z-scores for the regression models
04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
- datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
- scripts
- calc_Ztable.py # computes the learning data from activity data, molecular descriptors, and Z-statistics (a minimal Z-score sketch follows this listing)
05_Regression # Performing regression. Preparation of data by removal of outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross-validation and tuning of hyperparameters.
- datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
- scripts
- data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
- Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
- Rforest.R # based on analysis of Rforest_CV.R learning of final models
rregrs_output # early analysis of regression model performance with the package RRegrs, as described in: https://doi.org/10.1186/s13321-015-0094-2
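The repository does not spell out the Z-score formula here; the step in 04_ZScores amounts to normalizing each activity value against statistics of its organism/level group. A simplified pandas sketch under that assumption (it uses plain group means rather than the moving averages computed in 03_Averages, and the column names are hypothetical):

```python
import pandas as pd

def z_normalize(df: pd.DataFrame, value_col: str = "activity",
                group_cols: tuple = ("organism", "level")) -> pd.DataFrame:
    """Z-score each measurement against the mean/std of its organism/level group."""
    grouped = df.groupby(list(group_cols))[value_col]
    out = df.copy()
    out["z"] = (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
    return out
```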
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two datasets are provided: one for the raw data after removing the outliers and the other for the preprocessed feature dataset. See the Readme file in the folder for details.
This dataset accompanies our data paper. We will provide a link to the paper once accepted.
bob_publication_data.zip
Data in 1-minute, 1-hour and 1-day interval, processed and unprocessed version as described in data paper.
bob_raw_data.zip
Raw data with original measurement interval, mostly 5 or 10 seconds.
bob_code_publication.zip
Code we used to prepare the data. We anonymised sections with personal keys and passwords.
We present sensor data from 78 honey bee colonies in Germany collected as part of a citizen science project. Each honey bee hive was equipped with five temperature sensors within the hive, one temperature sensor for outside measurements, a combined sensor for temperature, ambient air pressure and humidity, and a scale to measure the weight. During the data acquisition period, beekeepers used a web app to report their observations and beekeeping activities.
We provide the raw data with a measurement interval of up to 5 seconds as well as aggregated data with minutely, hourly, or daily average values. Furthermore, we performed several preprocessing steps: removing outliers with a threshold-based approach, excluding changes in weight that were induced by beekeeping activities, and combining the sensor data with the most important metadata from the beekeepers' observations. The data is organised in directories based on the year of recording. Alternatively, we provide subsets of the data structured based on the occurrence or non-occurrence of a swarming event or the death of a colony.
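For orientation, a minimal sketch of a threshold-based filter followed by hourly aggregation, assuming a pandas Series with a DatetimeIndex (the thresholds and names are placeholders, not the exact values used for this dataset):

```python
import pandas as pd

def clean_and_aggregate(series: pd.Series, lower: float, upper: float, freq: str = "1h") -> pd.Series:
    """Mask readings outside [lower, upper] and aggregate to `freq` averages."""
    masked = series.where(series.between(lower, upper))
    return masked.resample(freq).mean()

# Hypothetical usage for an in-hive temperature sensor:
# hourly_temp = clean_and_aggregate(hive_temperature, lower=-10, upper=50)
```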
The data can be analysed using methods from time series analysis, time series classification or other data science approaches to form a better understanding of specifics in the development of honey bee colonies.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Enthalpies of formation and reaction are important thermodynamic properties that have a crucial impact on the outcome of chemical transformations. Here we implement the calculation of enthalpies of formation with a general-purpose ANI‑1ccx neural network atomistic potential. We demonstrate on a wide range of benchmark sets that both ANI-1ccx and our other general-purpose data-driven method AIQM1 approach the coveted chemical accuracy of 1 kcal/mol with the speed of semiempirical quantum mechanical methods (AIQM1) or faster (ANI-1ccx). It is remarkably achieved without specifically training the machine learning parts of ANI-1ccx or AIQM1 on formation enthalpies. Importantly, we show that these data-driven methods provide statistical means for uncertainty quantification of their predictions, which we use to detect and eliminate outliers and revise reference experimental data. Uncertainty quantification may also help in the systematic improvement of such data-driven methods.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NDVI, GCC, soil temperature and soil water content data from Adventdalen, Svalbard. This data was collected with a time-lapse RGB camera and NDVI sensor installed on a two-meter-high metal rack to monitor tundra vegetation. The time-lapse photos have gone through a manual quality check and were automatically adjusted with an algorithm to correct for lateral and rotational movements. A mask was used to calculate the Green Chromatic Channel (GCC) from the photos. The NDVI data was quality controlled by removing outliers that were more than two standard deviations away from the mean value of the growing season, and by removing dates on which there was snow on the ground (as indicated by the time-lapse photos). In addition, soil and surface temperature and soil moisture were measured to facilitate the interpretation of shifts in the vegetation indices.
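The two-standard-deviation rule described above translates directly into a short filter; a minimal sketch in pandas, assuming the growing-season NDVI values are held in a Series:

```python
import pandas as pd

def remove_sd_outliers(values: pd.Series, n_sd: float = 2.0) -> pd.Series:
    """Keep only values within n_sd standard deviations of the series mean."""
    mu, sigma = values.mean(), values.std()
    return values[(values - mu).abs() <= n_sd * sigma]
```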
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar, and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
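To illustrate the outlier-screening idea (not the authors' exact implementation), the sketch below fits a Wagner-type vapor-pressure correlation with a Huber loss via scipy; the critical constants Tc and pc are assumed known, and the starting values and f_scale are placeholders:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_wagner(T: np.ndarray, p: np.ndarray, Tc: float, pc: float):
    """Fit Wagner (3,6) parameters a, b, c, d to vapor-pressure data with a Huber loss.

    Model: ln(p/pc) = (a*tau + b*tau**1.5 + c*tau**3 + d*tau**6) * Tc / T,  tau = 1 - T/Tc
    """
    tau = 1.0 - T / Tc
    y = np.log(p / pc)

    def residuals(params):
        a, b, c, d = params
        return (a * tau + b * tau**1.5 + c * tau**3 + d * tau**6) * Tc / T - y

    # loss="huber" down-weights large residuals, limiting the influence of outlying points
    return least_squares(residuals, x0=[-7.0, 1.5, -2.0, -3.0], loss="huber", f_scale=0.1)
```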
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R Code for Analyses. This is a zip file containing all of the R code used to perform simulations and to analyze the breast cancer data. (ZIP 407 kb)