Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The presence of outliers in response times can affect statistical analyses and lead to incorrect interpretation of the outcome of a study. Therefore, it is a widely accepted practice to try to minimize the effect of outliers by preprocessing the raw data. There exist numerous methods for handling outliers and researchers are free to choose among them. In this article, we use computer simulations to show that serious problems arise from this flexibility. Choosing between alternative ways for handling outliers can result in the inflation of p-values and the distortion of confidence intervals and measures of effect size. Using Bayesian parameter estimation and probability distributions with heavier tails eliminates the need to deal with response times outliers, but at the expense of opening another source of flexibility.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/SKEHRJ
Understanding the statistics of fluctuation-driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that best characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forwards and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compil...
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of the total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interview and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, as well as the supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset of households interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and record the respondent’s answer in detail so that the survey management team could decide which code best fit that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked these answers thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie below the 5th percentile or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the individual-level sections, which are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
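The percentile rule above lends itself to a simple automated flag ahead of the manual review. A minimal sketch in Python/pandas, assuming a numeric survey column (the column name is hypothetical):

```python
import pandas as pd

def flag_percentile_outliers(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Boolean mask marking values below the 5th or above the 95th percentile."""
    lo, hi = series.quantile([lower, upper])
    return (series < lo) | (series > hi)

# Hypothetical usage: flag reported amounts for verification against the interview recording
# df["amount_check"] = flag_percentile_outliers(df["amount"])
```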
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779–1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after removing noise, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal operation can be performed in ImageJ as follows: Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.
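ImageJ's Remove Outliers filter replaces a pixel with the local median whenever it deviates from that median by more than the threshold. For readers working outside ImageJ, a rough Python approximation with scipy.ndimage is sketched below; it uses a square rather than circular neighbourhood, and the function name is our own:

```python
import numpy as np
from scipy import ndimage

def remove_outliers(image: np.ndarray, radius: int = 5, threshold: float = 0.5,
                    bright: bool = True) -> np.ndarray:
    """Replace pixels that deviate from the local median by more than `threshold`.

    Approximates ImageJ's Process > Noise > Remove Outliers using a square
    neighbourhood of side 2*radius + 1 instead of a circular one.
    """
    med = ndimage.median_filter(image, size=2 * radius + 1)
    deviation = image - med if bright else med - image
    cleaned = image.copy()
    mask = deviation > threshold
    cleaned[mask] = med[mask]
    return cleaned
```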
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Local redundancy (ri), standard deviation of the least-squares (LS)-estimated outlier, and the maximum absolute correlation for each scenario of hard constraint.
## Vehicle-insurance
Vehicle Insurance data: This dataset contains multiple features describing the customer’s vehicle and insurance type.
OBJECTIVE: The business requirement is to increase CLV (customer lifetime value), which means CLV is the target variable.
Data Cleansing:
This dataset is already fairly clean, but it contains a few outliers. Remove the outliers.
Why remove outliers? Outliers are unusual values in a dataset, and they can distort statistical analyses and violate their assumptions.
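The description does not fix a particular rule, so as one possible approach, here is a minimal IQR-based filter in pandas (the column name is a placeholder):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies more than k*IQR outside the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Hypothetical usage on the target column:
# df = remove_iqr_outliers(df, "Customer Lifetime Value")
```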
Feature selection:
This step is required to remove unwanted features.
VIF and Correlation Coefficient can be used to find important features.
VIF (Variance Inflation Factor): a measure of collinearity among predictor variables in a multiple regression. It is calculated as the ratio of the variance of a coefficient estimated in the full model to the variance of that coefficient estimated in a model containing only that predictor.
Correlation Coefficient: A positive Pearson coefficient means that one variable's value increases as the other increases, and a negative Pearson coefficient means that one variable decreases as the other increases. Correlation coefficients of -1 or +1 mean the relationship is exactly linear.
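A minimal sketch of both diagnostics, assuming X is a DataFrame of numeric predictors (statsmodels and pandas are assumed to be available):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance Inflation Factor for each numeric predictor in X."""
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# Pairwise Pearson correlations between predictors:
# corr_matrix = X.corr(method="pearson")
```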
Log transformation and Normalisation: Many ML algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.
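As one way to do this, a short sketch that log-transforms right-skewed columns and then standardises the numeric features (the column lists are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def log_and_scale(df: pd.DataFrame, skewed_cols: list, numeric_cols: list) -> pd.DataFrame:
    """log1p-transform right-skewed columns, then standardise all numeric columns."""
    out = df.copy()
    out[skewed_cols] = np.log1p(out[skewed_cols])
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out
```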
Different ML algorithms are applied to the dataset for prediction; their accuracies are reported in the notebook.
Please see my work; I am open to suggestions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with the data of the manuscript "Consistency of pacing profile according to performance level in three different editions of the Chicago, London, and Tokyo marathons" published in Scientific Reports (DOI: 10.1038/s41598-022-14868-6). The dataset contains the data after pre-processing (removing outliers, calculating the variables of analysis, etc.).
https://academictorrents.com/nolicensespecified
We present a simple, effective method for solving structure from motion problems by averaging epipolar geometries. Based on recent successes in solving for global camera rotations using averaging schemes, we focus on the problem of solving for 3D camera translations given a network of noisy pairwise camera translation directions (or 3D point observations). To do this well, we have two main insights. First, we propose a method for removing outliers from problem instances by solving simpler low-dimensional subproblems, which we refer to as 1DSfM problems. Second, we present a simple, principled averaging scheme. We demonstrate this new method in the wild on Internet photo collections.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Filtered StarCoder Dataset Mini
Dataset Description
This dataset contains filtered and processed code samples from 10 popular programming languages: C, C++, C#, Go, Java, JavaScript, Python, Ruby, Scala, and TypeScript. The dataset was created by filtering source code based on quality metrics, removing outliers, and standardizing the format for machine learning and code analysis applications.
Key Features
Cleaned and Filtered Code: Samples have been processed… See the full description on the dataset page: https://huggingface.co/datasets/jugalgajjar/Filtered-StarCoder-Dataset-Mini.
Input features of the data set include Timestamp, Chilled Water Rate (L/sec), Cooling Water Temperature (C), Building Load (RT), Total Energy (kWh), Temperature (F), Dew Point (F), Humidity (%), Wind Speed (mph), Pressure (in), Hour of Day (h) and Day of Week. The training and validation data sets contain data related to a commercial building located in Singapore, from 18/08/2019 00:00 to 01/06/2020 13:00, which was refined to 13,561 data samples after removing outliers and missing values.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks
01_DATA # preprocessing and filtering of raw activity data from ChEMBL
- Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
- filt_stats.R # Filtering and preparation of raw data
- Filtered # output data sets from filt_stats.R
- toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity
02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
- datastore # files with all compounds and their calculated molecular descriptors based on SMILES
- scripts
- calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
- chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105
03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
- datastore # output files with statistics calculated by make_Z.R
- scripts
- make_Z.R # script to calculate the statistics used to compute Z-scores for the regression models
04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
- datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
- scripts
- calc_Ztable.py # computes the learning data from activity data, molecular descriptors, and Z-statistics (a minimal Z-score sketch follows this listing)
05_Regression # Performing regression. Preparation of data by removal of outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross-validation and tuning of hyperparameters.
- datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
- scripts
- data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
- Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
- Rforest.R # based on analysis of Rforest_CV.R learning of final models
rregrs_output # early analysis of regression model performance with the package RRegrs, as described in: https://doi.org/10.1186/s13321-015-0094-2
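The repository does not spell out the Z-score formula here; the step in 04_ZScores amounts to normalizing each activity value against statistics of its organism/level group. A simplified pandas sketch under that assumption (it uses plain group means rather than the moving averages computed in 03_Averages, and the column names are hypothetical):

```python
import pandas as pd

def z_normalize(df: pd.DataFrame, value_col: str = "activity",
                group_cols: tuple = ("organism", "level")) -> pd.DataFrame:
    """Z-score each measurement against the mean/std of its organism/level group."""
    grouped = df.groupby(list(group_cols))[value_col]
    out = df.copy()
    out["z"] = (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
    return out
```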
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two datasets are provided: one for the raw data after removing the outliers and the other for the preprocessed feature dataset. See the Readme file in the folder for details.
This dataset accompanies our data paper. We will provide a link to the paper once accepted.
bob_publication_data.zip
Data in 1-minute, 1-hour and 1-day interval, processed and unprocessed version as described in data paper.
bob_raw_data.zip
Raw data with original measurement interval, mostly 5 or 10 seconds.
bob_code_publication.zip
Code we used to prepare the data. We anonymised sections with personal keys and passwords.
We present sensor data from 78 honey bee colonies in Germany collected as part of a citizen science project. Each honey bee hive was equipped with five temperature sensors within the hive, one temperature sensor for outside measurements, a combined sensor for temperature, ambient air pressure and humidity, and a scale to measure the weight. During the data acquisition period, beekeepers used a web app to report their observations and beekeeping activities.
We provide the raw data with a measurement interval of up to 5 seconds as well as aggregated data with minutely, hourly, or daily average values. Furthermore, we performed several preprocessing steps: removing outliers with a threshold-based approach, excluding changes in weight that were induced by beekeeping activities, and combining the sensor data with the most important metadata from the beekeepers' observations. The data is organised in directories based on the year of recording. Alternatively, we provide subsets of the data structured based on the occurrence or non-occurrence of a swarming event or the death of a colony.
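For orientation, a minimal sketch of a threshold-based filter followed by hourly aggregation, assuming a pandas Series with a DatetimeIndex (the thresholds and names are placeholders, not the exact values used for this dataset):

```python
import pandas as pd

def clean_and_aggregate(series: pd.Series, lower: float, upper: float, freq: str = "1h") -> pd.Series:
    """Mask readings outside [lower, upper] and aggregate to `freq` averages."""
    masked = series.where(series.between(lower, upper))
    return masked.resample(freq).mean()

# Hypothetical usage for an in-hive temperature sensor:
# hourly_temp = clean_and_aggregate(hive_temperature, lower=-10, upper=50)
```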
The data can be analysed using methods from time series analysis, time series classification or other data science approaches to form a better understanding of specifics in the development of honey bee colonies.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Enthalpies of formation and reaction are important thermodynamic properties that have a crucial impact on the outcome of chemical transformations. Here we implement the calculation of enthalpies of formation with a general-purpose ANI‑1ccx neural network atomistic potential. We demonstrate on a wide range of benchmark sets that both ANI-1ccx and our other general-purpose data-driven method AIQM1 approach the coveted chemical accuracy of 1 kcal/mol with the speed of semiempirical quantum mechanical methods (AIQM1) or faster (ANI-1ccx). It is remarkably achieved without specifically training the machine learning parts of ANI-1ccx or AIQM1 on formation enthalpies. Importantly, we show that these data-driven methods provide statistical means for uncertainty quantification of their predictions, which we use to detect and eliminate outliers and revise reference experimental data. Uncertainty quantification may also help in the systematic improvement of such data-driven methods.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NDVI, GCC, soil temperature and soil water content data from Adventdalen, Svalbard. This data was collected with a time-lapse RGB camera and NDVI sensor installed on a two-meter-high metal rack to monitor tundra vegetation. The time-lapse photos have gone through a manual quality check and were automatically adjusted with an algorithm to correct for lateral and rotational movements. A mask was used to calculate the Green Chromatic Channel (GCC) from the photos. The NDVI data was quality controlled by removing outliers that were more than two standard deviations away from the mean value of the growing season, and by removing dates on which there was snow on the ground (as indicated by the time-lapse photos). In addition, soil and surface temperature and soil moisture were measured to facilitate the interpretation of shifts in the vegetation indices.
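The two-standard-deviation rule described above translates directly into a short filter; a minimal sketch in pandas, assuming the growing-season NDVI values are held in a Series:

```python
import pandas as pd

def remove_sd_outliers(values: pd.Series, n_sd: float = 2.0) -> pd.Series:
    """Keep only values within n_sd standard deviations of the series mean."""
    mu, sigma = values.mean(), values.std()
    return values[(values - mu).abs() <= n_sd * sigma]
```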
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar, and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
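To illustrate the outlier-screening idea (not the authors' exact implementation), the sketch below fits a Wagner-type vapor-pressure correlation with a Huber loss via scipy; the critical constants Tc and pc are assumed known, and the starting values and f_scale are placeholders:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_wagner(T: np.ndarray, p: np.ndarray, Tc: float, pc: float):
    """Fit Wagner (3,6) parameters a, b, c, d to vapor-pressure data with a Huber loss.

    Model: ln(p/pc) = (a*tau + b*tau**1.5 + c*tau**3 + d*tau**6) * Tc / T,  tau = 1 - T/Tc
    """
    tau = 1.0 - T / Tc
    y = np.log(p / pc)

    def residuals(params):
        a, b, c, d = params
        return (a * tau + b * tau**1.5 + c * tau**3 + d * tau**6) * Tc / T - y

    # loss="huber" down-weights large residuals, limiting the influence of outlying points
    return least_squares(residuals, x0=[-7.0, 1.5, -2.0, -3.0], loss="huber", f_scale=0.1)
```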
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R Code for Analyses. This is a zip file containing all of the R code used to perform simulations and to analyze the breast cancer data. (ZIP 407 kb)