Understanding the statistics of fluctuation-driven flows in the boundary layer of magnetically confined plasmas is required to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allows the plasma parameters to be sampled on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs in cases where the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid or invalid, we find that such a classification may be ambiguous for up to 40% of the data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By construction, the coordinates in this space are the features that best characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases into the analysis. By removing the outliers identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
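To illustrate the latent-space idea, the sketch below substitutes a linear autoencoder (equivalent to PCA, computed via SVD) for the nonlinear AE of the study; all data, dimensions, and the acceptance threshold are synthetic assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: "valid" samples live near a 2-D subspace of a
# 16-D measurement space; true outliers do not. Dimensions are illustrative.
valid = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 16))
valid += 0.05 * rng.normal(size=valid.shape)

# A linear autoencoder is equivalent to PCA, so the SVD supplies the encoder
# here; the paper's nonlinear AE is not reproduced by this sketch.
mean = valid.mean(axis=0)
_, _, vt = np.linalg.svd(valid - mean, full_matrices=False)
encode = lambda x: (x - mean) @ vt[:2].T          # 2-D latent coordinates
decode = lambda z: z @ vt[:2] + mean

# Ambiguous samples are screened by reconstruction error from the latent
# space; the threshold of 1.0 is an assumed, illustrative choice.
ambiguous = np.vstack([valid[:5] + 0.01, 3.0 * rng.normal(size=(5, 16))])
err = np.linalg.norm(ambiguous - decode(encode(ambiguous)), axis=1)
labels = err < 1.0                                 # True = classified valid
```

In the paper the classification in the latent space is done with standard vectorial classifiers rather than a fixed reconstruction-error cutoff; the cutoff here only makes the sketch self-contained.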
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene expression data have been presented as non-normalized values (2^-Ct × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student's t-test, p < 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
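The back-calculation mentioned in the legend can be sketched as follows, assuming the reported values are 2^-Ct × 10^9 (the exponent formatting in the source legend is garbled, so treat the exact formula as an assumption):

```python
import math

def ct_from_expression(value):
    """Recover the raw threshold cycle (Ct) from a reported expression value,
    assuming the table reports 2**(-Ct) * 1e9 (an assumption; see legend)."""
    return -math.log2(value / 1e9)

ct = ct_from_expression(2 ** -20 * 1e9)   # a reported ~953.67 maps back to Ct = 20
```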
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LOF calculation time (seconds) comparison.
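For context, the Local Outlier Factor (LOF) can be implemented and timed in a few lines of NumPy. The sketch below is quadratic in the number of points, purely illustrative, and unrelated to the implementations compared in the table.

```python
import time
import numpy as np

def lof_scores(X, k=5):
    """Minimal Local Outlier Factor in plain NumPy; written for illustration
    and simple timing, not for speed (O(n^2) memory and time)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # exclude self-distance
    knn = np.argsort(D, axis=1)[:, :k]               # k nearest neighbours
    kdist = D[np.arange(len(X)), knn[:, -1]]         # k-distance per point
    # reachability distance of p from o: max(k-distance(o), d(p, o))
    reach = np.maximum(kdist[knn], D[np.arange(len(X))[:, None], knn])
    lrd = k / reach.sum(axis=1)                      # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # LOF score per point

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(300, 2)), [[8.0, 8.0]]])  # one clear outlier

t0 = time.perf_counter()
scores = lof_scores(X)
elapsed = time.perf_counter() - t0                   # seconds, as in the table
```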
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect sizes were calculated using the mean difference for burnt-unburnt study designs and the mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.
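The two effect-size calculations can be sketched as follows; all data values below are made up for illustration.

```python
import numpy as np

# Burnt-unburnt design: effect size as the difference of group means.
burnt = np.array([3.1, 2.8, 3.5])
unburnt = np.array([5.0, 4.6, 5.4])
mean_difference = burnt.mean() - unburnt.mean()

# Before-after design: effect size as the mean of paired changes.
before = np.array([5.2, 4.9, 5.1])
after = np.array([3.0, 3.4, 2.9])
mean_change = (after - before).mean()
```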
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outlier detection is key to the quality control of marine survey data. For the detection of outliers in Conductivity-Temperature-Depth (CTD) data, previous methods, such as the Wild Edit method and the Median Filter Combined with Maximum Deviation method, mostly set a threshold based on statistics. Values greater than the threshold are treated as outliers, but there is no clear rule for selecting the threshold, so multiple attempts are required. The process is time-consuming and inefficient, and the results have high false negative and false positive rates. In response to this problem, we propose an outlier detection method for CTD conductivity data based on a physical constraint, the continuity of seawater. The method constructs a cubic spline fitting function, based on the independent points scheme and cubic spline interpolation, to fit the conductivity data. The points with the maximum fitting residuals are flagged as outliers. The fitting stops when the optimal number of iterations is reached, which is obtained automatically from the minimum value of the sequence of maximum fitting residuals. Verification of the accuracy and stability of the method on examples shows that it has a lower false negative rate (17.88%) and false positive rate (0.24%) than other methods. Indeed, the rates for the Wild Edit method are 56.96% and 2.19%, while those for the Median Filter Combined with Maximum Deviation method are 23.28% and 0.31%. The cubic spline fitting method is simple to operate, yields clear and definite results, and better solves the problem of conductivity outlier detection.
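The iterate-and-stop idea can be sketched as below. The knot scheme, iteration count, and synthetic profile are assumptions for illustration; the paper's independent points scheme is only approximated by subsampling the retained points.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def flag_outliers(depth, cond, knot_step=10, n_iter=8):
    """Fit a cubic spline through a subset of 'independent points', remove
    the worst-fitting sample each pass, and keep only the removals made
    before the minimum of the max-residual sequence."""
    keep = np.ones(len(cond), dtype=bool)
    flagged, max_resid = [], []
    for _ in range(n_iter):
        idx = np.flatnonzero(keep)
        knots = idx[::knot_step]
        if knots[-1] != idx[-1]:
            knots = np.append(knots, idx[-1])        # always include the endpoint
        spline = CubicSpline(depth[knots], cond[knots])
        resid = np.abs(cond[idx] - spline(depth[idx]))
        max_resid.append(resid.max())
        flagged.append(idx[np.argmax(resid)])
        keep[flagged[-1]] = False
    return flagged[:int(np.argmin(max_resid))]       # stop at the minimum

# Synthetic profile: smooth conductivity gradient with two injected spikes.
depth = np.linspace(0.0, 100.0, 101)
cond = 35.0 - 0.05 * depth
cond[33] += 1.5
cond[77] -= 1.0
outliers = flag_outliers(depth, cond)
```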
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outliers are often present in large datasets of water quality monitoring time series. A method combining the sliding window technique with the Dixon detection criterion for the automatic detection of outliers in time series data is limited by the empirical determination of the sliding window size; determining the optimal window size systematically is therefore a worthwhile research problem. This paper presents a new Monte Carlo Search Method (MCSM), based on random sampling, to optimize the size of the sliding window, taking full advantage of computational and statistical techniques. The MCSM was applied in a case study to automatic monitoring data of water quality factors in order to test its validity and usefulness. Comparisons of accuracy and efficiency show that the new method is effective: at different sample sizes, the average accuracy is between 58.70% and 75.75%, and the average increase in computation time is between 17.09% and 45.53%. In the era of big data in environmental monitoring, the proposed method can meet the required accuracy of outlier detection while improving computational efficiency.
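The Monte Carlo search over window sizes can be sketched as follows. The flagging rule is a simplified stand-in for the Dixon criterion (a fixed ratio of 0.6 replaces the tabulated critical values), and the series, spike positions, and search range are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dixon_like_flags(x, window):
    """Flag the newest sample in each sliding window when its gap to the
    rest of the window is large relative to the window range."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        w = np.sort(x[i - window:i + 1])
        span = w[-1] - w[0]
        if span == 0:
            continue
        if x[i] == w[-1]:
            q = (w[-1] - w[-2]) / span       # suspect is the window maximum
        elif x[i] == w[0]:
            q = (w[1] - w[0]) / span         # suspect is the window minimum
        else:
            continue
        flags[i] = q > 0.6
    return flags

# Monte Carlo search: sample candidate window sizes at random and keep the
# most accurate one on a labelled series.
x = np.sin(np.linspace(0.0, 20.0, 400)) + 0.05 * rng.normal(size=400)
truth = np.zeros(400, dtype=bool)
x[[50, 180, 300]] += 3.0
truth[[50, 180, 300]] = True

best_w, best_acc = None, -1.0
for w in rng.integers(5, 60, size=30):
    acc = float(np.mean(dixon_like_flags(x, int(w)) == truth))
    if acc > best_acc:
        best_w, best_acc = int(w), acc
```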
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on reason(s) for considering a concentration as an outlier are included.
This dataset is used to determine whether a case qualifies for outlier payments under the hospital inpatient prospective payment system (IPPS): hospital-specific cost-to-charge ratios are applied to the total covered charges for the case. Operating and capital costs for the case are calculated separately, by applying separate operating and capital cost-to-charge ratios, and these costs are combined and compared with the fixed-loss outlier threshold.
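The comparison described above amounts to the following arithmetic; the cost-to-charge ratios (CCRs) and the fixed-loss threshold below are made-up figures for illustration, not actual CMS values.

```python
# Illustrative IPPS outlier-qualification check (all figures assumed).
total_covered_charges = 120_000.00

operating_ccr = 0.25   # hospital-specific operating CCR (assumed)
capital_ccr = 0.04     # hospital-specific capital CCR (assumed)

# Operating and capital costs are calculated separately, then combined.
operating_cost = total_covered_charges * operating_ccr
capital_cost = total_covered_charges * capital_ccr
estimated_cost = operating_cost + capital_cost

fixed_loss_outlier_threshold = 30_000.00   # assumed for illustration
qualifies_for_outlier_payment = estimated_cost > fixed_loss_outlier_threshold
```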
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
The median (synonyms: 50th percentile, central value) is used as the mean value. It is the value above or below which 50% of all cases in a data group lie. The calculation is carried out on outlier-adjusted data collectives. The total content is determined from the aqua regia extract (according to DIN ISO 11466 (1997)). The concentration is given in mg/kg. The content classes take into account, among other things, the precautionary values of the BBodSchV (1999). These are 40 mg/kg for the soil type sand, 70 mg/kg for loam, silt and very silty sand, and 100 mg/kg for clay. According to LABO (2003), a sample count of >=20 is required for the calculation of background values. However, the map also shows groups with a sample count >=10. This information is then only indicative and not representative.
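The reported statistic — the median of an outlier-adjusted data collective — can be sketched as below. The MAD-based screen and all sample values are assumed illustrations, not the method actually used for the map.

```python
import numpy as np

# Fake content values in mg/kg, with one gross outlier.
samples = np.array([22., 25., 31., 28., 24., 30., 27., 210.])

med = np.median(samples)
mad = np.median(np.abs(samples - med))
kept = samples[np.abs(samples - med) < 5 * mad]   # crude outlier adjustment

background_value = np.median(kept)                # median of adjusted collective
enough_samples = len(kept) >= 20                  # LABO (2003): >=20 required
```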
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
We are enclosing the database used in our research titled "Concentration and Geospatial Modelling of Health Development Offices' Accessibility for the Total and Elderly Populations in Hungary", along with our statistical calculations. For the sake of reproducibility, further information can be found in the files Short_Description_of_Data_Analysis.pdf and Statistical_formulas.pdf.
Sharing these data is part of our aim to strengthen the foundations of our scientific research. As of March 7, 2024, the submission of our research findings to a scientific journal, together with their detailed analysis, has not yet been completed.
The dataset was expanded on 23rd September 2024 to include SPSS statistical analysis data, a heatmap, and buffer zone analysis around the Health Development Offices (HDOs) created in QGIS software.
Short Description of Data Analysis and Attached Files (datasets):
Our research utilised data from 2022, which served as the basis for statistical standardisation. The 2022 Hungarian census provided an objective basis for our analysis, with age group data available at the county level from the Hungarian Central Statistical Office (KSH) website. The 2022 demographic data provided a more accurate picture than the data available from the 2023 microcensus. Our calculations are based on our standardisation of the 2022 data. For xlsx files, we used MS Excel 2019 (version: 1808, build: 10406.20006) with the SOLVER add-in.
The Hungarian Central Statistical Office served as the data source for population by age group, county, and region: https://www.ksh.hu/stadat_files/nep/hu/nep0035.html (accessed 04 Jan. 2024), with data recorded in MS Excel in the Data_of_demography.xlsx file.
In 2022, 108 Health Development Offices (HDOs) were operational, and it is noteworthy that no developments have occurred in this area since 2022. The availability of these offices and the demographic data from the Central Statistical Office in Hungary are considered public interest data, freely usable for research purposes without requiring permission.
The contact details for the Health Development Offices were sourced from the following page (Hungarian National Population Centre (NNK)): https://www.nnk.gov.hu/index.php/efi (n=107). The Semmelweis University Health Development Centre was not listed by NNK, hence it was separately recorded as the 108th HDO. More information about the office can be found here: https://semmelweis.hu/egeszsegfejlesztes/en/ (n=1). (accessed 05 Dec. 2023.)
Geocoordinates were determined using Google Maps (N=108): https://www.google.com/maps. (accessed 02 Jan. 2024.) Recording of geocoordinates (latitude and longitude according to WGS 84 standard), address data (postal code, town name, street, and house number), and the name of each HDO was carried out in the: Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file.
The foundational software for geospatial modelling and display, QGIS 3.34, is open source and can be downloaded from: https://qgis.org/en/site/forusers/download.html (accessed 04 Jan. 2024).
The HDOs_GeoCoordinates.gpkg QGIS project file contains Hungary's administrative map and the recorded addresses of the HDOs, imported from the Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file.
The OpenStreetMap tileset is directly accessible from www.openstreetmap.org in QGIS. (accessed 04 Jan. 2024.)
The Hungarian county administrative boundaries were downloaded from the following website: https://data2.openstreetmap.hu/hatarok/index.php?admin=6 (accessed 04 Jan. 2024.)
HDO_Buffers.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding buffer zones with a radius of 7.5 km.
Heatmap.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding heatmap (Kernel Density Estimation).
A brief description of the statistical formulas applied is included in the Statistical_formulas.pdf.
Recording of our base data for statistical concentration and diversification measurement was done using MS Excel 2019 (version: 1808, build: 10406.20006) in .xlsx format.
Using the SPSS 29.0.1.0 program, we performed the following statistical calculations with the databases Data_HDOs_population_without_outliers.sav and Data_HDOs_population.sav:
For easier readability, the files have been provided in both SPV and PDF formats.
The translation of these supplementary files into English was completed on 23rd Sept. 2024.
If you have any further questions regarding the dataset, please contact the corresponding author: domjan.peter@phd.semmelweis.hu
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code to calculate Probability-Density-Ranking (PDR) outliers and Most Probable Range (MPR)
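One possible reading of the PDR idea is sketched below in NumPy: rank points by an estimated probability density and flag the lowest-density fraction. The Gaussian kernel, Silverman-style bandwidth, and 5% cutoff are all assumptions and may differ from the released code.

```python
import numpy as np

def pdr_outliers(x, frac=0.05, bandwidth=None):
    """Flag the lowest-density fraction of points under a Gaussian kernel
    density estimate (illustrative sketch of density ranking)."""
    x = np.asarray(x, dtype=float)
    h = bandwidth or 1.06 * x.std() * len(x) ** -0.2   # Silverman-style rule
    dens = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    flags = np.zeros(len(x), dtype=bool)
    flags[np.argsort(dens)[:int(frac * len(x))]] = True  # lowest-density points
    return flags

x = np.concatenate([np.linspace(0.0, 1.0, 95), np.full(5, 10.0)])
flags = pdr_outliers(x)   # flags the five detached points at 10.0
```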
Data licence Germany – Attribution – Version 2.0 https://www.govdata.de/dl-de/by-2-0
License information was derived automatically
The median (synonyms: 50th percentile, central value) is used as the mean value. It is the value above or below which 50% of all cases in a data group lie. The calculation is carried out on outlier-free data collectives. The total content is determined from the aqua regia extract (according to DIN ISO 11466 (1997)). The concentration is given in mg/kg. The BBodSchV (1999) does not set any precautionary values for arsenic. According to LABO (2003), a sample count of >=20 is required for the calculation of background values. However, groups with a sample count >=10 are also shown on the map. This information is then only indicative and not representative. Further information on definitions of terms, horizon grouping, and statistical evaluation: http://mapserver.lgb-rlp.de/php_hgw_bod/meta/Background values_Hinweise.pdf. For terms of use, see: http://www.lgb-rlp.de/karten-und- products/online-maps/terms-of-use-for-online-maps.html
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The median (synonyms: 50th percentile, central value) is used as the mean value. It is the value above or below which 50% of all cases in a data group lie. The calculation is carried out on outlier-adjusted data collectives. The total content is determined from the aqua regia extract (according to DIN ISO 11466 (1997)). The concentration is given in mg/kg. The content classes take into account, among other things, the precautionary values of the BBodSchV (1999). These are 15 mg/kg for the soil type sand, 50 mg/kg for loam, silt and heavily silty sand, and 70 mg/kg for clay. According to LABO (2003), a sample count of >=20 is required for the calculation of background values. However, the map also shows groups with a sample count >=10. This information is then only indicative and not representative.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the years, a number of different computational methods have been developed for the identification of outliers. Methods for robust estimation are known in the set of M-estimation methods (derived from the method of Maximum Likelihood Estimation) or in the set of R-estimation methods (robust estimation based on the application of rank tests). There are also algorithms not classified in either of these groups that are nevertheless resistant to gross errors, for example M-split estimation. Another proposal, which can be used to detect outliers in the process of coordinate transformation, where the coordinates of some points may be affected by gross errors, is the RANSAC (Random Sample Consensus) algorithm. The authors present a study of 2D transformation parameter estimation using the RANSAC algorithm to detect points whose coordinates contain outliers. The calculations were performed in three scenarios on a real geodetic network. Selected coordinates were burdened with simulated errors to confirm the efficiency of the proposed method.
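The RANSAC idea applied to a 2D transformation can be sketched as follows; the similarity (Helmert-type) transform, the tolerance, and the simulated network are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform: (x, y) -> (a*x - b*y + tx, b*x + a*y + ty)."""
    x, y = src[:, 0], src[:, 1]
    A = np.zeros((2 * len(src), 4))
    A[0::2] = np.column_stack([x, -y, np.ones_like(x), np.zeros_like(x)])
    A[1::2] = np.column_stack([y, x, np.zeros_like(x), np.ones_like(x)])
    p, *_ = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)
    return p

def apply_sim(p, pts):
    a, b, tx, ty = p
    return np.column_stack([a * pts[:, 0] - b * pts[:, 1] + tx,
                            b * pts[:, 0] + a * pts[:, 1] + ty])

def ransac_2d(src, dst, n_iter=200, tol=0.05):
    """Repeatedly fit the transform to a minimal 2-point sample, keep the
    largest consensus set; points outside it are suspected gross errors."""
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        i = rng.choice(len(src), size=2, replace=False)
        resid = np.linalg.norm(apply_sim(fit_similarity(src[i], dst[i]), src) - dst, axis=1)
        inliers = resid < tol
        if inliers.sum() > best.sum():
            best = inliers
    return fit_similarity(src[best], dst[best]), best

# Simulated network: 12 points, 2 of them burdened with gross coordinate errors.
src = rng.uniform(0.0, 100.0, size=(12, 2))
true_p = np.array([0.866, 0.5, 5.0, -3.0])        # rotation + translation
dst = apply_sim(true_p, src)
dst[[3, 8]] += 1.0                                 # simulated gross errors
params, inlier_mask = ransac_2d(src, dst)
```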
Data licence Germany – Attribution – Version 2.0 https://www.govdata.de/dl-de/by-2-0
License information was derived automatically
The median (synonyms: 50th percentile, central value) is used as the mean value. It is the value above or below which 50% of all cases in a data group lie. The calculation is carried out on outlier-free data collectives. The total content is determined from the aqua regia extract (according to DIN ISO 11466 (1997)). The concentration is given in mg/kg. The content classes take into account, among other things, the precautionary values of the BBodSchV (1999). These are 40 mg/kg for the soil type sand, 70 mg/kg for loam, silt and very silty sand, and 100 mg/kg for clay. According to LABO (2003), a sample count of >=20 is required for the calculation of background values. However, groups with a sample count >=10 are also shown on the map. This information is then only indicative and not representative.
20141220 Database for PLoS ONE Manuscript
aHR calculated in the younger subgroup was the dose-response evaluation assessed per 5 U/l of ALT increment; HR calculated in the older subgroup was the evaluation comparing higher and lower ALT categories. Abbreviations: CI: confidence interval; CV: cardiovascular; HR: hazard ratio.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automated flow cytometry (FCM) adapted to real-time quality surveillance provides high-temporal-resolution data about the microbial communities in a water system. The cell concentration calculated from FCM measurements indicates sudden increases in the number of bacteria, but can fluctuate significantly due to man-made and natural dynamics; it can thus obscure the presence of microbial anomalies. Cytometric fingerprinting tools enable a detailed analysis of the aquatic microbial communities, and could distinguish between normal and abnormal community changes. However, the vast majority of current cytometric fingerprinting tools use offline statistical computations which cannot detect anomalies immediately. Here, we present a computational model, entitled Microbial Community Change Detection (MCCD), which transforms microbial community characteristics into an online process control signal (herein called the outlier score) that remains close to zero if the microbial community remains stable and increases with fluctuations in the community. The model is based on fingerprints and distance-based outlier calculations. We tested it in silico and in vitro by simulating acute contaminations of real-world water systems with large inherent microbial fluctuations. We showed that the outlier score was robust against these dynamic variations, while reliably detecting intentional contaminations. This model can be used with automated FCM to quickly detect potential microbiological contamination, especially when the time between treatment and distribution is very short.
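A distance-based outlier score over fingerprints, in the spirit described above, can be sketched as follows; the fingerprint representation (relative abundances over 8 bins), the k-nearest-neighbour scoring, and the simulated contamination are all assumptions, not the MCCD model itself.

```python
import numpy as np

rng = np.random.default_rng(3)

def outlier_score(history, current, k=5):
    """Distance-based outlier score: mean distance from the current
    fingerprint to its k nearest neighbours among recent fingerprints.
    Stays near zero while the community is stable, rises on abrupt change."""
    d = np.linalg.norm(history - current, axis=1)
    return float(np.sort(d)[:k].mean())

# Synthetic fingerprints: a stable community, then a simulated contamination
# that shifts most of the mass into one bin.
stable = rng.dirichlet(np.full(8, 50.0), size=60)
contaminated = rng.dirichlet(np.array([5.0] * 7 + [200.0]))

score_stable = outlier_score(stable[:-1], stable[-1])
score_event = outlier_score(stable[:-1], contaminated)
```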
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This work reports pure-component parameters for the PCP-SAFT equation of state for 1842 substances, using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar, and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid density (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
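The Huber-loss screening step can be sketched as below: fit a Wagner-type vapor-pressure equation robustly, then flag large residuals. The critical constants, coefficients, and thresholds are synthetic illustrations, not the authors' workflow or any real substance's parameters.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic "substance" (assumed values, for illustration only).
Tc, pc = 500.0, 4.0e6
true_coeffs = np.array([-7.5, 1.5, -3.0, -2.0])

def wagner_lnp(theta, T):
    """Wagner-type equation: ln p = (a*tau + b*tau^1.5 + c*tau^3 + d*tau^6)/(1-tau) + ln pc."""
    a, b, c, d = theta
    tau = 1.0 - T / Tc
    return (a * tau + b * tau**1.5 + c * tau**3 + d * tau**6) / (1.0 - tau) + np.log(pc)

rng = np.random.default_rng(4)
T = np.linspace(300.0, 480.0, 40)
lnp = wagner_lnp(true_coeffs, T) + 0.005 * rng.normal(size=T.size)
lnp[[7, 25]] += 1.0                      # injected gross errors

# Robust fit: Huber loss downweights the gross errors instead of chasing them.
fit = least_squares(lambda th: wagner_lnp(th, T) - lnp,
                    x0=np.array([-7.0, 1.0, -2.0, -1.0]),
                    loss='huber', f_scale=0.02)
is_outlier = np.abs(wagner_lnp(fit.x, T) - lnp) > 0.1
```

Because the residual is linear in the coefficients, the Huber objective here is convex and the fit cannot be trapped in a spurious local minimum.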