Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
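The supplied Matlab code implements the method with customisable settings; purely as an illustration of the two-stage idea, a minimal Python sketch is given below. The array shape, the use of fixed z-style thresholds in place of t-statistic significance levels, and the window handling are simplifying assumptions, not the authors' implementation.

```python
# Minimal sketch (not the supplied Matlab code): two-stage moving-window
# outlier screening for intra-participant time-series cycles.
# Assumed input: cycles is an (n_cycles, n_timepoints) array of time-normalised strides.
import numpy as np

def detect_outlier_cycles(cycles, mad_scale=3.5, sd_scale=3.0, window=5):
    cycles = np.asarray(cycles, dtype=float)
    keep = np.ones(cycles.shape[0], dtype=bool)

    # Stage 1: one-dimensional (spatial) outliers at each time point,
    # flagged with the median absolute deviation (MAD).
    med = np.median(cycles, axis=0)
    mad = np.median(np.abs(cycles - med), axis=0) + 1e-12
    spatial = np.abs(cycles - med) / (1.4826 * mad) > mad_scale
    keep &= ~spatial.any(axis=1)

    # Stage 2: two-dimensional (spatial-temporal) outliers using a
    # moving-window standard deviation along the time axis.
    for start in range(cycles.shape[1] - window + 1):
        seg = cycles[:, start:start + window].mean(axis=1)
        z = np.abs(seg - np.mean(seg[keep])) / (np.std(seg[keep], ddof=1) + 1e-12)
        keep &= z <= sd_scale
    return keep  # boolean mask of cycles to retain

# Example: 40 strides x 101 normalised time points of synthetic data
strides = np.random.randn(40, 101)
mask = detect_outlier_cycles(strides)
```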
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that most characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
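As a hedged illustration of this workflow (not the study's actual model), the sketch below trains a small dense autoencoder on valid samples and then classifies ambiguous samples in the latent space with a k-nearest-neighbour classifier. The feature dimension, latent size, and placeholder arrays are assumptions.

```python
# Sketch: learn a latent representation of valid samples, then classify
# ambiguous samples in that latent space with a standard vector classifier.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.neighbors import KNeighborsClassifier

n_features, latent_dim = 64, 4                                   # assumed dimensions
x_valid = np.random.rand(1000, n_features).astype("float32")     # placeholder valid samples

inp = layers.Input(shape=(n_features,))
z = layers.Dense(32, activation="relu")(inp)
z = layers.Dense(latent_dim, activation="linear", name="latent")(z)
out = layers.Dense(32, activation="relu")(z)
out = layers.Dense(n_features, activation="linear")(out)

ae = Model(inp, out)
ae.compile(optimizer="adam", loss="mse")
ae.fit(x_valid, x_valid, epochs=20, batch_size=64, verbose=0)    # reconstruct valid data

encoder = Model(inp, ae.get_layer("latent").output)

# Classify ambiguous samples in latent space using labelled valid/invalid examples.
x_labelled = np.random.rand(200, n_features).astype("float32")   # placeholder
y_labelled = np.random.randint(0, 2, 200)                        # 1 = valid, 0 = invalid
clf = KNeighborsClassifier(n_neighbors=5).fit(encoder.predict(x_labelled), y_labelled)

x_ambiguous = np.random.rand(50, n_features).astype("float32")   # placeholder
labels = clf.predict(encoder.predict(x_ambiguous))
```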
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate overdispersion statistic, generate summary statistics, remove outliers
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outlier removal is a fundamental data processing task to ensure the quality of scanned point cloud data (PCD), which is becoming increasingly important in industrial applications and reverse engineering. Acquired scanned PCD is usually noisy, sparse and temporally incoherent, so processing scanned data is typically an ill-posed problem. In this paper, we present a simple and effective method based on two geometric constraints to trim the noisy points. One constraint uses local density information and the other uses the deviation from the local fitting plane. The local density based method provides a preprocessing step, which removes sparse outliers and isolated outliers. The non-isolated outlier removal in this paper depends on a local projection method, which projects points onto the object surface. The deviation of any point from the local fitting plane then serves as a further criterion for removing noisy points. The experimental results demonstrate the ability to remove noisy points from various man-made objects containing complex outliers.
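A minimal sketch of the two geometric criteria described above is given below, assuming the scanned PCD is an (n, 3) NumPy array; the neighbourhood size and thresholds are illustrative choices, not the paper's parameters.

```python
# Sketch: trim points by (1) local density and (2) deviation from the local fitting plane.
import numpy as np
from scipy.spatial import cKDTree

def trim_outliers(points, k=16, density_factor=3.0, plane_factor=3.0):
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)      # first neighbour is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)

    # Criterion 1: local density -- points whose mean neighbour distance is far
    # above average are treated as sparse or isolated outliers.
    dense = mean_dist < mean_dist.mean() + density_factor * mean_dist.std()

    # Criterion 2: deviation from the local fitting plane (least squares via SVD).
    plane_dev = np.empty(len(points))
    for i, nb in enumerate(idx):
        nbrs = points[nb[1:]]
        centred = nbrs - nbrs.mean(axis=0)
        normal = np.linalg.svd(centred)[2][-1]    # direction of least variance = plane normal
        plane_dev[i] = abs(np.dot(points[i] - nbrs.mean(axis=0), normal))
    flat = plane_dev < plane_dev.mean() + plane_factor * plane_dev.std()

    return points[dense & flat]

cloud = np.random.rand(5000, 3)                   # placeholder for a scanned PCD
cleaned = trim_outliers(cloud)
```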
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for official interview of the VHFPS survey and the small module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code is most suitable for that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether it needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie outside the 5th–95th percentile range, by listening to interview recordings (a minimal screening sketch is given after this list).
• Final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
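As a hedged illustration of the percentile-based outlier screening in the list above (not the survey team's actual scripts), the following sketch flags values outside the 5th–95th percentile range for manual review; the column names are placeholders.

```python
# Sketch: flag values outside the 5th-95th percentile range for manual checking.
import pandas as pd

def flag_outliers(df, column, lower=0.05, upper=0.95):
    lo, hi = df[column].quantile([lower, upper])
    mask = (df[column] < lo) | (df[column] > hi)
    return df.loc[mask, ["household_id", column]]   # rows to re-check against recordings

survey = pd.DataFrame({"household_id": range(6),
                       "monthly_income": [200, 220, 250, 240, 5000, 230]})
to_review = flag_outliers(survey, "monthly_income")
print(to_review)   # extreme values are flagged for manual verification
```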
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The MNB model achieves 97% accuracy, although it is outperformed by the GNB classifier and the RF, each of which reports 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certification on data collection and the right to confidentiality and privacy, overseen by the Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory-confirmed diagnosis result. The data were divided into two tables: the first table, called data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals for 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic record.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2.
The dataset was split into two parts for the purpose of this research: 70% for training and 30% for testing. Then, using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using standard metrics.
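As an illustrative sketch of this step (not the study's actual code), the snippet below performs a 70/30 split and trains a multinomial Naive Bayes classifier with scikit-learn; the feature matrix and labels are synthetic placeholders.

```python
# Sketch: 70/30 split and multinomial Naive Bayes training with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

X = np.random.randint(0, 5, size=(500, 15))   # 15 attributes, non-negative counts/codes
y = np.random.randint(0, 2, size=500)         # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```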
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in a database test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part shall be classified with classifier 2 as follows:
Classifier phase 1: It classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: It classifies only the records that classifier 1 has labelled positive, further classifying them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed such that the core parameters acting as determining factors supply their values.
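A hedged sketch of the two-phase scheme follows: a first classifier labels records positive or negative, and a second classifier, trained only on positives, labels the predicted positives as complicated or uncomplicated. The data, labels, and the choice of MultinomialNB for both phases are illustrative assumptions.

```python
# Sketch: chain two classifiers -- phase 1 (positive/negative), phase 2 (complicated/uncomplicated).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, size=(400, 15))             # synthetic placeholder features
y_malaria = np.random.randint(0, 2, size=400)            # phase 1 labels: 1 = positive
y_severity = np.random.randint(0, 2, size=400)           # phase 2 labels: 1 = complicated

clf1 = MultinomialNB().fit(X, y_malaria)
pos_mask = y_malaria == 1
clf2 = MultinomialNB().fit(X[pos_mask], y_severity[pos_mask])   # trained on positives only

X_new = np.random.randint(0, 5, size=(10, 15))
phase1 = clf1.predict(X_new)
phase2 = np.where(phase1 == 1, clf2.predict(X_new), -1)  # -1 = not applicable (negative cases)
```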
The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compil...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our research demonstrates that machine learning algorithms can effectively predict heart failure, highlighting high-accuracy models that improve detection and treatment. The Kaggle “Heart Failure” dataset, with 918 instances and 12 key features, was preprocessed to remove outliers and features a distribution of cases with and without heart disease (508 and 410). Five models were evaluated: the random forest achieved the highest accuracy (92%) and was consolidated as the most effective at classifying cases. Logistic regression and multilayer perceptron were also quite accurate (89%), while decision tree and k-nearest neighbors performed less well, showing that k-neighbors is less suitable for this data. F1 scores confirmed the random forest as the optimal one, benefiting from preprocessing and hyperparameter tuning. The data analysis revealed that age, blood pressure and cholesterol correlate with disease risk, suggesting that these models may help prioritize patients at risk and improve their preventive management. The research underscores the potential of these models in clinical practice to improve diagnostic accuracy and reduce costs, supporting informed medical decisions and improving health outcomes.
The Javalambre Photometric Local Universe Survey (J-PLUS) is an observational campaign that aims to obtain photometry in 12 ultraviolet-visible filters (0.3-1um) over ~8500deg^2^ of the sky observable from Javalambre (Teruel, Spain). Due to its characteristics and observation strategy, this survey will allow a great number of Solar System small bodies to be analyzed, and with improved spectrophotometric resolution with respect to previous large-area photometric surveys in optical wavelengths. The main goal of the present work is to present the first catalog of magnitudes and colors of minor bodies of the Solar System compiled using the first data release (DR1) of the J-PLUS observational campaign: the Moving Objects Observed from Javalambre (MOOJa) catalog. Using the compiled photometric data we obtained very-low-resolution reflectance (photo)spectra of the asteroids. We first used a σ-clipping algorithm in order to remove outliers and clean the data. We then devised a method to select the optimal solar colors in the J-PLUS photometric system. These solar colors were computed using two different approaches: on one hand, we used different spectra of the Sun convolved with the filter transmissions of the J-PLUS system, and on the other, we selected a group of solar-type stars in the J-PLUS DR1 according to their computed stellar parameters. Finally, we used the solar colors to obtain the reflectance spectra of the asteroids. We present photometric data in the J-PLUS filters for a total of 3122 minor bodies (3666 before outlier removal), and we discuss the main issues with the data, as well as some guidelines to solve them.
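As a hedged illustration of the sigma-clipping step (the catalog's actual thresholds and implementation are not stated here), a minimal iterative clipping routine might look as follows; astropy.stats.sigma_clip offers a ready-made alternative.

```python
# Sketch: iterative sigma-clipping to flag outlying photometric points.
import numpy as np

def sigma_clip(values, n_sigma=3.0, max_iter=5):
    values = np.asarray(values, dtype=float)
    mask = np.ones(values.size, dtype=bool)
    for _ in range(max_iter):
        mu, sigma = values[mask].mean(), values[mask].std(ddof=1)
        new_mask = np.abs(values - mu) <= n_sigma * sigma
        if np.array_equal(new_mask, mask):
            break
        mask = new_mask
    return mask  # True for points kept after clipping

mags = np.array([17.2, 17.3, 17.1, 17.25, 21.9, 17.15])  # one spurious magnitude
kept = sigma_clip(mags)
```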
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a CSV data file, which includes baseline data collected before participants were exposed to the intervention, post-test data after participants were exposed to the intervention, and data from the control group. These data have been cleaned to remove outliers and to remove identifying information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contain bathymetric data from the Namibia continental slope. The data were acquired on R/V Meteor research expedition M76/1 in 2008, and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and especially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam 12 kHz Kongsberg EM120 system. The data were processed using the public software package MBSystems. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data. Initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
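Purely as an illustration of Gaussian-weighted-mean gridding (not the MBSystems processing used for these data), the sketch below grids scattered soundings onto regular cells; the cell size, search radius, and synthetic soundings are assumptions.

```python
# Sketch: grid scattered depth soundings with a Gaussian-weighted mean per cell.
import numpy as np
from scipy.spatial import cKDTree

def gaussian_weighted_grid(x, y, z, cell=10.0, radius=30.0):
    xi = np.arange(x.min(), x.max() + cell, cell)
    yi = np.arange(y.min(), y.max() + cell, cell)
    gx, gy = np.meshgrid(xi, yi)
    tree = cKDTree(np.column_stack([x, y]))
    grid = np.full(gx.shape, np.nan)
    sigma = radius / 3.0
    for idx, (cx, cy) in enumerate(zip(gx.ravel(), gy.ravel())):
        nbrs = tree.query_ball_point([cx, cy], radius)
        if nbrs:
            d = np.hypot(x[nbrs] - cx, y[nbrs] - cy)
            w = np.exp(-0.5 * (d / sigma) ** 2)          # Gaussian distance weights
            grid.ravel()[idx] = np.sum(w * z[nbrs]) / np.sum(w)
    return xi, yi, grid

# Example with synthetic soundings
x, y = np.random.rand(2, 10000) * 1000.0
z = -1000.0 - 50.0 * np.sin(x / 100.0) + np.random.randn(10000)
xi, yi, depth_grid = gaussian_weighted_grid(x, y, z, cell=10.0, radius=30.0)
```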
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was explored using visuals such as scatter plots and histograms and examined for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development
The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters including learning rate, batch size, and optimizer were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
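As a hedged sketch of the kind of model described above (not the project's actual architecture), a small Keras CNN with dropout, L2 regularization, early stopping, and checkpointing might look as follows; the input size, layer widths, and hyperparameters are illustrative.

```python
# Sketch: a small binary CNN classifier (healthy vs. diseased cucumber images).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                      # regularization to limit overfitting
    layers.Dense(64, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    layers.Dense(1, activation="sigmoid"),    # healthy vs. diseased
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# Typical training call with early stopping and checkpoints (train_ds/val_ds are
# assumed to be prepared tf.data pipelines of images and labels):
# callbacks = [tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
#              tf.keras.callbacks.ModelCheckpoint("best_model.keras")]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```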
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.
Conclusion
Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Reservoir operations face persistent challenges due to increasing water demand, more frequent extreme events, and stricter environmental requirements. Historical operation records are crucial to investigating real-world reservoir operations, which integrate prescribed operation rules, empirical knowledge of operators, and regulatory response to extreme events. This dataset offers processed daily operation records—including inflow, outflow, and storage—for 256 major reservoirs across the Contiguous United States (CONUS) from 1990 to 2019. The reservoirs were selected from the dataset of Li et al. (2023), which includes 452 reservoirs, based on two criteria: (1) a minimum of 25 years of records (starting in 1990 and ending in 2014 or later), (2) less than 10% missing data during the study period. To enhance data quality, we remove the outliers of storage data with abnormal sudden storage changes even when inflows remain stable, and use linear interpolation to fill missing values, resulting in continuous daily records. Additionally, daily water surface elevation data are included for 217 of the 256 reservoirs. Related findings on changes in reservoir storage and operations are published in Chen and Cai (2025, Water Resources Research).
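As a hedged illustration of this cleaning step (not the dataset's actual procedure), the sketch below flags storage values that change abruptly while inflow stays stable and fills the resulting gaps by linear interpolation; thresholds and column names are assumptions.

```python
# Sketch: remove abnormal storage jumps that occur while inflow is stable,
# then linearly interpolate the resulting gaps.
import pandas as pd
import numpy as np

def clean_storage(df, storage_jump=0.2, inflow_stable=0.05):
    df = df.copy()
    ds = df["storage"].pct_change().abs()      # relative day-to-day storage change
    di = df["inflow"].pct_change().abs()       # relative day-to-day inflow change
    suspect = (ds > storage_jump) & (di < inflow_stable)
    df.loc[suspect, "storage"] = np.nan        # drop suspicious values
    df["storage"] = df["storage"].interpolate(method="linear")
    return df

records = pd.DataFrame({
    "inflow":  [100, 102, 101, 99, 100, 98],
    "storage": [5000, 5010, 9000, 5020, 5025, 5030],   # 9000 is a spurious jump
})
cleaned = clean_storage(records)
```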
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.
Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.
Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet
Pivot tables were created to summarise data and create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables' and 'summaries' sheets.
Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Hydrochemistry QA QC GAL v02
Derived From QLD DNRM Galilee Mine Groundwater Bores - Water Levels
Derived From Galilee bore water levels v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is updated more frequently and can be visualized on NCWQR's data portal.
If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.
The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.
Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.
At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.
Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.
2017 Ohio EPA Project Study Plan and Quality Assurance Plan
Data quality control and data screening
The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.
This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.
A note on detection limits and zero and negative concentrations
It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.
The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “less than the detection limit”.
Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.
For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
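A small numerical illustration of the censoring bias described above: with a true concentration near zero, replacing negative readings with 0, or dropping them, shifts the computed average upward. The noise level and sample size below are arbitrary.

```python
# Sketch: demonstrate the upward bias introduced by censoring negative readings.
import numpy as np

rng = np.random.default_rng(0)
true_conc = 0.001                                       # essentially zero
measured = true_conc + rng.normal(0.0, 0.01, 100000)    # true value plus instrument noise

mean_all = measured.mean()                              # unbiased, close to 0.001
mean_censored = np.where(measured < 0, 0.0, measured).mean()   # biased upward
mean_dropped = measured[measured >= 0].mean()           # even more biased

print(round(mean_all, 5), round(mean_censored, 5), round(mean_dropped, 5))
```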
Analyte Detection Limits
https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024
For more information, please visit https://ncwqr.org/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This composite repository contains high-frequency data of discharge, electrical conductivity, nitrate-N, DOC and water temperature obtained in the Rappbode headwater catchment in the Harz mountains, Germany. This catchment was affected by a bark-beetle infestation and forest dieback from 2018 onwards. The data extend previous observations from the same catchment (RB) published as part of Musolff (2020). Details on the catchment can be found here: Werner et al. (2019, 2021), Musolff et al. (2021). The file RB_HF_data_2018_2023.txt states measurements for each timestep using the following columns: "index" (number of observation), "Date.Time" (timestamp in YYYY-MM-DD HH:MM:SS), "WT" (water temperature in degree celsius), "Q.smooth" (discharge in mm/d smoothed using moving average), "NO3.smooth" (nitrate concentrations in mg N/L smoothed using moving average), "DOC.smooth" (dissolved organic carbon concentrations in mg/L, smoothed using moving average), "EC.smooth" (electrical conductivity in µS/cm smoothed using moving average); NA - no data.
Water quality data and discharge were measured at a high-frequency interval of 15 min in the time period between January 2018 and August 2023. Both NO3-N and DOC were measured using an in-situ UV-VIS probe (s::can spectrolyser, scan Austria). EC was measured using an in-situ probe (CTD Diver, Van Essen Canada). Discharge measurements relied on an established stage-discharge relationship based on water level observations (CTD Diver, Van Essen Canada, see Werner et al. [2019]). Data loggers were maintained every two weeks, including manual cleaning of the UV-VIS probes and grab sampling for subsequent lab analysis, calibration and validation.
Data preparation included five steps: drift correction, outlier detection, gap filling, calibration and moving averaging. - Drift was corrected by distributing the offset between mean values one hour before and after cleaning equally across the two-week maintenance interval as an exponential growth. - Outliers were detected with a two-step procedure. First, values outside a physically unlikely range were removed. Second, the Grubbs test, to detect and remove outliers, was applied to a moving window of 100 values. - Data gaps smaller than two hours were filled using cubic spline interpolation. - The resulting time series were globally calibrated against the lab-measured concentrations of NO3-N and DOC. EC was calibrated against field values obtained with a handheld WTW probe (WTW Multi 430, Xylem Analytics Germany). - Noise in the signal of both discharge and water quality was reduced by a moving average with a window length of 2.5 hours.
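As a hedged sketch of the range check and windowed Grubbs test described above (not the authors' processing scripts), the snippet below applies a Grubbs test to consecutive blocks of 100 values; the physical limits, significance level, and the use of non-overlapping blocks rather than a sliding window are simplifications.

```python
# Sketch: (1) remove values outside a physically plausible range,
# (2) apply a Grubbs test to blocks of 100 values to flag single outliers.
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return the index of the most extreme value if it fails the Grubbs test, else None."""
    n = len(x)
    if n < 3:
        return None
    mean, sd = np.mean(x), np.std(x, ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if g > g_crit else None

def clean_series(values, lower=0.0, upper=50.0, window=100):
    values = np.array(values, dtype=float)
    values[(values < lower) | (values > upper)] = np.nan     # physically unlikely range
    for start in range(0, len(values) - window + 1, window):
        seg = values[start:start + window]                    # view into values
        ok = ~np.isnan(seg)
        bad = grubbs_outlier(seg[ok])
        if bad is not None:
            seg[np.where(ok)[0][bad]] = np.nan
    return values   # NaNs can then be gap-filled, e.g. by spline interpolation

no3 = np.clip(np.random.normal(3.0, 0.3, 1000), 0, None)
no3[123] = 15.0                                              # inject a spike
cleaned = clean_series(no3, lower=0.0, upper=20.0, window=100)
```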
References: Musolff, A. (2020). High frequency dataset for event-scale concentration-discharge analysis. https://doi.org/http://www.hydroshare.org/resource/27c93a3f4ee2467691a1671442e047b8 Musolff, A., Zhan, Q., Dupas, R., Minaudo, C., Fleckenstein, J. H., Rode, M., Dehaspe, J., & Rinke, K. (2021). Spatial and Temporal Variability in Concentration-Discharge Relationships at the Event Scale. Water Resources Research, 57(10). Werner, B. J., A. Musolff, O. J. Lechtenfeld, G. H. de Rooij, M. R. Oosterwoud, and J. H. Fleckenstein (2019), High-frequency measurements explain quantity and quality of dissolved organic carbon mobilization in a headwater catchment, Biogeosciences, 16(22), 4497-4516. Werner, B. J., Lechtenfeld, O. J., Musolff, A., de Rooij, G. H., Yang, J., Grundling, R., Werban, U., & Fleckenstein, J. H. (2021). Small-scale topography explains patterns and dynamics of dissolved organic carbon exports from the riparian zone of a temperate, forested catchment. Hydrology and Earth System Sciences, 25(12), 6067-6086.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The seafloor observation network experiment system in the South China Sea was completed in September 2016. This system provided energy supply and a communication transmission channel through an optical fiber composite power cable for the deep-ocean observation platform, enabling multi-parameter, real-time, and continuous ocean observation. The subsea dynamic platform with CTD and ADCP was deployed in June 2017, and collection of observation data started in July 2017, including temperature, conductivity, and water pressure from the CTD and velocity from the ADCP. Based on the raw observation data collected by the ADCP and CTD sensors from July 2017 to December 2018, a data processing and quality control algorithm was applied to remove outliers, fill missing values, and format the data, producing the final dataset. The dataset consists of 4 data files in total: Ocean dynamic datasets of South China Sea 2017 - ADCP.CSV, totaling 1.12 MB, Ocean dynamic datasets of South China Sea 2018 - ADCP.CSV, totaling 2.24 MB, Ocean dynamic datasets of South China Sea 2017 - CTD.CSV, totaling 35.6 MB, Ocean dynamic datasets of South China Sea 2018 - CTD.CSV, totaling 73 MB.
Single-wing images were captured from 14,354 pairs of field-collected tsetse wings of the species Glossina pallidipes and G. m. morsitans and analysed together with relevant biological data. To answer research questions regarding these flies, 11 anatomical landmark coordinates need to be located on each wing. Manual location of landmarks is time-consuming, prone to error, and simply infeasible given the number of images. Automatic landmark detection has been proposed to locate these landmark coordinates. We developed a two-tier method using deep learning architectures to classify images and make accurate landmark predictions. The first tier used a classification convolutional neural network to remove most wings that were missing landmarks. The second tier provided landmark coordinates for the remaining wings. For the second tier, we compared direct coordinate regression using a convolutional neural network with segmentation using a fully convolutional network. For the resulting landmark pred...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after removing noise, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal operation can be performed in ImageJ as follows: Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene-expression deconvolution is used to quantify different types of cells in a mixed population. It provides a highly promising solution to rapidly characterize the tumor-infiltrating immune landscape and identify cold cancers. However, a major challenge is that gene-expression data are frequently contaminated by many outliers that decrease the estimation accuracy. Thus, it is imperative to develop a robust deconvolution method that automatically decontaminates data by reliably detecting and removing outliers. We developed a new machine learning tool, Fast And Robust DEconvolution of Expression Profiles (FARDEEP), to enumerate immune cell subsets from whole tumor tissue samples. To reduce noise in the tumor gene expression datasets, FARDEEP utilizes an adaptive least trimmed square to automatically detect and remove outliers before estimating the cell compositions. We show that FARDEEP is less susceptible to outliers and returns a better estimation of coefficients than the existing methods with both numerical simulations and real datasets. FARDEEP provides an estimate related to the absolute quantity of each immune cell subset in addition to relative percentages. Hence, FARDEEP represents a novel robust algorithm to complement the existing toolkit for the characterization of tissue-infiltrating immune cell landscape. The source code for FARDEEP is implemented in R and available for download at https://github.com/YuningHao/FARDEEP.git.