Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
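The supplied Matlab code implements the method with customisable settings; purely as an illustration of the two-stage idea, a minimal Python sketch is given below. The array shape, the use of fixed z-style thresholds in place of t-statistic significance levels, and the window handling are simplifying assumptions, not the authors' implementation.

```python
# Minimal sketch (not the supplied Matlab code): two-stage moving-window
# outlier screening for intra-participant time-series cycles.
# Assumed input: cycles is an (n_cycles, n_timepoints) array of time-normalised strides.
import numpy as np

def detect_outlier_cycles(cycles, mad_scale=3.5, sd_scale=3.0, window=5):
    cycles = np.asarray(cycles, dtype=float)
    keep = np.ones(cycles.shape[0], dtype=bool)

    # Stage 1: one-dimensional (spatial) outliers at each time point,
    # flagged with the median absolute deviation (MAD).
    med = np.median(cycles, axis=0)
    mad = np.median(np.abs(cycles - med), axis=0) + 1e-12
    spatial = np.abs(cycles - med) / (1.4826 * mad) > mad_scale
    keep &= ~spatial.any(axis=1)

    # Stage 2: two-dimensional (spatial-temporal) outliers using a
    # moving-window standard deviation along the time axis.
    for start in range(cycles.shape[1] - window + 1):
        seg = cycles[:, start:start + window].mean(axis=1)
        z = np.abs(seg - np.mean(seg[keep])) / (np.std(seg[keep], ddof=1) + 1e-12)
        keep &= z <= sd_scale
    return keep  # boolean mask of cycles to retain

# Example: 40 strides x 101 normalised time points of synthetic data
strides = np.random.randn(40, 101)
mask = detect_outlier_cycles(strides)
```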
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that most characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
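As a hedged illustration of this workflow (not the study's actual model), the sketch below trains a small dense autoencoder on valid samples and then classifies ambiguous samples in the latent space with a k-nearest-neighbour classifier. The feature dimension, latent size, and placeholder arrays are assumptions.

```python
# Sketch: learn a latent representation of valid samples, then classify
# ambiguous samples in that latent space with a standard vector classifier.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.neighbors import KNeighborsClassifier

n_features, latent_dim = 64, 4                                   # assumed dimensions
x_valid = np.random.rand(1000, n_features).astype("float32")     # placeholder valid samples

inp = layers.Input(shape=(n_features,))
z = layers.Dense(32, activation="relu")(inp)
z = layers.Dense(latent_dim, activation="linear", name="latent")(z)
out = layers.Dense(32, activation="relu")(z)
out = layers.Dense(n_features, activation="linear")(out)

ae = Model(inp, out)
ae.compile(optimizer="adam", loss="mse")
ae.fit(x_valid, x_valid, epochs=20, batch_size=64, verbose=0)    # reconstruct valid data

encoder = Model(inp, ae.get_layer("latent").output)

# Classify ambiguous samples in latent space using labelled valid/invalid examples.
x_labelled = np.random.rand(200, n_features).astype("float32")   # placeholder
y_labelled = np.random.randint(0, 2, 200)                        # 1 = valid, 0 = invalid
clf = KNeighborsClassifier(n_neighbors=5).fit(encoder.predict(x_labelled), y_labelled)

x_ambiguous = np.random.rand(50, n_features).astype("float32")   # placeholder
labels = clf.predict(encoder.predict(x_ambiguous))
```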
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate overdispersion statistic, generate summary statistics, remove outliers
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outlier removal is a fundamental data processing task to ensure the quality of scanned point cloud data (PCD), which is becoming increasingly important in industrial applications and reverse engineering. Acquired scanned PCD is usually noisy, sparse and temporally incoherent, so processing scanned data is typically an ill-posed problem. In this paper, we present a simple and effective method based on two geometric constraints to trim the noisy points. One constraint uses local density information and the other uses the deviation from the local fitting plane. The local density based method provides a preprocessing step, which removes sparse outliers and isolated outliers. The non-isolated outlier removal in this paper depends on a local projection method, which projects points onto the object surface. The deviation of any point from the local fitting plane then serves as a further criterion for removing noisy points. The experimental results demonstrate the ability to remove noisy points from various man-made objects containing complex outliers.
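A minimal sketch of the two geometric criteria described above is given below, assuming the scanned PCD is an (n, 3) NumPy array; the neighbourhood size and thresholds are illustrative choices, not the paper's parameters.

```python
# Sketch: trim points by (1) local density and (2) deviation from the local fitting plane.
import numpy as np
from scipy.spatial import cKDTree

def trim_outliers(points, k=16, density_factor=3.0, plane_factor=3.0):
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)      # first neighbour is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)

    # Criterion 1: local density -- points whose mean neighbour distance is far
    # above average are treated as sparse or isolated outliers.
    dense = mean_dist < mean_dist.mean() + density_factor * mean_dist.std()

    # Criterion 2: deviation from the local fitting plane (least squares via SVD).
    plane_dev = np.empty(len(points))
    for i, nb in enumerate(idx):
        nbrs = points[nb[1:]]
        centred = nbrs - nbrs.mean(axis=0)
        normal = np.linalg.svd(centred)[2][-1]    # direction of least variance = plane normal
        plane_dev[i] = abs(np.dot(points[i] - nbrs.mean(axis=0), normal))
    flat = plane_dev < plane_dev.mean() + plane_factor * plane_dev.std()

    return points[dense & flat]

cloud = np.random.rand(5000, 3)                   # placeholder for a scanned PCD
cleaned = trim_outliers(cloud)
```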
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for official interview of the VHFPS survey and the small module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code is most suitable for that answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether it needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values that lie outside the 5th–95th percentile range, by listening to interview recordings (a minimal screening sketch is given after this list).
• Final check on matching the main dataset with the different sections; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
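As a hedged illustration of the percentile-based outlier screening in the list above (not the survey team's actual scripts), the following sketch flags values outside the 5th–95th percentile range for manual review; the column names are placeholders.

```python
# Sketch: flag values outside the 5th-95th percentile range for manual checking.
import pandas as pd

def flag_outliers(df, column, lower=0.05, upper=0.95):
    lo, hi = df[column].quantile([lower, upper])
    mask = (df[column] < lo) | (df[column] > hi)
    return df.loc[mask, ["household_id", column]]   # rows to re-check against recordings

survey = pd.DataFrame({"household_id": range(6),
                       "monthly_income": [200, 220, 250, 240, 5000, 230]})
to_review = flag_outliers(survey, "monthly_income")
print(to_review)   # extreme values are flagged for manual verification
```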
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The MNB model achieves 97% accuracy, although it is outperformed by the GNB classifier and the RF, each of which reports 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certification on data collection and the right to confidentiality and privacy, overseen by the Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory-confirmed diagnosis result. The data were divided into two tables: the first table, called data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals for 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic record.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2.
The dataset was split into two parts for the purpose of this research: 70% for training and 30% for testing. Then, using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using standard metrics.
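As an illustrative sketch of this step (not the study's actual code), the snippet below performs a 70/30 split and trains a multinomial Naive Bayes classifier with scikit-learn; the feature matrix and labels are synthetic placeholders.

```python
# Sketch: 70/30 split and multinomial Naive Bayes training with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

X = np.random.randint(0, 5, size=(500, 15))   # 15 attributes, non-negative counts/codes
y = np.random.randint(0, 2, size=500)         # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```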
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in a database test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part shall be classified with classifier 2 as follows:
Classifier phase 1: It classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: It classifies only the records that classifier 1 has labelled positive, further classifying them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed such that the core parameters acting as determining factors supply their values.
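A hedged sketch of the two-phase scheme follows: a first classifier labels records positive or negative, and a second classifier, trained only on positives, labels the predicted positives as complicated or uncomplicated. The data, labels, and the choice of MultinomialNB for both phases are illustrative assumptions.

```python
# Sketch: chain two classifiers -- phase 1 (positive/negative), phase 2 (complicated/uncomplicated).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, size=(400, 15))             # synthetic placeholder features
y_malaria = np.random.randint(0, 2, size=400)            # phase 1 labels: 1 = positive
y_severity = np.random.randint(0, 2, size=400)           # phase 2 labels: 1 = complicated

clf1 = MultinomialNB().fit(X, y_malaria)
pos_mask = y_malaria == 1
clf2 = MultinomialNB().fit(X[pos_mask], y_severity[pos_mask])   # trained on positives only

X_new = np.random.randint(0, 5, size=(10, 15))
phase1 = clf1.predict(X_new)
phase2 = np.where(phase1 == 1, clf2.predict(X_new), -1)  # -1 = not applicable (negative cases)
```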
The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compil...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our research demonstrates that machine learning algorithms can effectively predict heart failure, highlighting high-accuracy models that improve detection and treatment. The Kaggle “Heart Failure” dataset, with 918 instances and 12 key features, was preprocessed to remove outliers and features a distribution of cases with and without heart disease (508 and 410). Five models were evaluated: the random forest achieved the highest accuracy (92%) and was consolidated as the most effective at classifying cases. Logistic regression and multilayer perceptron were also quite accurate (89%), while decision tree and k-nearest neighbors performed less well, showing that k-neighbors is less suitable for this data. F1 scores confirmed the random forest as the optimal one, benefiting from preprocessing and hyperparameter tuning. The data analysis revealed that age, blood pressure and cholesterol correlate with disease risk, suggesting that these models may help prioritize patients at risk and improve their preventive management. The research underscores the potential of these models in clinical practice to improve diagnostic accuracy and reduce costs, supporting informed medical decisions and improving health outcomes.
The Javalambre Photometric Local Universe Survey (J-PLUS) is an observational campaign that aims to obtain photometry in 12 ultraviolet-visible filters (0.3-1um) over ~8500deg^2^ of the sky observable from Javalambre (Teruel, Spain). Due to its characteristics and observation strategy, this survey will allow a great number of Solar System small bodies to be analyzed, and with improved spectrophotometric resolution with respect to previous large-area photometric surveys in optical wavelengths. The main goal of the present work is to present the first catalog of magnitudes and colors of minor bodies of the Solar System compiled using the first data release (DR1) of the J-PLUS observational campaign: the Moving Objects Observed from Javalambre (MOOJa) catalog. Using the compiled photometric data we obtained very-low-resolution reflectance (photo)spectra of the asteroids. We first used a σ-clipping algorithm in order to remove outliers and clean the data. We then devised a method to select the optimal solar colors in the J-PLUS photometric system. These solar colors were computed using two different approaches: on one hand, we used different spectra of the Sun convolved with the filter transmissions of the J-PLUS system, and on the other, we selected a group of solar-type stars in the J-PLUS DR1 according to their computed stellar parameters. Finally, we used the solar colors to obtain the reflectance spectra of the asteroids. We present photometric data in the J-PLUS filters for a total of 3122 minor bodies (3666 before outlier removal), and we discuss the main issues with the data, as well as some guidelines to solve them.
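As a hedged illustration of the sigma-clipping step (the catalog's actual thresholds and implementation are not stated here), a minimal iterative clipping routine might look as follows; astropy.stats.sigma_clip offers a ready-made alternative.

```python
# Sketch: iterative sigma-clipping to flag outlying photometric points.
import numpy as np

def sigma_clip(values, n_sigma=3.0, max_iter=5):
    values = np.asarray(values, dtype=float)
    mask = np.ones(values.size, dtype=bool)
    for _ in range(max_iter):
        mu, sigma = values[mask].mean(), values[mask].std(ddof=1)
        new_mask = np.abs(values - mu) <= n_sigma * sigma
        if np.array_equal(new_mask, mask):
            break
        mask = new_mask
    return mask  # True for points kept after clipping

mags = np.array([17.2, 17.3, 17.1, 17.25, 21.9, 17.15])  # one spurious magnitude
kept = sigma_clip(mags)
```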
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a CSV data file, which includes baseline data collected before participants were exposed to the intervention, post-test data after participants were exposed to the intervention, and data from the control group. These data have been cleaned to remove outliers and to remove identifying information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contain bathymetric data from the Namibia continental slope. The data were acquired on R/V Meteor research expedition M76/1 in 2008, and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and especially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam 12 kHz Kongsberg EM120 system. The data were processed using the public software package MBSystems. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data. Initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
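Purely as an illustration of Gaussian-weighted-mean gridding (not the MBSystems processing used for these data), the sketch below grids scattered soundings onto regular cells; the cell size, search radius, and synthetic soundings are assumptions.

```python
# Sketch: grid scattered depth soundings with a Gaussian-weighted mean per cell.
import numpy as np
from scipy.spatial import cKDTree

def gaussian_weighted_grid(x, y, z, cell=10.0, radius=30.0):
    xi = np.arange(x.min(), x.max() + cell, cell)
    yi = np.arange(y.min(), y.max() + cell, cell)
    gx, gy = np.meshgrid(xi, yi)
    tree = cKDTree(np.column_stack([x, y]))
    grid = np.full(gx.shape, np.nan)
    sigma = radius / 3.0
    for idx, (cx, cy) in enumerate(zip(gx.ravel(), gy.ravel())):
        nbrs = tree.query_ball_point([cx, cy], radius)
        if nbrs:
            d = np.hypot(x[nbrs] - cx, y[nbrs] - cy)
            w = np.exp(-0.5 * (d / sigma) ** 2)          # Gaussian distance weights
            grid.ravel()[idx] = np.sum(w * z[nbrs]) / np.sum(w)
    return xi, yi, grid

# Example with synthetic soundings
x, y = np.random.rand(2, 10000) * 1000.0
z = -1000.0 - 50.0 * np.sin(x / 100.0) + np.random.randn(10000)
xi, yi, depth_grid = gaussian_weighted_grid(x, y, z, cell=10.0, radius=30.0)
```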
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was explored using visuals such as scatter plots and histograms and examined for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development
The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters including learning rate, batch size, and optimizer were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
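As a hedged sketch of the kind of model described above (not the project's actual architecture), a small Keras CNN with dropout, L2 regularization, early stopping, and checkpointing might look as follows; the input size, layer widths, and hyperparameters are illustrative.

```python
# Sketch: a small binary CNN classifier (healthy vs. diseased cucumber images).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                      # regularization to limit overfitting
    layers.Dense(64, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    layers.Dense(1, activation="sigmoid"),    # healthy vs. diseased
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# Typical training call with early stopping and checkpoints (train_ds/val_ds are
# assumed to be prepared tf.data pipelines of images and labels):
# callbacks = [tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
#              tf.keras.callbacks.ModelCheckpoint("best_model.keras")]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```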
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.
Conclusion
Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Reservoir operations face persistent challenges due to increasing water demand, more frequent extreme events, and stricter environmental requirements. Historical operation records are crucial to investigating real-world reservoir operations, which integrate prescribed operation rules, empirical knowledge of operators, and regulatory response to extreme events. This dataset offers processed daily operation records—including inflow, outflow, and storage—for 256 major reservoirs across the Contiguous United States (CONUS) from 1990 to 2019. The reservoirs were selected from the dataset of Li et al. (2023), which includes 452 reservoirs, based on two criteria: (1) a minimum of 25 years of records (starting in 1990 and ending in 2014 or later), (2) less than 10% missing data during the study period. To enhance data quality, we remove the outliers of storage data with abnormal sudden storage changes even when inflows remain stable, and use linear interpolation to fill missing values, resulting in continuous daily records. Additionally, daily water surface elevation data are included for 217 of the 256 reservoirs. Related findings on changes in reservoir storage and operations are published in Chen and Cai (2025, Water Resources Research).
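As a hedged illustration of this cleaning step (not the dataset's actual procedure), the sketch below flags storage values that change abruptly while inflow stays stable and fills the resulting gaps by linear interpolation; thresholds and column names are assumptions.

```python
# Sketch: remove abnormal storage jumps that occur while inflow is stable,
# then linearly interpolate the resulting gaps.
import pandas as pd
import numpy as np

def clean_storage(df, storage_jump=0.2, inflow_stable=0.05):
    df = df.copy()
    ds = df["storage"].pct_change().abs()      # relative day-to-day storage change
    di = df["inflow"].pct_change().abs()       # relative day-to-day inflow change
    suspect = (ds > storage_jump) & (di < inflow_stable)
    df.loc[suspect, "storage"] = np.nan        # drop suspicious values
    df["storage"] = df["storage"].interpolate(method="linear")
    return df

records = pd.DataFrame({
    "inflow":  [100, 102, 101, 99, 100, 98],
    "storage": [5000, 5010, 9000, 5020, 5025, 5030],   # 9000 is a spurious jump
})
cleaned = clean_storage(records)
```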
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.
Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.
Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet
Pivot tables were created to summarise data and create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables' and 'summaries' sheets.
Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Hydrochemistry QA QC GAL v02
Derived From QLD DNRM Galilee Mine Groundwater Bores - Water Levels
Derived From Galilee bore water levels v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is updated more frequently and can be visualized on NCWQR's data portal.
If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.
The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.
Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.
At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.
Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.
2017 Ohio EPA Project Study Plan and Quality Assurance Plan
Data quality control and data screening
The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.
This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.
A note on detection limits and zero and negative concentrations
It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.
The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “less than the detection limit”.
Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.
For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
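A small numerical illustration of the censoring bias described above: with a true concentration near zero, replacing negative readings with 0, or dropping them, shifts the computed average upward. The noise level and sample size below are arbitrary.

```python
# Sketch: demonstrate the upward bias introduced by censoring negative readings.
import numpy as np

rng = np.random.default_rng(0)
true_conc = 0.001                                       # essentially zero
measured = true_conc + rng.normal(0.0, 0.01, 100000)    # true value plus instrument noise

mean_all = measured.mean()                              # unbiased, close to 0.001
mean_censored = np.where(measured < 0, 0.0, measured).mean()   # biased upward
mean_dropped = measured[measured >= 0].mean()           # even more biased

print(round(mean_all, 5), round(mean_censored, 5), round(mean_dropped, 5))
```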
Analyte Detection Limits
https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024
For more information, please visit https://ncwqr.org/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This composite repository contains high-frequency data of discharge, electrical conductivity, nitrate-N, DOC and water temperature obtained in the Rappbode headwater catchment in the Harz mountains, Germany. This catchment was affected by a bark-beetle infestation and forest dieback from 2018 onwards. The data extend previous observations from the same catchment (RB) published as part of Musolff (2020). Details on the catchment can be found here: Werner et al. (2019, 2021), Musolff et al. (2021). The file RB_HF_data_2018_2023.txt states measurements for each timestep using the following columns: "index" (number of observation), "Date.Time" (timestamp in YYYY-MM-DD HH:MM:SS), "WT" (water temperature in degree celsius), "Q.smooth" (discharge in mm/d smoothed using moving average), "NO3.smooth" (nitrate concentrations in mg N/L smoothed using moving average), "DOC.smooth" (dissolved organic carbon concentrations in mg/L, smoothed using moving average), "EC.smooth" (electrical conductivity in µS/cm smoothed using moving average); NA - no data.
Water quality data and discharge were measured at a high-frequency interval of 15 min in the time period between January 2018 and August 2023. Both NO3-N and DOC were measured using an in-situ UV-VIS probe (s::can spectrolyser, scan Austria). EC was measured using an in-situ probe (CTD Diver, Van Essen Canada). Discharge measurements relied on an established stage-discharge relationship based on water level observations (CTD Diver, Van Essen Canada, see Werner et al. [2019]). Data loggers were maintained every two weeks, including manual cleaning of the UV-VIS probes and grab sampling for subsequent lab analysis, calibration and validation.
Data preparation included five steps: drift correction, outlier detection, gap filling, calibration and moving averaging. - Drift was corrected by distributing the offset between mean values one hour before and after cleaning equally across the two-week maintenance interval as an exponential growth. - Outliers were detected with a two-step procedure. First, values outside a physically unlikely range were removed. Second, the Grubbs test, to detect and remove outliers, was applied to a moving window of 100 values. - Data gaps smaller than two hours were filled using cubic spline interpolation. - The resulting time series were globally calibrated against the lab-measured concentrations of NO3-N and DOC. EC was calibrated against field values obtained with a handheld WTW probe (WTW Multi 430, Xylem Analytics Germany). - Noise in the signal of both discharge and water quality was reduced by a moving average with a window length of 2.5 hours.
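As a hedged sketch of the range check and windowed Grubbs test described above (not the authors' processing scripts), the snippet below applies a Grubbs test to consecutive blocks of 100 values; the physical limits, significance level, and the use of non-overlapping blocks rather than a sliding window are simplifications.

```python
# Sketch: (1) remove values outside a physically plausible range,
# (2) apply a Grubbs test to blocks of 100 values to flag single outliers.
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return the index of the most extreme value if it fails the Grubbs test, else None."""
    n = len(x)
    if n < 3:
        return None
    mean, sd = np.mean(x), np.std(x, ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if g > g_crit else None

def clean_series(values, lower=0.0, upper=50.0, window=100):
    values = np.array(values, dtype=float)
    values[(values < lower) | (values > upper)] = np.nan     # physically unlikely range
    for start in range(0, len(values) - window + 1, window):
        seg = values[start:start + window]                    # view into values
        ok = ~np.isnan(seg)
        bad = grubbs_outlier(seg[ok])
        if bad is not None:
            seg[np.where(ok)[0][bad]] = np.nan
    return values   # NaNs can then be gap-filled, e.g. by spline interpolation

no3 = np.clip(np.random.normal(3.0, 0.3, 1000), 0, None)
no3[123] = 15.0                                              # inject a spike
cleaned = clean_series(no3, lower=0.0, upper=20.0, window=100)
```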
References: Musolff, A. (2020). High frequency dataset for event-scale concentration-discharge analysis. https://doi.org/http://www.hydroshare.org/resource/27c93a3f4ee2467691a1671442e047b8 Musolff, A., Zhan, Q., Dupas, R., Minaudo, C., Fleckenstein, J. H., Rode, M., Dehaspe, J., & Rinke, K. (2021). Spatial and Temporal Variability in Concentration-Discharge Relationships at the Event Scale. Water Resources Research, 57(10). Werner, B. J., A. Musolff, O. J. Lechtenfeld, G. H. de Rooij, M. R. Oosterwoud, and J. H. Fleckenstein (2019), High-frequency measurements explain quantity and quality of dissolved organic carbon mobilization in a headwater catchment, Biogeosciences, 16(22), 4497-4516. Werner, B. J., Lechtenfeld, O. J., Musolff, A., de Rooij, G. H., Yang, J., Grundling, R., Werban, U., & Fleckenstein, J. H. (2021). Small-scale topography explains patterns and dynamics of dissolved organic carbon exports from the riparian zone of a temperate, forested catchment. Hydrology and Earth System Sciences, 25(12), 6067-6086.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The seafloor observation network experiment system in the South China Sea was completed in September 2016. This system provided energy supply and a communication transmission channel through an optical fiber composite power cable for the deep-ocean observation platform, enabling multi-parameter, real-time, and continuous ocean observation. The subsea dynamic platform with CTD and ADCP was deployed in June 2017, and collection of observation data started in July 2017, including temperature, conductivity, and water pressure from the CTD and velocity from the ADCP. Based on the raw observation data collected by the ADCP and CTD sensors from July 2017 to December 2018, a data processing and quality control algorithm was applied to remove outliers, fill missing values, and format the data, producing the final dataset. The dataset consists of 4 data files in total: Ocean dynamic datasets of South China Sea 2017 - ADCP.CSV, totaling 1.12 MB, Ocean dynamic datasets of South China Sea 2018 - ADCP.CSV, totaling 2.24 MB, Ocean dynamic datasets of South China Sea 2017 - CTD.CSV, totaling 35.6 MB, Ocean dynamic datasets of South China Sea 2018 - CTD.CSV, totaling 73 MB.
Single-wing images were captured from 14,354 pairs of field-collected tsetse wings of the species Glossina pallidipes and G. m. morsitans and analysed together with relevant biological data. To answer research questions regarding these flies, 11 anatomical landmark coordinates need to be located on each wing. Manual location of landmarks is time-consuming, prone to error, and simply infeasible given the number of images. Automatic landmark detection has been proposed to locate these landmark coordinates. We developed a two-tier method using deep learning architectures to classify images and make accurate landmark predictions. The first tier used a classification convolutional neural network to remove most wings that were missing landmarks. The second tier provided landmark coordinates for the remaining wings. For the second tier, we compared direct coordinate regression using a convolutional neural network with segmentation using a fully convolutional network. For the resulting landmark pred...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after removing noise, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal operation can be performed in ImageJ as follows: Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene-expression deconvolution is used to quantify different types of cells in a mixed population. It provides a highly promising solution to rapidly characterize the tumor-infiltrating immune landscape and identify cold cancers. However, a major challenge is that gene-expression data are frequently contaminated by many outliers that decrease the estimation accuracy. Thus, it is imperative to develop a robust deconvolution method that automatically decontaminates data by reliably detecting and removing outliers. We developed a new machine learning tool, Fast And Robust DEconvolution of Expression Profiles (FARDEEP), to enumerate immune cell subsets from whole tumor tissue samples. To reduce noise in the tumor gene expression datasets, FARDEEP utilizes an adaptive least trimmed square to automatically detect and remove outliers before estimating the cell compositions. We show that FARDEEP is less susceptible to outliers and returns a better estimation of coefficients than the existing methods with both numerical simulations and real datasets. FARDEEP provides an estimate related to the absolute quantity of each immune cell subset in addition to relative percentages. Hence, FARDEEP represents a novel robust algorithm to complement the existing toolkit for the characterization of tissue-infiltrating immune cell landscape. The source code for FARDEEP is implemented in R and available for download at https://github.com/YuningHao/FARDEEP.git.