Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
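To make the critiqued procedure concrete, here is a minimal sketch of the naive detect-and-forget workflow, using numpy only, simulated data, and an illustrative residual cutoff; the paper's point is that the second-stage confidence intervals and p-values from such a refit are invalid unless the selection step is accounted for, as the outference package does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Stage 1: fit OLS on the full data and flag "outliers" by a residual rule.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_full
keep = np.abs(resid) <= 2.5 * resid.std()        # illustrative cutoff

# Stage 2 ("forget"): refit on the kept rows and report naive inference as if
# they were the originally collected sample (the step the paper corrects).
beta_refit, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
print(beta_refit)
```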
This dataset is example data from the Norwegian Women and Cancer study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer (NOWAC) study on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.
Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured-light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are also provided. The 3D data contain many outliers, and the data were used to demonstrate a new filtering technique.
No description was included in this Dataset collected from the OSF
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores reflected the new scoring system and therefore could not be compared with the older scores on a raw-score basis. After these removals, a total of 420 student records remained, which included students who were currently enrolled, had left the doctoral program without a degree, or had left the doctoral program with an MS degree. To maintain consistency among the participants, we removed 100 additional records so that our analyses only considered students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below).
Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers, which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed.
Sample. See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality of outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between students who passed and students who failed the candidacy exam. Other variables such as gender, race, ethnicity, and citizenship status were also observed within the samples.
Predictive Metrics. The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.
Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree or the student’s candidacy examination result.
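ROUT is specific to the GraphPad Prism software (robust regression followed by outlier identification at a chosen false discovery rate). Purely as a rough, non-equivalent illustration of FDR-controlled flagging at Q = 1%, the sketch below applies a Benjamini-Hochberg cutoff to two-sided p-values computed from standardized regression residuals; it is not the ROUT algorithm used in the study, and the variable names at the bottom are hypothetical.

```python
import numpy as np
from scipy import stats

def flag_outliers_fdr(x, y, q=0.01):
    """Flag points whose regression residuals are surprising at FDR level q.

    Illustrative only: this is NOT GraphPad Prism's ROUT method, which uses
    robust (nonlinear) regression before its FDR-based outlier step.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    fit = stats.linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)
    z = (resid - resid.mean()) / resid.std(ddof=1)
    pvals = 2 * stats.norm.sf(np.abs(z))            # two-sided p-value per point

    # Benjamini-Hochberg step-up procedure at false discovery rate q.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    outliers = np.zeros(m, dtype=bool)
    outliers[order[:k]] = True
    return outliers

# Hypothetical usage: admissions scores against GRE quantitative percentiles.
# is_outlier = flag_outliers_fdr(gre_quant_percentile, admissions_score, q=0.01)
```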
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia, for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AOL Dataset for Browsing History and Topics of Interest
This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API (DOI: 10.1145/3658644.3670368; arXiv: 2406.15309).
The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11229402 (github.com/nunesgh/topics-api-analysis).
Files
AOL-treated.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies. It contains singletons (individuals with only one domain in their browsing histories) and one outlier (one user with 150,802 domain visits in three months) that are dropped in some analyses.
AOL-treated-unique-domains.csv: Auxiliary dataset containing all the unique domains from AOL-treated.csv.
Citizen-Lab-Classification.csv: Auxiliary dataset containing the Citizen Lab Classification data, as of commit ebd0ee8, treated for inconsistencies and filtered according to Mozilla's Public Suffix List, as of commit 5e6ac3a, extended by the discontinued TLDs: .bg.ac.yu, .ac.yu, .cg.yu, .co.yu, .edu.yu, .gov.yu, .net.yu, .org.yu, .yu, .or.tp, .tp, and .an.
AOL-treated-Citizen-Lab-Classification-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Citizen-Lab-Classification.csv.
Google-Topics-Classification-v1.txt: Auxiliary dataset containing the Google Topics API taxonomy v1 data as provided by Google with the Chrome browser.
AOL-treated-Google-Topics-Classification-v1-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Google-Topics-Classification-v1.txt.
AOL-reduced-Citizen-Lab-Classification.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
AOL-reduced-Google-Topics-Classification-v1.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
AOL-experimental.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
AOL-experimental-Citizen-Lab-Classification.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
AOL-experimental-Google-Topics-Classification-v1.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one EA is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official interview of the VHFPS survey and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO.
• Remove household duplicates in the dataset where the same form was submitted more than once.
• Remove observations of households which were not supposed to be interviewed according to the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.).
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were instructed to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team could decide which code best suited that answer.
• Correct data based on supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank field allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying below the 5th or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the different sections; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
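As a rough illustration of the percentile rule used in the outlier check above, the sketch below flags values below the 5th or above the 95th percentile of one numeric column with pandas; the file and column names are hypothetical, and in the actual cleaning the flagged values were verified against interview recordings rather than dropped automatically.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("round2_households.csv")
col = "food_expenditure"

# Flag values lying below the 5th or above the 95th percentile of the column.
lo, hi = df[col].quantile([0.05, 0.95])
flagged = df[(df[col] < lo) | (df[col] > hi)]

# These flagged rows would then be checked against the interview recordings.
print(flagged)
```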
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This index enables users to identify the extent of the relationship grids provided on LDS, which are used to convert heights provided in terms of one of 13 historic local vertical datums to NZVD2016. The polygons comprising the index show the extent of the conversion grids. Users can view the following polygon attributes:
Shape_VDR: Vertical Datum Relationship grid area
LVD: Local Vertical Datum
Control: Number of control marks used to compute the relationship grid
Mean: Mean vertical datum relationship value at control points
Std: Standard deviation of vertical datum relationship value at control points
Min: Minimum vertical datum relationship value at control points
Max: Maximum vertical datum relationship value at control points
Range: Range of vertical datum relationship value at control points
Ref: Reference control mark for the local vertical datum
Ref_value: Vertical datum relationship value at the reference mark
Grid: Formal grid id
Users should note that the values represented in this dataset have been calculated with the outliers excluded. These same outliers were excluded during the computation of the relationship grids, but were included when calculating the 95% confidence intervals. More information on converting heights between vertical datums can be found on the LINZ website.
Overview Basic meteorological measurements. Data Quality The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data. Uncertainty The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers. Constraints There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks
01_DATA # preprocessing and filtering of raw activity data from ChEMBL
- Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
- filt_stats.R # Filtering and preparation of raw data
- Filtered # output data sets from filt_stats.R
- toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity
02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
- datastore # files with all compounds and their calculated molecular descriptors based on SMILES
- scripts
- calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
- chemopy-1.1 # Python package used for descriptor calculation as described in: https://doi.org/10.1093/bioinformatics/btt105
03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
- datastore # output files with statistics calculated by make_Z.R
- scripts
- make_Z.R # script to calculate the statistics needed to compute Z-scores as used by the regression models
04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
- datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
- scripts
- calc_Ztable.py # calculates the learning data from activity data, molecular descriptors and Z-statistics
05_Regression # Performs regression: prepares the data by removing outliers based on a linear regression model, learns random forest regression models, and validates the learning process by cross-validation and hyperparameter tuning (a hedged sketch of this step follows the file tree below)
- datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
- scripts
- data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
- Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
- Rforest.R # learns the final models based on the analysis of Rforest_CV.R
rregrs_output
# early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
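A minimal sketch of what the 05_Regression step describes, written with scikit-learn rather than the repository's R scripts: fit a linear model, drop high-residual points as outliers, then tune and fit a random forest by cross-validation. The residual cutoff and hyperparameter grid are illustrative assumptions, not the settings used in data_preperation.R or Rforest_CV.R, and X and y are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_with_outlier_removal(X, y, resid_sd_cutoff=3.0):
    # Step 1: outlier removal based on a linear regression model
    # (illustrative cutoff; not the rule coded in data_preperation.R).
    lin = LinearRegression().fit(X, y)
    resid = y - lin.predict(X)
    keep = np.abs(resid) <= resid_sd_cutoff * resid.std()

    # Step 2: random forest regression with cross-validated hyperparameters
    # (number of trees, number of variables tried at each split).
    grid = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 500], "max_features": [0.3, 0.6, 1.0]},
        cv=5,
    )
    grid.fit(X[keep], y[keep])
    return grid.best_estimator_
```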
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternating-current sine-shaped signals. The signal frequencies range from 50 kHz to 20 MHz.
Applications: This dataset is intended to allow investigation of the human body as a signal propagation medium and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies affect signal loss over the human body.
Overview statistics:
Number of subjects: 30
Number of transmitter locations: 6
Number of receiver locations: 6
Number of measurement frequencies: 19
Input voltage: 1 V
Load resistance: 50 ohm and 1 megaohm
Measurement group statistics:
Height: 174.10 (7.15)
Weight: 72.85 (16.26)
BMI: 23.94 (4.70)
Body fat %: 21.53 (7.55)
Age group: 29.00 (11.25)
Male/female ratio: 50%
Included files:
experiment_protocol_description.docx - protocol used in the experiments
electrode_placement_schematic.png - schematic of placement locations
electrode_placement_photo.jpg - visualization of the experiment, on a volunteer subject
RawData - the full measurement results and experiment info sheets
all_measurements.csv - the most important results extracted to .csv
all_measurements_filtered.csv - same, but after z-score filtering
all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row
all_measurements_by_freq_filtered.csv - same, but after z-score filtering
summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets
process_json_files.py - script that creates .csv from the raw data
filter_results.py - outlier removal based on z-score (see the sketch after this file list)
plot_sample_curves.py - visualization of a randomly selected measurement result subset
plot_measurement_group.py - visualization of the measurement group
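For reference, a minimal sketch of the kind of z-score filtering that filter_results.py performs; the actual script's threshold and column handling may differ, and the 3-sigma cutoff and the example column name below (following the naming pattern described later) are assumptions.

```python
import pandas as pd

def zscore_filter(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose value in `column` lies more than `threshold` standard
    deviations from the column mean. Illustrative sketch only; the actual
    logic lives in filter_results.py and may differ."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= threshold]

# Hypothetical usage on one of the per-frequency gain columns.
measurements = pd.read_csv("all_measurements.csv")
filtered = zscore_filter(measurements, "rx_gain_50_f_1000000")
```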
CSV file columns:
subject_id - participant's random unique ID
experiment_id - measurement session's number for the participant
height - participant's height, cm
weight - participant's weight, kg
BMI - body mass index, computed from the values above
body_fat_% - body fat composition, as measured by bioimpedance scales
age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.
male - 1 if male, 0 if female
tx_point - transmitter point number
rx_point - receiver point number
distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!
tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.
rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.
total_fat_level - sum of rx and tx fat levels
bias - constant term to simplify data analytics, always equal to 1.0
CSV file columns, frequency-specific:
tx_abs_Z_... - transmitter-side impedance, as computed from the voltage drop by the process_json_files.py script
rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance
rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv
Overview Basic meteorological measurements. Data Quality The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data. Uncertainty The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers. Constraints There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.
Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.
Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet.
Pivot tables were created to summarise data and to create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables', and 'summaries' sheets.
Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Hydrochemistry QA QC GAL v02
Derived From QLD DNRM Galilee Mine Groundwater Bores - Water Levels
Derived From Galilee bore water levels v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.
PointDenoisingBenchmark for outlier removal: contains point clouds with different levels of noise and densities of outliers and the corresponding clean ground truths. PointDenoisingBenchmark for denoising: contains noisy point clouds with different levels of Gaussian noise and the corresponding clean ground truths.
Description: Methane concentration from the Greenland NEEM-2011-S1 Ice Core from 71 to 408 m depth (~270-1961 CE). Methane concentrations were analysed online by laser spectrometer (SARA, Spectroscopy by Amplified Resonant Absorption, developed at Laboratoire Interdisciplinaire de Physique, Grenoble, France) on gas extracted from an ice core processed using a continuous melter system (Desert Research Institute). Methane data have a 5 second integration time (raw data acquisition rate 0.6 Hz). Analytical precision, from an Allan variance test, is 0.9 ppb (2 sigma). Long-term reproducibility is 2.6% (2 sigma). Gaps in the record are due to problems during online analysis. Online analysis was conducted August-September 2011.
Note: The lat-long provided is for the main NEEM borehole. The NEEM-2011-S1 core was drilled 200 m away in 2011 to 410 m depth. Methane concentrations are reported on the NOAA2004 scale (instrument calibrated on dry synthetic air standards). A correction factor of 1.079 has been applied to all data to correct for methane dissolution in the melted ice core sample prior to gas extraction. The correction factor was calculated using empirical data (concentrations not aligned/tied to existing discrete methane measurements). Additional methods description is provided in: Stowasser, C., Buizert, C., Gkinis, V., Chappellaz, J., Schupbach, S., Bigler, M., Fain, X., Sperlich, P., Baumgartner, M., Schilt, A., Blunier, T., 2012. Continuous measurements of methane mixing ratios from ice cores. Atmos. Meas. Tech. 5, 999-1013. Morville, J., Kassi, S., Chenevier, M., Romanini, D., 2005. Fast, low-noise, mode-by-mode, cavity-enhanced absorption spectroscopy by diode-laser self-locking. Appl. Phys. B Lasers Opt. 80, 1027-1038. NEEM (North Greenland Eemian Ice Drilling) project information: http://neem.dk/
NEEM-2011-S1 CH4 outliers: Data points removed from the dataset according to a specified cut-off value. Please refer to Rhodes et al. (2013) for a full discussion of the origins of the outlying data points. Briefly, these high frequency features are not artifacts of the continuous method and have been replicated by traditional discrete analyses. Comparison to chemistry measurements suggests they are related to biological in situ production of methane.
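As a simple worked example of the multiplicative dissolution correction described above (the measured value here is hypothetical, only the factor comes from the dataset description):

```python
# Correction factor from the dataset description, applied multiplicatively
# to compensate for CH4 dissolution in the melted sample before extraction.
DISSOLUTION_CORRECTION = 1.079

measured_ch4_ppb = 750.0                                        # hypothetical raw value
corrected_ch4_ppb = measured_ch4_ppb * DISSOLUTION_CORRECTION   # 809.25 ppb
```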
Overview
Measurements of surface sensible heat flux, momentum flux, wind components, and virtual temperature.
Data Details
Data Quality
The Argonne National Laboratory sonic anemometer measurements are visually inspected weekly for data outliers or instrument problems. The final dataset sent to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process used for Atmospheric Radiation Measurement (ARM) program eddy correlation (ECOR) data.
Uncertainty
The uncertainties of the basic sonic anemometer measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturer. Based on historical experience with this measurement technique, flux measurement uncertainty is +/- 10 percent, although the uncertainty can be much greater during stable atmospheric conditions when turbulence intensity and atmospheric gradients are small and advection from beyond the normal fetch can occur. In particular, the Physics Site-12 tower's sonic anemometer measurements can have greater uncertainty when the wind blows through the tower structure.
Constraints
During stable atmospheric conditions, turbulence intensity and atmospheric gradients often are small, approaching or exceeding the measurement resolution of sonic anemometers. Under these conditions, advection from beyond the normal fetch also can occur, making interpretation of the fluxes difficult. Notably, the Physics Site-12 tower sonic anemometer measurements can be affected when the wind blows through the tower structure. Some unusual biases in the vertical velocities measured at the Physics Site-12 tower with west wind conditions also have not been adequately explained.
Dataset consisting of continuous measurements of temperature and salinity from bottom-moored instruments. The measurements are obtained at fixed depths at varying locations across the Hornsund fjord. All data are averaged to one-hour intervals. The files are named with the mooring ID consisting of the measured parameters (CTD for Conductivity-Temperature-Depth, TD for Temperature-Depth, T for Temperature) and a running number, followed by the deployment and recovery dates (format YYYYMMDD) as well as the stage of data processing (for this dataset "hourly"). However, when observations are made at one of the stations included in the CTD monitoring program, the station name is used instead of a mooring ID. The header in each file consists of 10 lines and includes information on geographical location (decimal degrees), deployment and recovery dates (YYYY-MM-DDThh:mm:ss), bottom and instrument depths, the equipment used for measurements, and the source of financial support. There are 4-7 data columns. For the T and TD moorings, the columns are Date/Time (YYYY-MM-DDThh:mm:ss), Pressure (dbar), Depth (m) and Temperature (°C). For the moorings without a pressure sensor (only T), the pressure and depth columns are marked as NaN and the average instrument depth can be found in the header. The CTD moorings include additional columns for Potential temperature (°C), Practical salinity and Density represented as Sigma-Theta (kg/m**3). Conductivity is not included in this dataset, but can be found in the raw data. Suspicious data and outliers are detected and removed, and the data are smoothed. No interpolation is performed and missing data are marked with NaN. The data columns are tab-delimited and the data are stored in ASCII-formatted .txt files.
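A minimal sketch of the hourly averaging described above, using pandas; the file name, column layout, and header length below are simplified assumptions about the raw input, and no interpolation is performed, so hours without observations remain NaN.

```python
import pandas as pd

# Hypothetical raw mooring file; the real files are tab-delimited with a
# 10-line header, so both the name and the column list here are assumptions.
df = pd.read_csv(
    "TD1_20200801_20210715_raw.txt",
    sep="\t",
    skiprows=10,
    names=["datetime", "pressure_dbar", "depth_m", "temperature_C"],
    parse_dates=["datetime"],
)

# Average to one-hour intervals; hours without observations stay NaN,
# since no interpolation is performed.
hourly = df.set_index("datetime").resample("1h").mean()
```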
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
Change Log
Version 2
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.