Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
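To make the critiqued procedure concrete, here is a minimal sketch of the naive detect-and-forget workflow, using numpy only, simulated data, and an illustrative residual cutoff; the paper's point is that the second-stage confidence intervals and p-values from such a refit are invalid unless the selection step is accounted for, as the outference package does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Stage 1: fit OLS on the full data and flag "outliers" by a residual rule.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_full
keep = np.abs(resid) <= 2.5 * resid.std()        # illustrative cutoff

# Stage 2 ("forget"): refit on the kept rows and report naive inference as if
# they were the originally collected sample (the step the paper corrects).
beta_refit, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
print(beta_refit)
```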
This dataset is example data from the Norwegian Women and Cancer study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer (NOWAC) study on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.
Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured-light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are also provided. The 3D data contain many outliers, and the data were used to demonstrate a new filtering technique.
No description was included in this Dataset collected from the OSF
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores reflected the new scoring system and therefore could not be compared with the older scores on a raw-score basis. After these removals, a total of 420 student records remained, which included students who were currently enrolled, had left the doctoral program without a degree, or had left the doctoral program with an MS degree. To maintain consistency among the participants, we removed 100 additional records so that our analyses only considered students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below).
Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers, which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed.
Sample. See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality of outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between students who passed and students who failed the candidacy exam. Other variables such as gender, race, ethnicity, and citizenship status were also observed within the samples.
Predictive Metrics. The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.
Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree or the student’s candidacy examination result.
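ROUT is specific to the GraphPad Prism software (robust regression followed by outlier identification at a chosen false discovery rate). Purely as a rough, non-equivalent illustration of FDR-controlled flagging at Q = 1%, the sketch below applies a Benjamini-Hochberg cutoff to two-sided p-values computed from standardized regression residuals; it is not the ROUT algorithm used in the study, and the variable names at the bottom are hypothetical.

```python
import numpy as np
from scipy import stats

def flag_outliers_fdr(x, y, q=0.01):
    """Flag points whose regression residuals are surprising at FDR level q.

    Illustrative only: this is NOT GraphPad Prism's ROUT method, which uses
    robust (nonlinear) regression before its FDR-based outlier step.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    fit = stats.linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)
    z = (resid - resid.mean()) / resid.std(ddof=1)
    pvals = 2 * stats.norm.sf(np.abs(z))            # two-sided p-value per point

    # Benjamini-Hochberg step-up procedure at false discovery rate q.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    outliers = np.zeros(m, dtype=bool)
    outliers[order[:k]] = True
    return outliers

# Hypothetical usage: admissions scores against GRE quantitative percentiles.
# is_outlier = flag_outliers_fdr(gre_quant_percentile, admissions_score, q=0.01)
```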
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia, for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AOL Dataset for Browsing History and Topics of Interest
This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API (DOI: 10.1145/3658644.3670368; arXiv: 2406.15309).
The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11229402 (github.com/nunesgh/topics-api-analysis).
Files
AOL-treated.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies. It contains singletons (individuals with only one domain in their browsing histories) and one outlier (one user with 150,802 domain visits in three months) that are dropped in some analyses.
AOL-treated-unique-domains.csv: Auxiliary dataset containing all the unique domains from AOL-treated.csv.
Citizen-Lab-Classification.csv: Auxiliary dataset containing the Citizen Lab Classification data, as of commit ebd0ee8, treated for inconsistencies and filtered according to Mozilla's Public Suffix List, as of commit 5e6ac3a, extended by the discontinued TLDs: .bg.ac.yu, .ac.yu, .cg.yu, .co.yu, .edu.yu, .gov.yu, .net.yu, .org.yu, .yu, .or.tp, .tp, and .an.
AOL-treated-Citizen-Lab-Classification-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Citizen-Lab-Classification.csv.
Google-Topics-Classification-v1.txt: Auxiliary dataset containing the Google Topics API taxonomy v1 data as provided by Google with the Chrome browser.
AOL-treated-Google-Topics-Classification-v1-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Google-Topics-Classification-v1.txt.
AOL-reduced-Citizen-Lab-Classification.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
AOL-reduced-Google-Topics-Classification-v1.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
AOL-experimental.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
AOL-experimental-Citizen-Lab-Classification.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
AOL-experimental-Google-Topics-Classification-v1.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one EA is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official interview of the VHFPS survey and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO.
• Remove household duplicates in the dataset where the same form was submitted more than once.
• Remove observations of households which were not supposed to be interviewed according to the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.).
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were instructed to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team could decide which code best suited that answer.
• Correct data based on supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank field allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values lying below the 5th or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the different sections; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
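As a rough illustration of the percentile rule used in the outlier check above, the sketch below flags values below the 5th or above the 95th percentile of one numeric column with pandas; the file and column names are hypothetical, and in the actual cleaning the flagged values were verified against interview recordings rather than dropped automatically.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("round2_households.csv")
col = "food_expenditure"

# Flag values lying below the 5th or above the 95th percentile of the column.
lo, hi = df[col].quantile([0.05, 0.95])
flagged = df[(df[col] < lo) | (df[col] > hi)]

# These flagged rows would then be checked against the interview recordings.
print(flagged)
```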
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This index enables users to identify the extent of the relationship grids provided on LDS, which are used to convert heights provided in terms of one of 13 historic local vertical datums to NZVD2016. The polygons comprising the index show the extent of the conversion grids. Users can view the following polygon attributes:
Shape_VDR: Vertical Datum Relationship grid area
LVD: Local Vertical Datum
Control: Number of control marks used to compute the relationship grid
Mean: Mean vertical datum relationship value at control points
Std: Standard deviation of vertical datum relationship value at control points
Min: Minimum vertical datum relationship value at control points
Max: Maximum vertical datum relationship value at control points
Range: Range of vertical datum relationship value at control points
Ref: Reference control mark for the local vertical datum
Ref_value: Vertical datum relationship value at the reference mark
Grid: Formal grid id
Users should note that the values represented in this dataset have been calculated with the outliers excluded. These same outliers were excluded during the computation of the relationship grids, but were included when calculating the 95% confidence intervals. More information on converting heights between vertical datums can be found on the LINZ website.
Overview Basic meteorological measurements. Data Quality The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data. Uncertainty The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers. Constraints There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks
01_DATA # preprocessing and filtering of raw activity data from ChEMBL
- Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
- filt_stats.R # Filtering and preparation of raw data
- Filtered # output data sets from filt_stats.R
- toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity
02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
- datastore # files with all compounds and their calculated molecular descriptors based on SMILES
- scripts
- calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
- chemopy-1.1 # Python package used for descriptor calculation as described in: https://doi.org/10.1093/bioinformatics/btt105
03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
- datastore # output files with statistics calculated by make_Z.R
- scripts
- make_Z.R # script to calculate the statistics needed to compute Z-scores as used by the regression models
04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
- datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
- scripts
- calc_Ztable.py # calculates the learning data from activity data, molecular descriptors and Z-statistics
05_Regression # Performs regression: prepares the data by removing outliers based on a linear regression model, learns random forest regression models, and validates the learning process by cross-validation and hyperparameter tuning (a hedged sketch of this step follows the file tree below)
- datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
- scripts
- data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
- Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
- Rforest.R # learns the final models based on the analysis of Rforest_CV.R
rregrs_output
# early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
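A minimal sketch of what the 05_Regression step describes, written with scikit-learn rather than the repository's R scripts: fit a linear model, drop high-residual points as outliers, then tune and fit a random forest by cross-validation. The residual cutoff and hyperparameter grid are illustrative assumptions, not the settings used in data_preperation.R or Rforest_CV.R, and X and y are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_with_outlier_removal(X, y, resid_sd_cutoff=3.0):
    # Step 1: outlier removal based on a linear regression model
    # (illustrative cutoff; not the rule coded in data_preperation.R).
    lin = LinearRegression().fit(X, y)
    resid = y - lin.predict(X)
    keep = np.abs(resid) <= resid_sd_cutoff * resid.std()

    # Step 2: random forest regression with cross-validated hyperparameters
    # (number of trees, number of variables tried at each split).
    grid = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 500], "max_features": [0.3, 0.6, 1.0]},
        cv=5,
    )
    grid.fit(X[keep], y[keep])
    return grid.best_estimator_
```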
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternating-current sine-shaped signals. The signal frequencies range from 50 kHz to 20 MHz.
Applications: This dataset is intended to allow investigation of the human body as a signal propagation medium and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies affect signal loss over the human body.
Overview statistics:
Number of subjects: 30
Number of transmitter locations: 6
Number of receiver locations: 6
Number of measurement frequencies: 19
Input voltage: 1 V
Load resistance: 50 ohm and 1 megaohm
Measurement group statistics:
Height: 174.10 (7.15)
Weight: 72.85 (16.26)
BMI: 23.94 (4.70)
Body fat %: 21.53 (7.55)
Age group: 29.00 (11.25)
Male/female ratio: 50%
Included files:
experiment_protocol_description.docx - protocol used in the experiments
electrode_placement_schematic.png - schematic of placement locations
electrode_placement_photo.jpg - visualization of the experiment, on a volunteer subject
RawData - the full measurement results and experiment info sheets
all_measurements.csv - the most important results extracted to .csv
all_measurements_filtered.csv - same, but after z-score filtering
all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row
all_measurements_by_freq_filtered.csv - same, but after z-score filtering
summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets
process_json_files.py - script that creates .csv from the raw data
filter_results.py - outlier removal based on z-score (see the sketch after this file list)
plot_sample_curves.py - visualization of a randomly selected measurement result subset
plot_measurement_group.py - visualization of the measurement group
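For reference, a minimal sketch of the kind of z-score filtering that filter_results.py performs; the actual script's threshold and column handling may differ, and the 3-sigma cutoff and the example column name below (following the naming pattern described later) are assumptions.

```python
import pandas as pd

def zscore_filter(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose value in `column` lies more than `threshold` standard
    deviations from the column mean. Illustrative sketch only; the actual
    logic lives in filter_results.py and may differ."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= threshold]

# Hypothetical usage on one of the per-frequency gain columns.
measurements = pd.read_csv("all_measurements.csv")
filtered = zscore_filter(measurements, "rx_gain_50_f_1000000")
```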
CSV file columns:
subject_id - participant's random unique ID
experiment_id - measurement session's number for the participant
height - participant's height, cm
weight - participant's weight, kg
BMI - body mass index, computed from the values above
body_fat_% - body fat composition, as measured by bioimpedance scales
age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.
male - 1 if male, 0 if female
tx_point - transmitter point number
rx_point - receiver point number
distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!
tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.
rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.
total_fat_level - sum of rx and tx fat levels
bias - constant term to simplify data analytics, always equal to 1.0
CSV file columns, frequency-specific:
tx_abs_Z_... - transmitter-side impedance, as computed from the voltage drop by the process_json_files.py script
rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance
rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv
Overview Basic meteorological measurements. Data Quality The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data. Uncertainty The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers. Constraints There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.
Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.
Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet.
Pivot tables were created to summarise data and to create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables', and 'summaries' sheets.
Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QLD DNRM Hydrochemistry with QA/QC
Derived From QLD Hydrochemistry QA QC GAL v02
Derived From QLD DNRM Galilee Mine Groundwater Bores - Water Levels
Derived From Galilee bore water levels v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)
Derived From Geoscience Australia, 1 second SRTM Digital Elevation Model (DEM)
Derived From Carmichael Coal Mine and Rail Project Environmental Impact Statement
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.
PointDenoisingBenchmark for outlier removal: contains point clouds with different levels of noise and densities of outliers and the corresponding clean ground truths. PointDenoisingBenchmark for denoising: contains noisy point clouds with different levels of Gaussian noise and the corresponding clean ground truths.
Description: Methane concentration from the Greenland NEEM-2011-S1 Ice Core from 71 to 408 m depth (~270-1961 CE). Methane concentrations were analysed online by laser spectrometer (SARA, Spectroscopy by Amplified Resonant Absorption, developed at Laboratoire Interdisciplinaire de Physique, Grenoble, France) on gas extracted from an ice core processed using a continuous melter system (Desert Research Institute). Methane data have a 5 second integration time (raw data acquisition rate 0.6 Hz). Analytical precision, from an Allan variance test, is 0.9 ppb (2 sigma). Long-term reproducibility is 2.6% (2 sigma). Gaps in the record are due to problems during online analysis. Online analysis was conducted August-September 2011.
Note: The lat-long provided is for the main NEEM borehole. The NEEM-2011-S1 core was drilled 200 m away in 2011 to 410 m depth. Methane concentrations are reported on the NOAA2004 scale (instrument calibrated on dry synthetic air standards). A correction factor of 1.079 has been applied to all data to correct for methane dissolution in the melted ice core sample prior to gas extraction. The correction factor was calculated using empirical data (concentrations not aligned/tied to existing discrete methane measurements). Additional methods description is provided in: Stowasser, C., Buizert, C., Gkinis, V., Chappellaz, J., Schupbach, S., Bigler, M., Fain, X., Sperlich, P., Baumgartner, M., Schilt, A., Blunier, T., 2012. Continuous measurements of methane mixing ratios from ice cores. Atmos. Meas. Tech. 5, 999-1013. Morville, J., Kassi, S., Chenevier, M., Romanini, D., 2005. Fast, low-noise, mode-by-mode, cavity-enhanced absorption spectroscopy by diode-laser self-locking. Appl. Phys. B Lasers Opt. 80, 1027-1038. NEEM (North Greenland Eemian Ice Drilling) project information: http://neem.dk/
NEEM-2011-S1 CH4 outliers: Data points removed from the dataset according to a specified cut-off value. Please refer to Rhodes et al. (2013) for a full discussion of the origins of the outlying data points. Briefly, these high frequency features are not artifacts of the continuous method and have been replicated by traditional discrete analyses. Comparison to chemistry measurements suggests they are related to biological in situ production of methane.
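As a simple worked example of the multiplicative dissolution correction described above (the measured value here is hypothetical, only the factor comes from the dataset description):

```python
# Correction factor from the dataset description, applied multiplicatively
# to compensate for CH4 dissolution in the melted sample before extraction.
DISSOLUTION_CORRECTION = 1.079

measured_ch4_ppb = 750.0                                        # hypothetical raw value
corrected_ch4_ppb = measured_ch4_ppb * DISSOLUTION_CORRECTION   # 809.25 ppb
```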
Overview
Measurements of surface sensible heat flux, momentum flux, wind components, and virtual temperature.
Data Details
Data Quality
The Argonne National Laboratory sonic anemometer measurements are visually inspected weekly for data outliers or instrument problems. The final dataset sent to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process used for Atmospheric Radiation Measurement (ARM) program eddy correlation (ECOR) data.
Uncertainty
The uncertainties of the basic sonic anemometer measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturer. Based on historical experience with this measurement technique, flux measurement uncertainty is +/- 10 percent, although the uncertainty can be much greater during stable atmospheric conditions when turbulence intensity and atmospheric gradients are small and advection from beyond the normal fetch can occur. In particular, the Physics Site-12 tower's sonic anemometer measurements can have greater uncertainty when the wind blows through the tower structure.
Constraints
During stable atmospheric conditions, turbulence intensity and atmospheric gradients often are small, approaching or exceeding the measurement resolution of sonic anemometers. Under these conditions, advection from beyond the normal fetch also can occur, making interpretation of the fluxes difficult. Notably, the Physics Site-12 tower sonic anemometer measurements can be affected when the wind blows through the tower structure. Some unusual biases in the vertical velocities measured at the Physics Site-12 tower with west wind conditions also have not been adequately explained.
Dataset consisting of continuous measurements of temperature and salinity from bottom-moored instruments. The measurements are obtained at fixed depths at varying locations across the Hornsund fjord. All data are averaged to one-hour intervals. The files are named with the mooring ID consisting of the measured parameters (CTD for Conductivity-Temperature-Depth, TD for Temperature-Depth, T for Temperature) and a running number, followed by the deployment and recovery dates (format YYYYMMDD) as well as the stage of data processing (for this dataset "hourly"). However, when observations are made at one of the stations included in the CTD monitoring program, the station name is used instead of a mooring ID. The header in each file consists of 10 lines and includes information on geographical location (decimal degrees), deployment and recovery dates (YYYY-MM-DDThh:mm:ss), bottom and instrument depths, the equipment used for measurements, and the source of financial support. There are 4-7 data columns. For the T and TD moorings, the columns are Date/Time (YYYY-MM-DDThh:mm:ss), Pressure (dbar), Depth (m) and Temperature (°C). For the moorings without a pressure sensor (only T), the pressure and depth columns are marked as NaN and the average instrument depth can be found in the header. The CTD moorings include additional columns for Potential temperature (°C), Practical salinity and Density represented as Sigma-Theta (kg/m**3). Conductivity is not included in this dataset, but can be found in the raw data. Suspicious data and outliers are detected and removed, and the data are smoothed. No interpolation is performed and missing data are marked with NaN. The data columns are tab-delimited and the data are stored in ASCII-formatted .txt files.
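A minimal sketch of the hourly averaging described above, using pandas; the file name, column layout, and header length below are simplified assumptions about the raw input, and no interpolation is performed, so hours without observations remain NaN.

```python
import pandas as pd

# Hypothetical raw mooring file; the real files are tab-delimited with a
# 10-line header, so both the name and the column list here are assumptions.
df = pd.read_csv(
    "TD1_20200801_20210715_raw.txt",
    sep="\t",
    skiprows=10,
    names=["datetime", "pressure_dbar", "depth_m", "temperature_C"],
    parse_dates=["datetime"],
)

# Average to one-hour intervals; hours without observations stay NaN,
# since no interpolation is performed.
hourly = df.set_index("datetime").resample("1h").mean()
```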
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
Change Log
Version 2
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.