43 datasets found
  1. Data from: Valid Inference Corrected for Outlier Removal

    • figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
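
    The “detect-and-forget” workflow described above is easy to reproduce; the sketch below (Python, on made-up data, not the authors' outference package) drops high-residual points and naively refits OLS, which is exactly the step that makes the resulting intervals and p-values invalid.

    # Illustrative sketch of the naive "detect-and-forget" workflow (not the
    # selective-inference correction from the paper): fit OLS, drop points with
    # large studentized residuals, refit, and report naive intervals.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

    fit = sm.OLS(y, X).fit()
    resid = fit.outlier_test()["student_resid"]   # studentized residuals
    keep = np.abs(resid) < 3                      # ad-hoc outlier rule

    refit = sm.OLS(y[keep], X[keep]).fit()
    print(refit.conf_int())  # naive intervals: ignore the removal step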

  2. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1more
    Updated Jul 29, 2024
    + more versions
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. https://search.dataone.org/view/sha256%3A08484b821e24ce46dbeb405a81e84d7457a8726456522e23d340739f2ff809ae
    Explore at:
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset contains example data from the Norwegian Women and Cancer (NOWAC) study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the NOWAC study on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  3. Data for Filtering Organized 3D Point Clouds for Bin Picking Applications

    • datasets.ai
    • catalog.data.gov
    0, 34, 47
    Updated Aug 6, 2024
    Cite
    National Institute of Standards and Technology (2024). Data for Filtering Organized 3D Point Clouds for Bin Picking Applications [Dataset]. https://datasets.ai/datasets/data-for-filtering-organized-3d-point-clouds-for-bin-picking-applications
    Explore at:
    Available download formats: 0, 34, 47
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured-light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are also provided. The 3D data contain a large number of outliers, and the data were used to demonstrate a new filtering technique.
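
    For context, a generic statistical outlier filter for an unorganized point cloud can be sketched in a few lines; the snippet below drops points whose mean distance to their eight nearest neighbours is unusually large. It is only an illustration with a hypothetical file name, not the filtering technique this dataset was created to demonstrate.

    # Generic statistical outlier removal for an unorganized point cloud:
    # drop points whose mean k-nearest-neighbour distance is more than two
    # standard deviations above the average. Illustrative only.
    import numpy as np
    from scipy.spatial import cKDTree

    points = np.loadtxt("bin_scan.xyz")      # hypothetical N x 3 point cloud
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=9)       # each point plus 8 neighbours
    mean_d = dists[:, 1:].mean(axis=1)       # skip the zero self-distance

    threshold = mean_d.mean() + 2.0 * mean_d.std()
    filtered = points[mean_d < threshold]
    print(f"kept {len(filtered)} of {len(points)} points")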

  4. Outlier removal, sum scores, and the inflation of the Type I error rate

    • osf.io
    Updated Sep 20, 2016
    Cite
    Marjan Bakker; Jelte Wicherts (2016). Outlier removal, sum scores, and the inflation of the Type I error rate [Dataset]. https://osf.io/95xqz
    Explore at:
    Dataset updated
    Sep 20, 2016
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Marjan Bakker; Jelte Wicherts
    Description

    No description was included in this Dataset collected from the OSF

  5. Predictive Validity Data Set

    • figshare.com
    txt
    Updated Dec 18, 2022
    Cite
    Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Antonio Abeyta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores reflected the new scoring system and therefore could not be compared to the older scores on a raw-score basis. After removal of these 96 records from our analyses, a total of 420 student records remained, which included students that were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency in the participants, we removed 100 additional records so that our analyses only considered students that had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below).

    Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed.

    Sample. See detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality regarding outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables were also observed such as gender, race, ethnicity, and citizenship status within the samples.

    Predictive Metrics. The input variables used in this study were GPA and scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.

    Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student’s candidacy examination result.
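
    For readers who want to reproduce this style of analysis outside PRISM, the sketch below runs the same family of tests with SciPy. Column names are hypothetical, and a simple z-score screen stands in for PRISM's ROUT procedure, which has no direct SciPy equivalent.

    # Hedged sketch of the described workflow: outlier screening, normality
    # checks, correlation with time to degree, and a Mann-Whitney comparison
    # of candidacy-exam outcomes. Column names are hypothetical.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("predictive_validity.csv")   # hypothetical file name

    # Placeholder outlier screen (the study used PRISM's ROUT with Q = 1%)
    z = (df["admissions_score"] - df["admissions_score"].mean()) / df["admissions_score"].std()
    clean = df[z.abs() < 3]

    # Normality tests on the outcome
    print(stats.shapiro(clean["time_to_degree"]))
    print(stats.normaltest(clean["time_to_degree"]))   # D'Agostino & Pearson omnibus

    # Pearson correlation between GRE quantitative score and time to degree
    print(stats.pearsonr(clean["gre_quant"], clean["time_to_degree"]))

    # Mann-Whitney test: GRE scores of students who passed vs. failed candidacy
    passed = clean.loc[clean["candidacy"] == "pass", "gre_quant"]
    failed = clean.loc[clean["candidacy"] == "fail", "gre_quant"]
    print(stats.mannwhitneyu(passed, failed))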

  6. 11: Streamwater sample constituent concentration outliers from 15 watersheds...

    • s.cnmilf.com
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/11-streamwater-sample-constituent-concentration-outliers-from-15-watersheds-in-gwinne-2003
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Gwinnett County, Georgia
    Description

    This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents in streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.

  7. Data from: AOL Dataset for Browsing History and Topics of Interest

    • zenodo.org
    • data.niaid.nih.gov
    csv, txt
    Updated Jun 24, 2024
    Cite
    Gabriel Henrique Nunes; Gabriel Henrique Nunes (2024). AOL Dataset for Browsing History and Topics of Interest [Dataset]. http://doi.org/10.5281/zenodo.11229615
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Henrique Nunes; Gabriel Henrique Nunes
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AOL Dataset for Browsing History and Topics of Interest

    This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API (DOI: 10.1145/3658644.3670368; arXiv: 2406.15309).

    The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11229402 (github.com/nunesgh/topics-api-analysis).

    Files

    1. AOL-treated.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies. It contains singletons (individuals with only one domain in their browsing histories) and one outlier (one user with 150,802 domain visits in three months) that are dropped in some analyses.
    2. AOL-treated-unique-domains.csv: Auxiliary dataset containing all the unique domains from AOL-treated.csv.
    3. Citizen-Lab-Classification.csv: Auxiliary dataset containing the Citizen Lab Classification data, as of commit ebd0ee8, treated for inconsistencies and filtered according to Mozilla's Public Suffix List, as of commit 5e6ac3a, extended by the discontinued TLDs: .bg.ac.yu, .ac.yu, .cg.yu, .co.yu, .edu.yu, .gov.yu, .net.yu, .org.yu, .yu, .or.tp, .tp, and .an.
    4. AOL-treated-Citizen-Lab-Classification-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Citizen-Lab-Classification.csv.
    5. Google-Topics-Classification-v1.txt: Auxiliary dataset containing the Google Topics API taxonomy v1 data as provided by Google with the Chrome browser.
    6. AOL-treated-Google-Topics-Classification-v1-domain-match.csv: Auxiliary dataset containing domains matched from AOL-treated-unique-domains.csv with domains and respective topics from Google-Topics-Classification-v1.txt.
    7. AOL-reduced-Citizen-Lab-Classification.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
      This dataset can be used for analyses including the (data-dependent) randomness of trimming-down or filling-up the top-s sets of topics for each individual so each set has s topics. Privacy results for Generalization and utility results for Generalization, Bounded Noise, and Differential Privacy are expected to slightly vary with each run of the analyses over this dataset.
    8. AOL-reduced-Google-Topics-Classification-v1.csv: This dataset can be used for analyses of browsing history vulnerability and utility, as enabled by third-party cookies, and for analyses of topics of interest vulnerability and utility, as enabled by the Topics API. It contains singletons and the outlier that are dropped in some analyses.
      This dataset can be used for analyses including the (data-dependent) randomness of trimming-down or filling-up the top-s sets of topics for each individual so each set has s topics. Privacy results for Generalization and utility results for Generalization, Bounded Noise, and Differential Privacy are expected to slightly vary with each run of the analyses over this dataset.
    9. AOL-experimental.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
    10. AOL-experimental-Citizen-Lab-Classification.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.
    11. AOL-experimental-Google-Topics-Classification-v1.csv: This dataset can be used to empirically verify code correctness for 10.5281/zenodo.11229402. All privacy and utility results are expected to remain the same with each run of the analyses over this dataset.

    License

    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

  8. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected, and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official interview of the VHFPS survey and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    • Section 2. Behavior
    • Section 3. Health
    • Section 5. Employment (main respondent)
    • Section 6. Coping
    • Section 7. Safety Nets
    • Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process include interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:

    • Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables which were automatically calculated by SurveyCTO.
    • Remove household duplicates in the dataset where the same form was submitted more than once.
    • Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are recommended to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code best suits that answer.
    • Correct data based on supervisors’ notes where enumerators entered a wrong code.
    • Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether it needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values that lie outside the 5th and 95th percentiles, by listening to interview recordings (see the sketch below).
    • Final check on matching the main dataset with the different sections, where information asked at the individual level is kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
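
    A small sketch of the percentile screen mentioned above, flagging values outside the 5th-95th percentile range of a numeric variable for manual review (the file and variable names are hypothetical):

    # Flag observations outside the 5th-95th percentile range of a numeric
    # survey variable so their interview recordings can be re-checked.
    import pandas as pd

    df = pd.read_csv("vhfps_round2.csv")   # hypothetical file name
    col = "household_income"               # hypothetical variable

    lo, hi = df[col].quantile([0.05, 0.95])
    flagged = df[(df[col] < lo) | (df[col] > hi)]
    print(f"{len(flagged)} observations flagged for review")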

  9. NZ Height Conversion Index

    • catalogue.data.govt.nz
    Updated Sep 30, 2020
    + more versions
    Cite
    (2020). NZ Height Conversion Index - Dataset - data.govt.nz - discover and use data [Dataset]. https://catalogue.data.govt.nz/dataset/nz-height-conversion-index1
    Explore at:
    Dataset updated
    Sep 30, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New Zealand
    Description

    This index enables users to identify the extent of the relationship grids provided on LDS, which are used to convert heights provided in terms of one of 13 historic local vertical datums to NZVD2016. The polygons comprising the index show the extent of the conversion grids. Users can view the following polygon attributes:

    • Shape_VDR: Vertical Datum Relationship grid area
    • LVD: Local Vertical Datum
    • Control: Number of control marks used to compute the relationship grid
    • Mean: Mean vertical datum relationship value at control points
    • Std: Standard deviation of vertical datum relationship value at control points
    • Min: Minimum vertical datum relationship value at control points
    • Max: Maximum vertical datum relationship value at control points
    • Range: Range of vertical datum relationship value at control points
    • Ref: Reference control mark for the local vertical datum
    • Ref_value: Vertical datum relationship value at the reference mark
    • Grid: Formal grid id

    Users should note that the values represented in this dataset have been calculated with the outliers excluded. These same outliers were excluded during the computation of the relationship grids, but were included when calculating the 95% confidence intervals. More information on converting heights between vertical datums can be found on the LINZ website.

  10. Data from: Surface Meteorological Station - ANL 10m tower, Walla Walla - Raw...

    • catalog.data.gov
    • data.openei.org
    • +1more
    Updated Aug 7, 2021
    + more versions
    Cite
    Wind Energy Technologies Office (WETO) (2021). Surface Meteorological Station - ANL 10m tower, Walla Walla - Raw Data [Dataset]. https://catalog.data.gov/dataset/sodar-vaisala-triton-wind-profiler-aon7-reviewed-data
    Explore at:
    Dataset updated
    Aug 7, 2021
    Dataset provided by
    Wind Energy Technologies Office (WETO)
    Description

    Overview: Basic meteorological measurements.

    Data Quality: The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data.

    Uncertainty: The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers.

    Constraints: There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.

  11. Surface Meteorological Station - ANL 10m tower, Yakima - Raw Data

    • catalog.data.gov
    • data.openei.org
    • +1more
    Updated Aug 7, 2021
    + more versions
    Cite
    Wind Energy Technologies Office (WETO) (2021). Surface Meteorological Station - ANL 10m tower, Yakima - Raw Data [Dataset]. https://catalog.data.gov/dataset/sodar-vaisala-triton-wind-profiler-aon8-raw-data
    Explore at:
    Dataset updated
    Aug 7, 2021
    Dataset provided by
    Wind Energy Technologies Office (WETO)
    Description

    Overview: Basic meteorological measurements.

    Data Quality: The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data.

    Uncertainty: The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers.

    Constraints: There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.

  12. Machine learning pipeline to train toxicity prediction model of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates for all compounds based on their smiles the molecular descriptors
    - chemopy-1.1 # python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to calculate the statistics needed to compute Z-scores as used by the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

    05_Regression # Performing regression. Preparation of data by removing outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross-validation and tuning of hyperparameters.

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output
    # early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
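
    As a rough Python analogue of the 05_Regression step (outlier removal under a linear model, then random-forest regression with cross-validation), the sketch below illustrates the idea; it is not a translation of the R scripts above, and the table and column names are hypothetical.

    # Illustrative analogue of 05_Regression: drop points with large residuals
    # under a linear fit, then train and cross-validate a random forest.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("z_table.csv")                # hypothetical learning table
    X, y = data.drop(columns="z_activity"), data["z_activity"]

    # Outlier removal based on a linear regression model
    lin = LinearRegression().fit(X, y)
    resid = y - lin.predict(X)
    keep = np.abs(resid) < 3 * resid.std()

    # Random forest regression, validated by cross-validation
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    print(cross_val_score(rf, X[keep], y[keep], cv=5, scoring="r2").mean())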

  13. Dataset on the Human Body as a Signal Propagation Medium

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    J. Ormanis (2024). Dataset on the Human Body as a Signal Propagation Medium [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214496
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    V. Medvedevs
    V. Aristovs
    J. Ormanis
    A. Sevcenko
    A. Elsts
    V. Abolins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternate current sine-shaped signals. The signal frequencies are from 50 kHz to 20 MHz.

    Applications: The intention of this dataset is to allow to investigate the human body as a signal propagation medium, and capture information related to how the properties of the human body (age, sex, composition etc.), the measurement locations, and the signal frequencies impact the signal loss over the human body.

    Overview statistics:

    Number of subjects: 30

    Number of transmitter locations: 6

    Number of receiver locations: 6

    Number of measurement frequencies: 19

    Input voltage: 1 V

    Load resistance: 50 ohm and 1 megaohm

    Measurement group statistics:

    Height: 174.10 (7.15)

    Weight: 72.85 (16.26)

    BMI: 23.94 (4.70)

    Body fat %: 21.53 (7.55)

    Age group: 29.00 (11.25)

    Male/female ratio: 50%

    Included files:

    experiment_protocol_description.docx - protocol used in the experiments

    electrode_placement_schematic.png - schematic of placement locations

    electrode_placement_photo.jpg - visualization on the experiment, on a volunteer subject

    RawData - the full measurement results and experiment info sheets

    all_measurements.csv - the most important results extracted to .csv

    all_measurements_filtered.csv - same, but after z-score filtering

    all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row

    all_measurements_by_freq_filtered.csv - same, but after z-score filtering

    summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets

    process_json_files.py - script that creates .csv from the raw data

    filter_results.py - outlier removal based on z-score (see the sketch after this list)

    plot_sample_curves.py - visualization of a randomly selected measurement result subset

    plot_measurement_group.py - visualization of the measurement group
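
    To give a rough idea of what the z-score filtering step referenced above might look like, the sketch below drops rows whose receiver-gain values deviate more than three standard deviations from the column mean. It is an assumption about the filtering logic, not the dataset's actual filter_results.py.

    # Hedged sketch of z-score based outlier removal over the gain columns of
    # all_measurements_by_freq.csv; an assumption, not the project's script.
    import pandas as pd

    df = pd.read_csv("all_measurements_by_freq.csv")
    gain_cols = [c for c in df.columns if c.startswith("rx_gain_")]

    mask = pd.Series(True, index=df.index)
    for col in gain_cols:
        z = (df[col] - df[col].mean()) / df[col].std()
        mask &= z.abs() < 3                  # keep rows within 3 sigma

    df[mask].to_csv("all_measurements_by_freq_filtered.csv", index=False)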

    CSV file columns:

    subject_id - participant's random unique ID

    experiment_id - measurement session's number for the participant

    height - participant's height, cm

    weight - participant's weight, kg

    BMI - body mass index, computed from the values above

    body_fat_% - body fat composition, as measured by bioimpedance scales

    age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.

    male - 1 if male, 0 if female

    tx_point - transmitter point number

    rx_point - receiver point number

    distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!

    tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.

    rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.

    total_fat_level - sum of rx and tx fat levels

    bias - constant term to simplify data analytics, always equal to 1.0

    CSV file columns, frequency-specific:

    tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py script from the voltage drop

    rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance

    rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance

    Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.

    References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.

    Contact information: info@edi.lv

  14. Surface Meteorological Station - ANL 10m tower, Goldendale - Raw Data

    • catalog.data.gov
    • data.openei.org
    Updated Aug 7, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wind Energy Technologies Office (WETO) (2021). Surface Meteorological Station - ANL 10m tower, Goldendale - Raw Data [Dataset]. https://catalog.data.gov/dataset/sodar-vaisala-triton-wind-profiler-aon7-processed-data
    Explore at:
    Dataset updated
    Aug 7, 2021
    Dataset provided by
    Wind Energy Technologies Office (WETO)
    Description

    Overview: Basic meteorological measurements.

    Data Quality: The Argonne National Laboratory Surface Meteorology Systems (MET) measurements collected at collocated radar wind profiler sites are visually inspected weekly for data outliers or instrument problems. Of note, the surface MET stations have had few data quality issues. The final dataset provided to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process as is used for ARM MET data.

    Uncertainty: The uncertainties of the MET measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturers.

    Constraints: There are no constraints on MET measurements concerning acceptable wind directions or meteorological conditions.

  15. Hydrochemistry analysis of the Galilee subregion

    • researchdata.edu.au
    • demo.dev.magda.io
    Updated Dec 6, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bioregional Assessment Program (2018). Hydrochemistry analysis of the Galilee subregion [Dataset]. https://researchdata.edu.au/hydrochemistry-analysis-galilee-subregion/2991745
    Explore at:
    Dataset updated
    Dec 6, 2018
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Galilee
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains analyses and summaries of hydrochemistry data for the Galilee subregion, and includes an additional quality assurance of the source hydrochemistry and waterlevel data to remove anomalous and outlier values.

    Dataset History

    1. Several bores were removed from the 'chem master sheet' in the QLD Hydrochemistry QA QC GAL v02 (GUID: e3fb6c9b-e224-4d2e-ad11-4bcba882b0af) dataset based on their TDS values. Bores with high or unrealistic TDS that were removed are found at the bottom of the 'updated data' sheet.

    2. Outlier water level values from the JK GAL Bore Waterlevels v01 (GUID: 2f8fe7e6-021f-4070-9f63-aa996b77469d) dataset were identified and removed. Those bores are identified in the 'outliers not used' sheet.

    3. Pivot tables were created to summarise the data and create various histograms for analysis and interpretation. These are found in the 'chemistry histogram', 'Pivot tables', and 'summaries' sheets.

    Dataset Citation

    Bioregional Assessment Programme (2016) Hydrochemistry analysis of the Galilee subregion. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/fd944f9f-14f6-4e20-bb8a-61d1116412ec.

    Dataset Ancestors

  16. PointDenoisingBenchmark Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 3, 2019
    Cite
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov (2019). PointDenoisingBenchmark Dataset [Dataset]. https://paperswithcode.com/dataset/pointcleannet
    Explore at:
    Dataset updated
    Jan 3, 2019
    Authors
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov
    Description

    The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.

    PointDenoisingBenchmark for denoising: contains noisy point clouds with different levels of Gaussian noise and the corresponding clean ground truths. PointDenoisingBenchmark for outlier removal: contains point clouds with different levels of noise and densities of outliers and the corresponding clean ground truths.

  17. Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous...

    • b2find.dkrz.de
    Updated Oct 21, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous record: outliers, v2 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/66235d57-bc41-5c02-b4cb-05318374b889
    Explore at:
    Dataset updated
    Oct 21, 2023
    Area covered
    North Greenland
    Description

    Description and Notes

    Description: Methane concentration from the Greenland NEEM-2011-S1 ice core from 71 to 408 m depth (~270-1961 CE). Methane concentrations analysed online by laser spectrometer (SARA, Spectroscopy by Amplified Resonant Absorption, developed at Laboratoire Interdisciplinaire de Physique, Grenoble, France) on gas extracted from an ice core processed using a continuous melter system (Desert Research Institute). Methane data have a 5 second integration time (raw data acquisition rate 0.6 Hz). Analytical precision, from an Allan variance test, is 0.9 ppb (2 sigma). Long-term reproducibility is 2.6% (2 sigma). Gaps in the record are due to problems during online analysis. Online analysis was conducted August-September 2011.

    Note: The lat-long provided is for the main NEEM borehole. The NEEM-2011-S1 core was drilled 200 m away in 2011 to 410 m depth. Methane concentrations are reported on the NOAA2004 scale (instrument calibrated on dry synthetic air standards). A correction factor of 1.079 has been applied to all data to correct for methane dissolution in the melted ice core sample prior to gas extraction. The correction factor was calculated using empirical data (concentrations not aligned/tied to existing discrete methane measurements).

    Additional methods description provided in: Stowasser, C., Buizert, C., Gkinis, V., Chappellaz, J., Schupbach, S., Bigler, M., Fain, X., Sperlich, P., Baumgartner, M., Schilt, A., Blunier, T., 2012. Continuous measurements of methane mixing ratios from ice cores. Atmos. Meas. Tech. 5, 999-1013. Morville, J., Kassi, S., Chenevier, M., Romanini, D., 2005. Fast, low-noise, mode-by-mode, cavity-enhanced absorption spectroscopy by diode-laser self-locking. Appl. Phys. B Lasers Opt. 80, 1027-1038.

    NEEM (North Greenland Eemian Ice Drilling) project information: http://neem.dk/

    NEEM-2011-S1 CH4 outliers: Data points were removed from the dataset according to a specified cut-off value. Please refer to Rhodes et al. (2013) for a full discussion of the origins of the outlying data points. Briefly, these high-frequency features are not artifacts of the continuous method and have been replicated by traditional discrete analyses. Comparison to chemistry measurements suggests they are related to biological in situ production of methane.

  18. Surface Meteorological Station - ANL 80m, Sonic, Physics site-12 - Reviewed...

    • datasets.ai
    • data.openei.org
    • +2more
    0
    Updated Sep 9, 2024
    + more versions
    Cite
    Department of Energy (2024). Surface Meteorological Station - ANL 80m, Sonic, Physics site-12 - Reviewed Data [Dataset]. https://datasets.ai/datasets/surface-meteorological-station-anl-10m-1-sonics-1-ebbr-physics-site-3-raw-data
    Explore at:
    Available download formats: 0
    Dataset updated
    Sep 9, 2024
    Dataset authored and provided by
    Department of Energy
    Description

    Overview

    Measurements of surface sensible heat flux, momentum flux, wind components, and virtual temperature.

    Data Details

    • X (column 1) is the wind component in cm/s, positive toward north.
    • Y (column 2) is the wind component in cm/s, positive toward east.
    • Z (column 3) is the wind component in cm/s, positive upward.
    • T (column 4) is the sonic virtual temperature in degrees C × 100.
    • hh:mm:ss is the data collection time in UTC.

    Data Quality

    The Argonne National Laboratory sonic anemometer measurements are visually inspected weekly for data outliers or instrument problems. The final dataset sent to the DAP will have all outliers or problematic data removed using automated and visual processes, including minimum/maximum checks, in a similar process used for Atmospheric Radiation Measurement (ARM) program eddy correlation (ECOR) data.

    Uncertainty

    The uncertainties of the basic sonic anemometer measurements are taken to be the accuracy of the individual measurements as specified by the instrument manufacturer. Based on historical experience with this measurement technique, flux measurement uncertainty is +/- 10 percent, although the uncertainty can be much greater during stable atmospheric conditions when turbulence intensity and atmospheric gradients are small and advection from beyond the normal fetch can occur. In particular, the Physics Site-12 tower's sonic anemometer measurements can have greater uncertainty when the wind blows through the tower structure.

    Constraints

    During stable atmospheric conditions, turbulence intensity and atmospheric gradients often are small, approaching or exceeding the measurement resolution of sonic anemometers. Under these conditions, advection from beyond the normal fetch also can occur, making interpretation of the fluxes difficult. Notably, the Physics Site-12 tower sonic anemometer measurements can be affected when the wind blows through the tower structure. Some unusual biases in the vertical velocities measured at the Physics Site-12 tower with west wind conditions also have not been adequately explained.

  19. One-hour averaged temperature and salinity timeseries from bottom-moored...

    • polar.cenagis.edu.pl
    Updated Feb 27, 2025
    + more versions
    Cite
    (2025). One-hour averaged temperature and salinity timeseries from bottom-moored instruments in Hornsund fjord - Dataset - POLAR-PL Catalog [Dataset]. https://polar.cenagis.edu.pl/dataset/temp_sal_one_hour_averaged_mooring_data
    Explore at:
    Dataset updated
    Feb 27, 2025
    Area covered
    Hornsund
    Description

    Dataset consisting of continuous measurements of temperature and salinity from bottom-moored instruments. The measurements are obtained at fixed depths at varying locations across the Hornsund fjord. All data are averaged to one-hour intervals.

    The files are named with the mooring ID, consisting of the measured parameters (CTD for Conductivity-Temperature-Depth, TD for Temperature-Depth, T for Temperature) and a running number, followed by the deployment and recovery dates (format YYYYMMDD) as well as the stage of data processing (for this dataset, "hourly"). However, when observations are made at one of the stations included in the CTD monitoring program, the station name is used instead of a mooring ID.

    The header in each file consists of 10 lines and includes information on geographical location (decimal degrees), deployment and recovery dates (YYYY-MM-DDThh:mm:ss), bottom and instrument depths, the equipment used for measurements, and the source of financial support. There are 4-7 data columns. For the T and TD moorings, the columns are Date/Time (YYYY-MM-DDThh:mm:ss), Pressure (dbar), Depth (m) and Temperature (°C). For the moorings without a pressure sensor (only T), the pressure and depth columns are marked as NaN and the average instrument depth can be found in the header. The CTD moorings include additional columns for Potential temperature (°C), Practical salinity and Density represented as Sigma-Theta (kg/m**3). Conductivity is not included in this dataset, but can be found in the raw data.

    Suspicious data and outliers are detected and removed, and the data are smoothed. No interpolation is performed and missing data are marked with NaN. The data columns are tab-delimited and the data are stored in ASCII-formatted .txt files.
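
    A minimal sketch of reading one of these files with pandas, assuming the 10-line metadata header described above is followed by a tab-delimited column-name row (the file name below is only an example of the stated naming scheme):

    # Read a tab-delimited mooring file, skipping the 10-line metadata header;
    # missing values are already marked as NaN in the files.
    import pandas as pd

    fname = "CTD1_20200801_20210715_hourly.txt"   # hypothetical file name
    df = pd.read_csv(fname, sep="\t", skiprows=10, na_values="NaN")
    df["Date/Time"] = pd.to_datetime(df["Date/Time"])   # column label as described above
    print(df.head())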

  20. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, csv
    Updated Jul 11, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed itself is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithm to trace back anomalies to the right root cause channel.
    • Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect for human eyes (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting these obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
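
    A short sketch of how the nominal/anomalous split described above could be used, assuming the data has been exported to a single CSV (the file name is an assumption, not part of the dataset documentation):

    # Split the CATS series into the first 1 million nominal observations
    # (for learning "normal" behaviour) and the remaining 4 million mixed
    # observations (for evaluation). The file name is an assumption.
    import pandas as pd

    df = pd.read_csv("cats.csv")
    train = df.iloc[:1_000_000]    # nominal-only prefix, per the description
    test = df.iloc[1_000_000:]     # contains the 200 anomalous segments
    print(len(train), len(test))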

    Change Log

    Version 2

    • Metadata: we include a metadata.csv with information about:
      • Anomaly categories
      • Root cause channel (signal in which the anomaly is first visible)
      • Affected channel (signal in which the anomaly might propagate) through coupled system dynamics
    • Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.
    • Two data files: CSV and parquet for convenience.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
