28 datasets found
  1. Data from: Valid Inference Corrected for Outlier Removal

    • figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v1
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if it were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
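    The naive "detect-and-forget" pipeline that this abstract critiques is easy to state concretely. The sketch below implements only that naive procedure, not the paper's selective-inference correction (the outference R package implements the latter); the residual cut-off rule and all variable names are illustrative.

    ```python
    # Illustrative sketch of the naive "detect-and-forget" pipeline:
    # fit OLS, drop large-residual points, refit, and form a naive CI
    # as if the kept points were the originally collected sample.
    import numpy as np

    def detect_and_forget_ols(x, y, z_cut=3.0):
        """Fit OLS, drop observations whose residual exceeds z_cut residual
        standard deviations, refit, and return (slope, naive 95% CI, keep mask)."""
        X = np.column_stack([np.ones_like(x), x])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        keep = np.abs(resid) <= z_cut * resid.std(ddof=2)  # crude outlier rule

        Xk, yk = X[keep], y[keep]
        beta_k = np.linalg.lstsq(Xk, yk, rcond=None)[0]
        resid_k = yk - Xk @ beta_k
        sigma2 = resid_k @ resid_k / (len(yk) - 2)
        se = np.sqrt(sigma2 * np.linalg.inv(Xk.T @ Xk)[1, 1])
        # Naive interval: it ignores that "keep" was chosen by looking at the
        # data, which is exactly the selection effect the paper corrects for.
        return beta_k[1], (beta_k[1] - 1.96 * se, beta_k[1] + 1.96 * se), keep

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(size=200)
    y[:5] += 15.0                          # plant a few gross outliers
    slope, ci, keep = detect_and_forget_ols(x, y)
    ```

    On data with planted gross outliers, the refit interval looks reassuringly narrow precisely because it ignores how the kept sample was chosen, which is the invalidity the paper addresses.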

  2. Outlier removal, sum scores, and the inflation of the Type I error rate

    • osf.io
    Updated Sep 20, 2016
    Cite
    Marjan Bakker; Jelte Wicherts (2016). Outlier removal, sum scores, and the inflation of the Type I error rate [Dataset]. https://osf.io/95xqz
    Dataset updated
    Sep 20, 2016
    Dataset provided by
    Center for Open Science, https://cos.io/
    Authors
    Marjan Bakker; Jelte Wicherts
    Description

    No description was included in this dataset, which was collected from the OSF.

  3. Data for Filtering Organized 3D Point Clouds for Bin Picking Applications

    • datasets.ai
    • catalog.data.gov
    Updated Aug 6, 2024
    Cite
    National Institute of Standards and Technology (2024). Data for Filtering Organized 3D Point Clouds for Bin Picking Applications [Dataset]. https://datasets.ai/datasets/data-for-filtering-organized-3d-point-clouds-for-bin-picking-applications
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology, http://www.nist.gov/
    Description

    Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured-light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are provided. The 3D data contain many outliers, and the data were used to demonstrate a new filtering technique.
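    The specific filtering technique demonstrated with this data is not described in the listing. As a generic baseline for scans like these, a statistical outlier-removal filter drops points whose mean distance to their k nearest neighbours is far from the cloud-wide average. A minimal brute-force NumPy sketch, with illustrative parameters:

    ```python
    # Generic statistical outlier removal for a small point cloud
    # (brute-force pairwise distances; fine for a few thousand points).
    import numpy as np

    def remove_statistical_outliers(points, k=8, std_ratio=2.0):
        """Keep points whose mean k-NN distance is within std_ratio standard
        deviations of the cloud-wide average; return (kept_points, mask)."""
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        knn = np.sort(d, axis=1)[:, 1:k + 1]     # skip self-distance (0)
        mean_d = knn.mean(axis=1)
        mask = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
        return points[mask], mask

    rng = np.random.default_rng(1)
    cloud = rng.normal(scale=0.05, size=(300, 3))  # dense surface patch
    cloud[:4] += 5.0                               # four far-away outliers
    clean, mask = remove_statistical_outliers(cloud)
    ```

    Libraries such as Open3D ship an equivalent filter; the point here is only the idea of thresholding the mean neighbour distance.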

  4. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1more
    Updated Jul 29, 2024
    + more versions
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. https://search.dataone.org/view/sha256%3A08484b821e24ce46dbeb405a81e84d7457a8726456522e23d340739f2ff809ae
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer (NOWAC) study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the NOWAC study on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  5. Number of statistics, number of errors, number of large errors, and number...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Marjan Bakker; Jelte M. Wicherts (2023). Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers. [Dataset]. http://doi.org/10.1371/journal.pone.0103360.t004
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Marjan Bakker; Jelte M. Wicherts
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers.

  6. Timings and statistical data of point model by our method.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang (2023). Timings and statistical data of point model by our method. [Dataset]. http://doi.org/10.1371/journal.pone.0201280.t001
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timings and statistical data of point model by our method.

  7. Performance analysis of our algorithm on 3D models.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Performance analysis of our algorithm on 3D models. [Dataset]. https://plos.figshare.com/articles/dataset/Performance_analysis_of_our_algorithm_on_3D_models_/6918089
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance analysis of our algorithm on 3D models.

  8. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank, http://worldbank.org/
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected within each EA for interview. The large-module households were used for the official VHFPS interviews, with the small-module households held in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 5. Employment (main respondent)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers' notes following each question item, interviewers' notes at the end of the tablet form, and supervisors' notes made during monitoring. The data cleaning process was conducted in the following steps:
    • Append households interviewed in ethnic minority languages to the main dataset of interviews conducted in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations for households that should not have been interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers' notes and adjust the data accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and record the respondent's answer in detail, so that the survey management team could decide which code best suited the answer.
    • Correct data based on supervisors' notes where enumerators entered a wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line in which enumerators type or write text specifying the answer. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or could be kept as originally recorded. In some cases an answer was assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th-95th percentile range, by listening to interview recordings.
    • Perform a final check on matching the main dataset with the different sections; information asked at the individual level is kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
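    Two of the cleaning steps above, duplicate-form removal and flagging values outside the 5th-95th percentile range for review, can be sketched in pandas; the column names are illustrative, not taken from the survey:

    ```python
    # Illustrative sketch of two cleaning steps: drop duplicate form
    # submissions, then flag percentile-range outliers for manual review.
    import pandas as pd

    def clean_round(df, value_col="income", id_col="household_id"):
        """Drop duplicate submissions per household, then flag values
        outside the 5th-95th percentile range of value_col."""
        df = df.drop_duplicates(subset=id_col, keep="first").copy()
        lo, hi = df[value_col].quantile([0.05, 0.95])
        df["review_flag"] = (df[value_col] < lo) | (df[value_col] > hi)
        return df

    raw = pd.DataFrame({
        "household_id": [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "income": [100, 100, 120, 90, 110, 95, 105, 115, 98, 5000, 102],
    })
    cleaned = clean_round(raw)
    ```

    In the survey's workflow the flagged values were not dropped automatically; they were checked against interview recordings before any correction.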

  9. 11: Streamwater sample constituent concentration outliers from 15 watersheds...

    • s.cnmilf.com
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/11-streamwater-sample-constituent-concentration-outliers-from-15-watersheds-in-gwinne-2003
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Area covered
    Georgia, Gwinnett County
    Description

    This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents in streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia, for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.

  10. PointDenoisingBenchmark Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 3, 2019
    Cite
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov (2019). PointDenoisingBenchmark Dataset [Dataset]. https://paperswithcode.com/dataset/pointcleannet
    Dataset updated
    Jan 3, 2019
    Authors
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov
    Description

    The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.

    PointDenoisingBenchmark for denoising: contains noisy point clouds with different levels of Gaussian noise and the corresponding clean ground truths. PointDenoisingBenchmark for outlier removal: contains point clouds with different levels of noise and density of outliers and the corresponding clean ground truths.

  11. Stream water-quality summary statistics and outliers, streamwater load...

    • catalog.data.gov
    • search.dataone.org
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Stream water-quality summary statistics and outliers, streamwater load models and yield estimates, and peak flow modeling parameters for 13 watersheds in Gwinnett County, Georgia [Dataset]. https://catalog.data.gov/dataset/stream-water-quality-summary-statistics-and-outliers-streamwater-load-models-and-yield-est
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Area covered
    Gwinnett County
    Description

    This data release includes the following five data tables: (1) water-quality constituent outliers that were removed from the calibration of regression models used to estimate streamwater solute loads, (2) parameters used to model peak streamflow recurrence intervals, (3) models used to estimate streamwater constituent loads, (4) statistical summaries of water-quality observations, and (5) estimated annual streamwater constituent yields. An associated metadata file is included for each of the five data tables.

  12. AT_2003_BACI_1

    • search.dataone.org
    Updated Oct 14, 2013
    Cite
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne (2013). AT_2003_BACI_1 [Dataset]. https://search.dataone.org/view/knb-lter-bes.349.570
    Dataset updated
    Oct 14, 2013
    Dataset provided by
    Long Term Ecological Research Network, http://www.lternet.edu/
    Authors
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne
    Time period covered
    Jan 1, 2004 - Nov 17, 2011
    Area covered
    Description

    MD Property View 2003 A and T Database. For more information on the A and T Database, refer to the enclosed documentation. This layer was edited to remove spatial outliers in the A and T Database. Spatial outliers are points that were not geocoded and as a result fell outside the Baltimore City boundary; 416 spatial outliers were removed from this layer. The field BLOCKLOT2 can be used to join this layer with the Baltimore City parcel layer. This is part of a collection of 221 Baltimore Ecosystem Study metadata records that point to a geodatabase. The geodatabase is available online and is considerably large; upon request, and under certain arrangements, it can be shipped on media such as a USB hard drive. The geodatabase is roughly 51.4 GB in size, consisting of 4,914 files in 160 folders. Although this metadata record and the others like it are not rich with attributes, it is nonetheless made available because the data it represents could indeed be useful.

  13. Data from: Pacman profiling: a simple procedure to identify stratigraphic...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Jul 8, 2011
    Cite
    David Lazarus; Manuel Weinkauf; Patrick Diver (2011). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Dataset updated
    Jul 8, 2011
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Marine, Global
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
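    The core Pacman step, trimming a calibrated fraction of each species' youngest and oldest occurrences as suspected outliers, can be sketched as follows. The trim fractions and example ages here are illustrative; in the method itself the fractions are calibrated against biostratigraphic marker species with well-documented ranges.

    ```python
    # Illustrative sketch of the Pacman trimming step for one species:
    # mark the youngest top_frac and oldest bottom_frac of occurrence
    # ages as suspected outliers.
    import numpy as np

    def pacman_trim(ages, top_frac=0.05, bottom_frac=0.05):
        """Given occurrence ages (Ma) for one species, return a boolean
        mask: True = keep, False = trimmed as a suspected outlier."""
        ages = np.asarray(ages, dtype=float)
        order = np.argsort(ages)              # youngest first
        n = len(ages)
        n_top = int(np.floor(top_frac * n))
        n_bot = int(np.floor(bottom_frac * n))
        keep = np.ones(n, dtype=bool)
        if n_top:
            keep[order[:n_top]] = False       # youngest occurrences
        if n_bot:
            keep[order[n - n_bot:]] = False   # oldest occurrences
        return keep

    # 40 in-range occurrences plus two displaced ones (reworking / age-model error)
    ages = np.concatenate([np.linspace(10, 20, 40), [2.0, 35.0]])
    keep = pacman_trim(ages, top_frac=0.05, bottom_frac=0.05)
    ```

    The trimmed occurrences would then be tallied per sample and section to build the outlier-frequency profiles the abstract describes.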

  14. Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 4, 2021
    Cite
    Ljubisa Miskovic; Ljubisa Miskovic; Jonas Béal; Michael Moret; Vassily Hatzimanikatis; Vassily Hatzimanikatis; Jonas Béal; Michael Moret (2021). Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties [Dataset]. http://doi.org/10.5281/zenodo.3240300
    Dataset updated
    Feb 4, 2021
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Ljubisa Miskovic; Jonas Béal; Michael Moret; Vassily Hatzimanikatis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data needed to reproduce the results from the manuscript "Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties" by L. Miskovic, J. Beal, M. Moret, and V. Hatzimanikatis.

    1. Data generated with the ORACLE workflow that was used in the iSCHRUNK training:

    • Classification label vectors for the three analyzed metabolic concentration cases:
      • Reference case: class_vector_train_ref.mat
      • Extreme1 case: class_vector_train_ex1.mat
      • Extreme2 case: class_vector_train_ex2.mat
    • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
      • Reference case: training_set_ref.mat
      • Extreme1 case: training_set_ex1.mat
      • Extreme2 case: training_set_ex2.mat
    • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
      • Reference case: ccXTR_ref.mat
      • Extreme1 case: ccXTR_ex1.mat
      • Extreme2 case: ccXTR_ex2.mat
    • Thermodynamics-based Flux Analysis (TFA) models for the three cases:
      • Reference case: tfa_ref.mat
      • Extreme1 case: tfa_ex1.mat
      • Extreme2 case: tfa_ex2.mat

    2. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 4).

    • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
      • ccXTR_ValidNeg.mat
    • Parameter sets used in validation
      • validation_set_neg.mat

    3. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Table 3).

    • Negative control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_ValidRef_neg_agg.mat
        • Extreme1 case: ccXTR_ValidEx1_neg_agg.mat
        • Extreme2 case: ccXTR_ValidEx2_neg_agg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_ref_neg_agg.mat
        • Extreme1 case: validation_set_ref_neg_agg.mat
        • Extreme2 case: tvalidation_set_ref_neg_agg.mat
    • Positive control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_ValidRef_pos_agg.mat
        • Extreme1 case: ccXTR_ValidEx1_pos_agg.mat
        • Extreme2 case: ccXTR_ValidEx2_pos_agg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_ref_pos_agg.mat
        • Extreme1 case: validation_set_ex1_pos_agg.mat
        • Extreme2 case: validation_set_ex2_pos_agg.mat

    4. Reassignment study: validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 6 and Table 4).

    • Negative control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_Valid_reassignment_neg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_neg_reassignment.mat
    • Positive control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_Valid_reassignment_pos.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_pos_reassignment.mat

  15. Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous...

    • b2find.dkrz.de
    Updated Apr 27, 2023
    + more versions
    Cite
    (2023). Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous record: outliers removed, v2 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/5b01a790-eb9a-51fc-a1f2-56f4b676c2ca
    Dataset updated
    Apr 27, 2023
    Area covered
    North Greenland
    Description

    Description: Methane concentration from the Greenland NEEM-2011-S1 ice core from 71 to 408 m depth (~270-1961 CE). Methane concentrations were analysed online by laser spectrometer (SARA, Spectroscopy by Amplified Resonant Absorption, developed at Laboratoire Interdisciplinaire de Physique, Grenoble, France) on gas extracted from an ice core processed using a continuous melter system (Desert Research Institute). Methane data have a 5-second integration time (raw data acquisition rate 0.6 Hz). Analytical precision, from an Allan variance test, is 0.9 ppb (2 sigma). Long-term reproducibility is 2.6% (2 sigma). Gaps in the record are due to problems during online analysis. Online analysis was conducted August-September 2011.

    Notes: The lat-long provided is for the main NEEM borehole. The NEEM-2011-S1 core was drilled 200 m away in 2011, to 410 m depth. Methane concentrations are reported on the NOAA2004 scale (instrument calibrated on dry synthetic air standards). A correction factor of 1.079 has been applied to all data to correct for methane dissolution in the melted ice core sample prior to gas extraction. The correction factor was calculated using empirical data (concentrations not aligned/tied to existing discrete methane measurements).

    Additional methods description is provided in: Stowasser, C., Buizert, C., Gkinis, V., Chappellaz, J., Schupbach, S., Bigler, M., Fain, X., Sperlich, P., Baumgartner, M., Schilt, A., Blunier, T., 2012. Continuous measurements of methane mixing ratios from ice cores. Atmos. Meas. Tech. 5, 999-1013. Morville, J., Kassi, S., Chenevier, M., Romanini, D., 2005. Fast, low-noise, mode-by-mode, cavity-enhanced absorption spectroscopy by diode-laser self-locking. Appl. Phys. B Lasers Opt. 80, 1027-1038.

    NEEM (North Greenland Eemian Ice Drilling) project information: http://neem.dk/

    NEEM-2011-S1 CH4 no outliers: data minus data points exceeding the cut-off value. The cut-off value is 2 x the median absolute deviation (MAD) from a 15-yr running median. Different MAD values were used for the 250-1000 AD and 1100-1835 AD sections of the record.
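    The outlier rule described above, removing points that deviate from a 15-yr running median by more than 2 x MAD, can be sketched as follows. This simplified version uses a single global MAD and a window measured in samples, whereas the record used different MAD values for its two sections and a window defined in years; the synthetic series is illustrative only.

    ```python
    # Illustrative running-median / MAD outlier filter for a 1-D series.
    import numpy as np

    def running_median_mad_filter(values, window=15, n_mad=2.0):
        """Return a keep-mask: True where |value - running median| <= n_mad * MAD,
        with MAD the median absolute deviation from the running median."""
        values = np.asarray(values, dtype=float)
        half = window // 2
        padded = np.pad(values, half, mode="edge")
        run_med = np.array([np.median(padded[i:i + window])
                            for i in range(len(values))])
        dev = np.abs(values - run_med)
        mad = np.median(dev)                 # global MAD (per-section in the record)
        return dev <= n_mad * mad

    rng = np.random.default_rng(2)
    # Synthetic smooth "record" with measurement noise and two planted spikes
    ch4 = 700.0 + 10.0 * np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 2, 100)
    ch4[[20, 60]] += 80.0
    keep = running_median_mad_filter(ch4)
    ```

    Because the running median is robust, the planted spikes barely perturb it and are flagged cleanly, while the smooth background passes through.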

  16. CAMA_2003_BACI_1

    • search.dataone.org
    Updated Oct 14, 2013
    Cite
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne (2013). CAMA_2003_BACI_1 [Dataset]. https://search.dataone.org/view/knb-lter-bes.363.570
    Dataset updated
    Oct 14, 2013
    Dataset provided by
    Long Term Ecological Research Network, http://www.lternet.edu/
    Authors
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne
    Time period covered
    Jan 1, 2004 - Nov 17, 2011
    Area covered
    Description

    MD Property View 2003 CAMA Database. For more information on the CAMA Database, refer to the enclosed documentation. This layer was edited to remove spatial outliers in the CAMA Database. Spatial outliers are points that were not geocoded and as a result fell outside the Baltimore City boundary; 254 spatial outliers were removed from this layer. This is part of a collection of 221 Baltimore Ecosystem Study metadata records that point to a geodatabase. The geodatabase is available online and is considerably large; upon request, and under certain arrangements, it can be shipped on media such as a USB hard drive. The geodatabase is roughly 51.4 GB in size, consisting of 4,914 files in 160 folders. Although this metadata record and the others like it are not rich with attributes, it is nonetheless made available because the data it represents could indeed be useful.

  17. Data from: Male responses to sperm competition risk when rivals vary in...

    • researchdata.edu.au
    • data.niaid.nih.gov
    • +2more
    Updated 2019
    Cite
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
    Dataset updated
    2019
    Dataset provided by
    The University of Western Australia
    DRYAD
    Authors
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
    Description

    Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

    Files included:
    • Lymbery et al 2018 Full dataset (Lymbery et al Full Dataset.xlsx): all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one with the data and one with a legend for the column titles.
    • Lymbery et al 2018 Reduced dataset 1 (Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx): the data following the removal of three outliers for the purposes of data distribution, as described in the associated R code. Same two-spreadsheet layout.
    • Lymbery et al 2018 Reduced dataset 2 (Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx): the data after the removal of all outliers stated in the manuscript and associated R code. Same two-spreadsheet layout.
    • Lymbery et al 2018 R Script: all the R code used for the statistical analyses, with annotations to aid interpretation.

  18. Predictive Validity Data Set

    • figshare.com
    txt
    Updated Dec 18, 2022
    Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Antonio Abeyta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed because of a lack of GRE scores: thirty-nine of these belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission, and the remaining fifty-seven did not have an admissions committee score in the database. After 2011, the GRE's scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were on the new scale and could not be compared with the older scores on a raw-score basis. After removal of these 108 records from our analyses, a total of 420 student records remained, covering students who were currently enrolled, had left the doctoral program without a degree, or had left the doctoral program with an MS degree. To maintain consistency among participants, we removed 100 additional records so that our analyses considered only students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below).

    Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers that could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers, which were removed before statistical analysis was performed.

    Sample. See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes for selected student groups. The D'Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test outcomes in the sample for normality. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam; a Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between these two groups. Other variables, such as gender, race, ethnicity, and citizenship status, were also observed within the samples.

    Predictive Metrics. The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.

    Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took each student to earn their doctoral degree, or the student's candidacy examination result.
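    The study ran these tests in GraphPad PRISM. As an illustrative sketch only, the same kinds of tests (Pearson correlation, Mann-Whitney U, Shapiro-Wilk normality) can be reproduced with SciPy; all data below are synthetic stand-ins, not the study's records.

```python
# Illustrative only: synthetic data mimicking the study's analyses with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gre_quant = rng.normal(160, 5, 50)  # hypothetical GRE Quantitative scores
# Hypothetical outcome: time to degree, weakly related to GRE score plus noise
years_to_degree = 5.5 - 0.01 * (gre_quant - 160) + rng.normal(0, 0.8, 50)

# Pearson correlation between GRE score and time to degree
r, p = stats.pearsonr(gre_quant, years_to_degree)

# Mann-Whitney U test: GRE scores of students who passed vs. failed candidacy
passed = rng.normal(161, 5, 30)
failed = rng.normal(158, 5, 20)
u, p_mw = stats.mannwhitneyu(passed, failed)

# Shapiro-Wilk normality test on the outcome variable
w, p_sw = stats.shapiro(years_to_degree)

print(f"Pearson r={r:.2f}, Mann-Whitney U={u:.0f}, Shapiro-Wilk W={w:.3f}")
```

    PRISM's ROUT outlier method has no direct SciPy equivalent; it combines robust regression with an FDR-controlled outlier test, which is why the description reports a Q (false discovery rate) setting rather than a fixed cutoff.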

  19. f

    Pearson correlations (r) between siblings for Eyes scores and Eyes scores...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Gillian Ragsdale; Robert A. Foley (2023). Pearson correlations (r) between siblings for Eyes scores and Eyes scores adjusted by removing the low-scoring outliers (Eyes Adj >17). [Dataset]. http://doi.org/10.1371/journal.pone.0023236.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Gillian Ragsdale; Robert A. Foley
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ** Correlation is significant at the 0.01 level (2-tailed).
    * Correlation is significant at the 0.05 level (2-tailed).
    ' Correlation is significant at the 0.1 level (2-tailed).
    For each model, the two categories of sibling pairs are derived from Table 2. In each case, a possible fit (in bold) is indicated by the second correlation being less than the first.

  20. d

    TreeShrink: fast and accurate detection of outlier long branches in...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Nov 30, 2023
    Siavash Mirarab; Uyen Mai (2023). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees [Dataset]. http://doi.org/10.6076/D1HC71
    Explore at:
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Siavash Mirarab; Uyen Mai
    Time period covered
    Jan 1, 2023
    Description

    Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this "k-shrink" problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.
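    The "k-shrink" idea can be illustrated, far more naively than the paper's polynomial-time algorithm, by brute-force checking which single leaf's removal shrinks the tree diameter the most. The toy tree and taxon names below are made up; the sketch uses an unweighted adjacency map, whereas the real method works on branch-length-weighted phylogenies.

```python
# Toy sketch (not TreeShrink's algorithm): find the leaf whose removal most
# reduces the diameter of a small unweighted tree. Taxon "X" sits at the end
# of a long pendant chain and so inflates the diameter.

def diameter(adj):
    """Longest shortest-path distance between any two nodes (BFS from each node)."""
    def bfs(src):
        dist = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        return max(dist.values())
    return max(bfs(n) for n in adj)

def drop_leaf(adj, leaf):
    """Return a copy of the tree with one leaf (and edges to it) removed."""
    return {u: [v for v in vs if v != leaf] for u, vs in adj.items() if u != leaf}

def worst_leaf(adj):
    """Leaf whose removal yields the smallest remaining diameter."""
    leaves = [n for n, vs in adj.items() if len(vs) == 1]
    return min(leaves, key=lambda lf: diameter(drop_leaf(adj, lf)))

# Hypothetical tree: taxa A, B, C near the root r, plus a long chain to taxon X.
adj = {
    "A": ["r"], "B": ["r"], "C": ["r"],
    "r": ["A", "B", "C", "x1"],
    "x1": ["r", "x2"], "x2": ["x1", "X"], "X": ["x2"],
}
print(diameter(adj), worst_leaf(adj))  # diameter 4; removing "X" shrinks it most
```

    TreeShrink generalizes this to the best set of k leaves for every k in polynomial time, then uses non-parametric statistics across many gene trees to decide which removals are outliers rather than fixing k by hand.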
