28 datasets found
  1. Data from: Valid Inference Corrected for Outlier Removal

    • figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v1
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if it were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
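    The naive "detect-and-forget" pipeline that this abstract critiques is easy to state concretely. The sketch below implements only that naive procedure, not the paper's selective-inference correction (the outference R package implements the latter); the residual cut-off rule and all variable names are illustrative.

    ```python
    # Illustrative sketch of the naive "detect-and-forget" pipeline:
    # fit OLS, drop large-residual points, refit, and form a naive CI
    # as if the kept points were the originally collected sample.
    import numpy as np

    def detect_and_forget_ols(x, y, z_cut=3.0):
        """Fit OLS, drop observations whose residual exceeds z_cut residual
        standard deviations, refit, and return (slope, naive 95% CI, keep mask)."""
        X = np.column_stack([np.ones_like(x), x])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        keep = np.abs(resid) <= z_cut * resid.std(ddof=2)  # crude outlier rule

        Xk, yk = X[keep], y[keep]
        beta_k = np.linalg.lstsq(Xk, yk, rcond=None)[0]
        resid_k = yk - Xk @ beta_k
        sigma2 = resid_k @ resid_k / (len(yk) - 2)
        se = np.sqrt(sigma2 * np.linalg.inv(Xk.T @ Xk)[1, 1])
        # Naive interval: it ignores that "keep" was chosen by looking at the
        # data, which is exactly the selection effect the paper corrects for.
        return beta_k[1], (beta_k[1] - 1.96 * se, beta_k[1] + 1.96 * se), keep

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(size=200)
    y[:5] += 15.0                          # plant a few gross outliers
    slope, ci, keep = detect_and_forget_ols(x, y)
    ```

    On data with planted gross outliers, the refit interval looks reassuringly narrow precisely because it ignores how the kept sample was chosen, which is the invalidity the paper addresses.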

  2. Outlier removal, sum scores, and the inflation of the Type I error rate

    • osf.io
    Updated Sep 20, 2016
    Cite
    Marjan Bakker; Jelte Wicherts (2016). Outlier removal, sum scores, and the inflation of the Type I error rate [Dataset]. https://osf.io/95xqz
    Dataset updated
    Sep 20, 2016
    Dataset provided by
    Center for Open Science, https://cos.io/
    Authors
    Marjan Bakker; Jelte Wicherts
    Description

    No description was included in this dataset, which was collected from the OSF.

  3. Data for Filtering Organized 3D Point Clouds for Bin Picking Applications

    • datasets.ai
    • catalog.data.gov
    Updated Aug 6, 2024
    Cite
    National Institute of Standards and Technology (2024). Data for Filtering Organized 3D Point Clouds for Bin Picking Applications [Dataset]. https://datasets.ai/datasets/data-for-filtering-organized-3d-point-clouds-for-bin-picking-applications
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology, http://www.nist.gov/
    Description

    Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured-light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are provided. The 3D data contain many outliers, and the data were used to demonstrate a new filtering technique.
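    The specific filtering technique demonstrated with this data is not described in the listing. As a generic baseline for scans like these, a statistical outlier-removal filter drops points whose mean distance to their k nearest neighbours is far from the cloud-wide average. A minimal brute-force NumPy sketch, with illustrative parameters:

    ```python
    # Generic statistical outlier removal for a small point cloud
    # (brute-force pairwise distances; fine for a few thousand points).
    import numpy as np

    def remove_statistical_outliers(points, k=8, std_ratio=2.0):
        """Keep points whose mean k-NN distance is within std_ratio standard
        deviations of the cloud-wide average; return (kept_points, mask)."""
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        knn = np.sort(d, axis=1)[:, 1:k + 1]     # skip self-distance (0)
        mean_d = knn.mean(axis=1)
        mask = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
        return points[mask], mask

    rng = np.random.default_rng(1)
    cloud = rng.normal(scale=0.05, size=(300, 3))  # dense surface patch
    cloud[:4] += 5.0                               # four far-away outliers
    clean, mask = remove_statistical_outliers(cloud)
    ```

    Libraries such as Open3D ship an equivalent filter; the point here is only the idea of thresholding the mean neighbour distance.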

  4. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1more
    Updated Jul 29, 2024
    + more versions
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. https://search.dataone.org/view/sha256%3A08484b821e24ce46dbeb405a81e84d7457a8726456522e23d340739f2ff809ae
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer (NOWAC) study. It is supporting information for our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the NOWAC study on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  5. Number of statistics, number of errors, number of large errors, and number...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Marjan Bakker; Jelte M. Wicherts (2023). Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers. [Dataset]. http://doi.org/10.1371/journal.pone.0103360.t004
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Marjan Bakker; Jelte M. Wicherts
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of statistics, number of errors, number of large errors, and number of gross errors for each journal separately for articles in which outliers were removed and for articles that did not report any removal of outliers.

  6. Timings and statistical data of point model by our method.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang (2023). Timings and statistical data of point model by our method. [Dataset]. http://doi.org/10.1371/journal.pone.0201280.t001
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timings and statistical data of point model by our method.

  7. Performance analysis of our algorithm on 3D models.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Performance analysis of our algorithm on 3D models. [Dataset]. https://plos.figshare.com/articles/dataset/Performance_analysis_of_our_algorithm_on_3D_models_/6918089
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance analysis of our algorithm on 3D models.

  8. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank, http://worldbank.org/
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected within each EA for interview. The large-module households were used for the official VHFPS interviews, with the small-module households held in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 5. Employment (main respondent)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers' notes following each question item, interviewers' notes at the end of the tablet form, and supervisors' notes made during monitoring. The data cleaning process was conducted in the following steps:
    • Append households interviewed in ethnic minority languages to the main dataset of interviews conducted in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations for households that should not have been interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers' notes and adjust the data accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and record the respondent's answer in detail, so that the survey management team could decide which code best suited the answer.
    • Correct data based on supervisors' notes where enumerators entered a wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line in which enumerators type or write text specifying the answer. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or could be kept as originally recorded. In some cases an answer was assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th-95th percentile range, by listening to interview recordings.
    • Perform a final check on matching the main dataset with the different sections; information asked at the individual level is kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
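    Two of the cleaning steps above, duplicate-form removal and flagging values outside the 5th-95th percentile range for review, can be sketched in pandas; the column names are illustrative, not taken from the survey:

    ```python
    # Illustrative sketch of two cleaning steps: drop duplicate form
    # submissions, then flag percentile-range outliers for manual review.
    import pandas as pd

    def clean_round(df, value_col="income", id_col="household_id"):
        """Drop duplicate submissions per household, then flag values
        outside the 5th-95th percentile range of value_col."""
        df = df.drop_duplicates(subset=id_col, keep="first").copy()
        lo, hi = df[value_col].quantile([0.05, 0.95])
        df["review_flag"] = (df[value_col] < lo) | (df[value_col] > hi)
        return df

    raw = pd.DataFrame({
        "household_id": [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "income": [100, 100, 120, 90, 110, 95, 105, 115, 98, 5000, 102],
    })
    cleaned = clean_round(raw)
    ```

    In the survey's workflow the flagged values were not dropped automatically; they were checked against interview recordings before any correction.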

  9. 11: Streamwater sample constituent concentration outliers from 15 watersheds...

    • s.cnmilf.com
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/11-streamwater-sample-constituent-concentration-outliers-from-15-watersheds-in-gwinne-2003
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Area covered
    Georgia, Gwinnett County
    Description

    This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents in streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia, for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.

  10. PointDenoisingBenchmark Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 3, 2019
    Cite
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov (2019). PointDenoisingBenchmark Dataset [Dataset]. https://paperswithcode.com/dataset/pointcleannet
    Dataset updated
    Jan 3, 2019
    Authors
    Marie-Julie Rakotosaona; Vittorio La Barbera; Paul Guerrero; Niloy J. Mitra; Maks Ovsjanikov
    Description

    The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.

    PointDenoisingBenchmark for denoising: contains noisy point clouds with different levels of Gaussian noise and the corresponding clean ground truths. PointDenoisingBenchmark for outlier removal: contains point clouds with different levels of noise and density of outliers and the corresponding clean ground truths.

  11. Stream water-quality summary statistics and outliers, streamwater load...

    • catalog.data.gov
    • search.dataone.org
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Stream water-quality summary statistics and outliers, streamwater load models and yield estimates, and peak flow modeling parameters for 13 watersheds in Gwinnett County, Georgia [Dataset]. https://catalog.data.gov/dataset/stream-water-quality-summary-statistics-and-outliers-streamwater-load-models-and-yield-est
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Area covered
    Gwinnett County
    Description

    This data release includes the following five data tables: (1) water-quality constituent outliers that were removed from the calibration of regression models used to estimate streamwater solute loads, (2) parameters used to model peak streamflow recurrence intervals, (3) models used to estimate streamwater constituent loads, (4) statistical summaries of water-quality observations, and (5) estimated annual streamwater constituent yields. An associated metadata file is included for each of the five data tables.

  12. AT_2003_BACI_1

    • search.dataone.org
    Updated Oct 14, 2013
    Cite
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne (2013). AT_2003_BACI_1 [Dataset]. https://search.dataone.org/view/knb-lter-bes.349.570
    Dataset updated
    Oct 14, 2013
    Dataset provided by
    Long Term Ecological Research Network, http://www.lternet.edu/
    Authors
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne
    Time period covered
    Jan 1, 2004 - Nov 17, 2011
    Area covered
    Description

    MD Property View 2003 A and T Database. For more information on the A and T Database, refer to the enclosed documentation. This layer was edited to remove spatial outliers in the A and T Database. Spatial outliers are points that were not geocoded and as a result fell outside the Baltimore City boundary; 416 spatial outliers were removed from this layer. The field BLOCKLOT2 can be used to join this layer with the Baltimore City parcel layer. This is part of a collection of 221 Baltimore Ecosystem Study metadata records that point to a geodatabase. The geodatabase is available online and is considerably large; upon request, and under certain arrangements, it can be shipped on media such as a USB hard drive. The geodatabase is roughly 51.4 GB in size, consisting of 4,914 files in 160 folders. Although this metadata record and the others like it are not rich with attributes, it is nonetheless made available because the data it represents could indeed be useful.

  13. Data from: Pacman profiling: a simple procedure to identify stratigraphic...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Jul 8, 2011
    Cite
    David Lazarus; Manuel Weinkauf; Patrick Diver (2011). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Dataset updated
    Jul 8, 2011
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Marine, Global
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
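    The core Pacman step, trimming a calibrated fraction of each species' youngest and oldest occurrences as suspected outliers, can be sketched as follows. The trim fractions and example ages here are illustrative; in the method itself the fractions are calibrated against biostratigraphic marker species with well-documented ranges.

    ```python
    # Illustrative sketch of the Pacman trimming step for one species:
    # mark the youngest top_frac and oldest bottom_frac of occurrence
    # ages as suspected outliers.
    import numpy as np

    def pacman_trim(ages, top_frac=0.05, bottom_frac=0.05):
        """Given occurrence ages (Ma) for one species, return a boolean
        mask: True = keep, False = trimmed as a suspected outlier."""
        ages = np.asarray(ages, dtype=float)
        order = np.argsort(ages)              # youngest first
        n = len(ages)
        n_top = int(np.floor(top_frac * n))
        n_bot = int(np.floor(bottom_frac * n))
        keep = np.ones(n, dtype=bool)
        if n_top:
            keep[order[:n_top]] = False       # youngest occurrences
        if n_bot:
            keep[order[n - n_bot:]] = False   # oldest occurrences
        return keep

    # 40 in-range occurrences plus two displaced ones (reworking / age-model error)
    ages = np.concatenate([np.linspace(10, 20, 40), [2.0, 35.0]])
    keep = pacman_trim(ages, top_frac=0.05, bottom_frac=0.05)
    ```

    The trimmed occurrences would then be tallied per sample and section to build the outlier-frequency profiles the abstract describes.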

  14. Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 4, 2021
    Cite
    Ljubisa Miskovic; Ljubisa Miskovic; Jonas Béal; Michael Moret; Vassily Hatzimanikatis; Vassily Hatzimanikatis; Jonas Béal; Michael Moret (2021). Dataset - Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties [Dataset]. http://doi.org/10.5281/zenodo.3240300
    Dataset updated
    Feb 4, 2021
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Ljubisa Miskovic; Jonas Béal; Michael Moret; Vassily Hatzimanikatis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data needed to reproduce the results from the manuscript "Uncertainty Reduction in Biochemical Kinetic Models: Enforcing Desired Model Properties" by L. Miskovic, J. Beal, M. Moret, and V. Hatzimanikatis.

    1. Data generated with the ORACLE workflow that was used in the iSCHRUNK training:

    • Classification label vectors for the three analyzed metabolic concentration cases:
      • Reference case: class_vector_train_ref.mat
      • Extreme1 case: class_vector_train_ex1.mat
      • Extreme2 case: class_vector_train_ex2.mat
    • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
      • Reference case: training_set_ref.mat
      • Extreme1 case: training_set_ex1.mat
      • Extreme2 case: training_set_ex2.mat
    • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
      • Reference case: ccXTR_ref.mat
      • Extreme1 case: ccXTR_ex1.mat
      • Extreme2 case: ccXTR_ex2.mat
    • Thermodynamics-based Flux Analysis (TFA) models for the three cases:
      • Reference case: tfa_ref.mat
      • Extreme1 case: tfa_ex1.mat
      • Extreme2 case: tfa_ex2.mat

    2. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 4).

    • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
      • ccXTR_ValidNeg.mat
    • Parameter sets used in validation
      • validation_set_neg.mat

    3. Validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Table 3).

    • Negative control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_ValidRef_neg_agg.mat
        • Extreme1 case: ccXTR_ValidEx1_neg_agg.mat
        • Extreme2 case: ccXTR_ValidEx2_neg_agg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_ref_neg_agg.mat
        • Extreme1 case: validation_set_ref_neg_agg.mat
        • Extreme2 case: tvalidation_set_ref_neg_agg.mat
    • Positive control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes for the three cases. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_ValidRef_pos_agg.mat
        • Extreme1 case: ccXTR_ValidEx1_pos_agg.mat
        • Extreme2 case: ccXTR_ValidEx2_pos_agg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_ref_pos_agg.mat
        • Extreme1 case: validation_set_ex1_pos_agg.mat
        • Extreme2 case: validation_set_ex2_pos_agg.mat

    4. Reassignment study: validation data generated with the ORACLE workflow with the parameters constrained using the information obtained with the iSCHRUNK (Figure 6 and Table 4).

    • Negative control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_Valid_reassignment_neg.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_neg_reassignment.mat
    • Positive control:
      • Flux control coefficients of the xylose uptake rate (XTR) with respect to the network enzymes. For the statistics and the figures we have used the population with removed outliers.
        • Reference case: ccXTR_Valid_reassignment_pos.mat
      • Parameter sets used for training for the three analyzed metabolite concentration cases. As parameters, we used the degree of saturation of the enzyme active site, σA, which is constrained between 0 and 1.
        • Reference case: validation_set_pos_reassignment.mat

  15. Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous...

    • b2find.dkrz.de
    Updated Apr 27, 2023
    + more versions
    Cite
    (2023). Methane in NEEM-2011-S1 ice core from North Greenland, 1800 years continuous record: outliers removed, v2 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/5b01a790-eb9a-51fc-a1f2-56f4b676c2ca
    Dataset updated
    Apr 27, 2023
    Area covered
    North Greenland
    Description

    Description: Methane concentration from the Greenland NEEM-2011-S1 ice core from 71 to 408 m depth (~270-1961 CE). Methane concentrations were analysed online by laser spectrometer (SARA, Spectroscopy by Amplified Resonant Absorption, developed at Laboratoire Interdisciplinaire de Physique, Grenoble, France) on gas extracted from an ice core processed using a continuous melter system (Desert Research Institute). Methane data have a 5-second integration time (raw data acquisition rate 0.6 Hz). Analytical precision, from an Allan variance test, is 0.9 ppb (2 sigma). Long-term reproducibility is 2.6% (2 sigma). Gaps in the record are due to problems during online analysis. Online analysis was conducted August-September 2011.

    Notes: The lat-long provided is for the main NEEM borehole. The NEEM-2011-S1 core was drilled 200 m away in 2011, to 410 m depth. Methane concentrations are reported on the NOAA2004 scale (instrument calibrated on dry synthetic air standards). A correction factor of 1.079 has been applied to all data to correct for methane dissolution in the melted ice core sample prior to gas extraction. The correction factor was calculated using empirical data (concentrations not aligned/tied to existing discrete methane measurements).

    Additional methods description is provided in: Stowasser, C., Buizert, C., Gkinis, V., Chappellaz, J., Schupbach, S., Bigler, M., Fain, X., Sperlich, P., Baumgartner, M., Schilt, A., Blunier, T., 2012. Continuous measurements of methane mixing ratios from ice cores. Atmos. Meas. Tech. 5, 999-1013. Morville, J., Kassi, S., Chenevier, M., Romanini, D., 2005. Fast, low-noise, mode-by-mode, cavity-enhanced absorption spectroscopy by diode-laser self-locking. Appl. Phys. B Lasers Opt. 80, 1027-1038.

    NEEM (North Greenland Eemian Ice Drilling) project information: http://neem.dk/

    NEEM-2011-S1 CH4 no outliers: data minus data points exceeding the cut-off value. The cut-off value is 2 x the median absolute deviation (MAD) from a 15-yr running median. Different MAD values were used for the 250-1000 AD and 1100-1835 AD sections of the record.
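    The outlier rule described above, removing points that deviate from a 15-yr running median by more than 2 x MAD, can be sketched as follows. This simplified version uses a single global MAD and a window measured in samples, whereas the record used different MAD values for its two sections and a window defined in years; the synthetic series is illustrative only.

    ```python
    # Illustrative running-median / MAD outlier filter for a 1-D series.
    import numpy as np

    def running_median_mad_filter(values, window=15, n_mad=2.0):
        """Return a keep-mask: True where |value - running median| <= n_mad * MAD,
        with MAD the median absolute deviation from the running median."""
        values = np.asarray(values, dtype=float)
        half = window // 2
        padded = np.pad(values, half, mode="edge")
        run_med = np.array([np.median(padded[i:i + window])
                            for i in range(len(values))])
        dev = np.abs(values - run_med)
        mad = np.median(dev)                 # global MAD (per-section in the record)
        return dev <= n_mad * mad

    rng = np.random.default_rng(2)
    # Synthetic smooth "record" with measurement noise and two planted spikes
    ch4 = 700.0 + 10.0 * np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 2, 100)
    ch4[[20, 60]] += 80.0
    keep = running_median_mad_filter(ch4)
    ```

    Because the running median is robust, the planted spikes barely perturb it and are flagged cleanly, while the smooth background passes through.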

  16. CAMA_2003_BACI_1

    • search.dataone.org
    Updated Oct 14, 2013
    Cite
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne (2013). CAMA_2003_BACI_1 [Dataset]. https://search.dataone.org/view/knb-lter-bes.363.570
    Dataset updated
    Oct 14, 2013
    Dataset provided by
    Long Term Ecological Research Network, http://www.lternet.edu/
    Authors
    Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne
    Time period covered
    Jan 1, 2004 - Nov 17, 2011
    Area covered
    Description

    MD Property View 2003 CAMA Database. For more information on the CAMA Database, refer to the enclosed documentation. This layer was edited to remove spatial outliers in the CAMA Database. Spatial outliers are points that were not geocoded and as a result fell outside the Baltimore City boundary; 254 spatial outliers were removed from this layer. This is part of a collection of 221 Baltimore Ecosystem Study metadata records that point to a geodatabase. The geodatabase is available online and is considerably large; upon request, and under certain arrangements, it can be shipped on media such as a USB hard drive. The geodatabase is roughly 51.4 GB in size, consisting of 4,914 files in 160 folders. Although this metadata record and the others like it are not rich with attributes, it is nonetheless made available because the data it represents could indeed be useful.

  17. Data from: Male responses to sperm competition risk when rivals vary in...

    • researchdata.edu.au
    • data.niaid.nih.gov
    • +2more
    Updated 2019
    Cite
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
    Dataset updated
    2019
    Dataset provided by
    The University of Western Australia
    DRYAD
    Authors
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
    Description

    Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

    Files included:
    • Lymbery et al 2018 Full dataset (Lymbery et al Full Dataset.xlsx): all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one with the data and one with a legend for the column titles.
    • Lymbery et al 2018 Reduced dataset 1 (Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx): the data following the removal of three outliers for the purposes of data distribution, as described in the associated R code. Same two-spreadsheet layout.
    • Lymbery et al 2018 Reduced dataset 2 (Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx): the data after the removal of all outliers stated in the manuscript and associated R code. Same two-spreadsheet layout.
    • Lymbery et al 2018 R Script: all the R code used for the statistical analyses, with annotations to aid interpretation.

  18. Predictive Validity Data Set

    • figshare.com
    txt
    Updated Dec 18, 2022
    Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Antonio Abeyta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed because of a lack of GRE scores: thirty-nine of these belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission, and the remaining fifty-seven did not have an admissions committee score in the database. After 2011, the GRE's scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were on the new scale and could not be compared with the older scores on a raw-score basis. After removal of these 108 records from our analyses, a total of 420 student records remained, covering students who were currently enrolled, had left the doctoral program without a degree, or had left the doctoral program with an MS degree. To maintain consistency among participants, we removed 100 additional records so that our analyses considered only students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed, for a final data set of 286 (see Outliers below).

    Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers that could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers, which were removed before statistical analysis was performed.

    Sample. See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes for selected student groups. The D'Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test outcomes in the sample for normality. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam; a Mann-Whitney test was then used to test for statistically significant differences in mean GRE scores, percentiles, and undergraduate GPA between these two groups. Other variables, such as gender, race, ethnicity, and citizenship status, were also observed within the samples.

    Predictive Metrics. The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests.

    Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took each student to earn their doctoral degree, or the student's candidacy examination result.
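    The study ran these tests in GraphPad PRISM. As an illustrative sketch only, the same kinds of tests (Pearson correlation, Mann-Whitney U, Shapiro-Wilk normality) can be reproduced with SciPy; all data below are synthetic stand-ins, not the study's records.

```python
# Illustrative only: synthetic data mimicking the study's analyses with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gre_quant = rng.normal(160, 5, 50)  # hypothetical GRE Quantitative scores
# Hypothetical outcome: time to degree, weakly related to GRE score plus noise
years_to_degree = 5.5 - 0.01 * (gre_quant - 160) + rng.normal(0, 0.8, 50)

# Pearson correlation between GRE score and time to degree
r, p = stats.pearsonr(gre_quant, years_to_degree)

# Mann-Whitney U test: GRE scores of students who passed vs. failed candidacy
passed = rng.normal(161, 5, 30)
failed = rng.normal(158, 5, 20)
u, p_mw = stats.mannwhitneyu(passed, failed)

# Shapiro-Wilk normality test on the outcome variable
w, p_sw = stats.shapiro(years_to_degree)

print(f"Pearson r={r:.2f}, Mann-Whitney U={u:.0f}, Shapiro-Wilk W={w:.3f}")
```

    PRISM's ROUT outlier method has no direct SciPy equivalent; it combines robust regression with an FDR-controlled outlier test, which is why the description reports a Q (false discovery rate) setting rather than a fixed cutoff.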

  19. f

    Pearson correlations (r) between siblings for Eyes scores and Eyes scores...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Gillian Ragsdale; Robert A. Foley (2023). Pearson correlations (r) between siblings for Eyes scores and Eyes scores adjusted by removing the low-scoring outliers (Eyes Adj >17). [Dataset]. http://doi.org/10.1371/journal.pone.0023236.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Gillian Ragsdale; Robert A. Foley
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ** Correlation is significant at the 0.01 level (2-tailed).
    * Correlation is significant at the 0.05 level (2-tailed).
    ' Correlation is significant at the 0.1 level (2-tailed).
    For each model, the two categories of sibling pairs are derived from Table 2. In each case, a possible fit (in bold) is indicated by the second correlation being less than the first.

  20. d

    TreeShrink: fast and accurate detection of outlier long branches in...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Nov 30, 2023
    Siavash Mirarab; Uyen Mai (2023). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees [Dataset]. http://doi.org/10.6076/D1HC71
    Explore at:
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Siavash Mirarab; Uyen Mai
    Time period covered
    Jan 1, 2023
    Description

    Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this "k-shrink" problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.
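    The "k-shrink" idea can be illustrated, far more naively than the paper's polynomial-time algorithm, by brute-force checking which single leaf's removal shrinks the tree diameter the most. The toy tree and taxon names below are made up; the sketch uses an unweighted adjacency map, whereas the real method works on branch-length-weighted phylogenies.

```python
# Toy sketch (not TreeShrink's algorithm): find the leaf whose removal most
# reduces the diameter of a small unweighted tree. Taxon "X" sits at the end
# of a long pendant chain and so inflates the diameter.

def diameter(adj):
    """Longest shortest-path distance between any two nodes (BFS from each node)."""
    def bfs(src):
        dist = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        return max(dist.values())
    return max(bfs(n) for n in adj)

def drop_leaf(adj, leaf):
    """Return a copy of the tree with one leaf (and edges to it) removed."""
    return {u: [v for v in vs if v != leaf] for u, vs in adj.items() if u != leaf}

def worst_leaf(adj):
    """Leaf whose removal yields the smallest remaining diameter."""
    leaves = [n for n, vs in adj.items() if len(vs) == 1]
    return min(leaves, key=lambda lf: diameter(drop_leaf(adj, lf)))

# Hypothetical tree: taxa A, B, C near the root r, plus a long chain to taxon X.
adj = {
    "A": ["r"], "B": ["r"], "C": ["r"],
    "r": ["A", "B", "C", "x1"],
    "x1": ["r", "x2"], "x2": ["x1", "X"], "X": ["x2"],
}
print(diameter(adj), worst_leaf(adj))  # diameter 4; removing "X" shrinks it most
```

    TreeShrink generalizes this to the best set of k leaves for every k in polynomial time, then uses non-parametric statistics across many gene trees to decide which removals are outliers rather than fixing k by hand.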
