License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
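For context, here is a minimal R sketch of the naive detect-and-forget workflow the abstract critiques (the simulated data and the Cook's-distance cutoff are illustrative assumptions, not the paper's general class of outlier-removal procedures):

# Simulate a simple linear model with one planted outlier
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1] <- y[1] + 8  # plant an outlier

# Detect-and-forget: drop high-influence points, then refit
fit_full <- lm(y ~ x)
keep <- cooks.distance(fit_full) < 4 / n  # common rule-of-thumb cutoff
fit_clean <- lm(y[keep] ~ x[keep])

# Naive intervals that ignore the data-driven removal step; selective
# inference (e.g., the outference package) is designed to adjust these
confint(fit_clean)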
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations with this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat ARD tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations with this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

Files included in this data release:
- "year_byscene=XXXX.zip" includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" is a crosswalk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" is a crosswalk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" is a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
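As a hedged illustration (not taken from example_script_for_using_parquet.R; the extraction path and the 20% cloud threshold are assumptions), the nested .parquet directories can be queried with the R arrow package roughly as follows:

library(arrow)
library(dplyr)
# Open the hive-partitioned directory extracted from year_byscene=2023.zip
ds <- open_dataset("year_byscene=2023")
# Apply the cloud and water-pixel filters suggested above
clean <- ds %>%
  mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
  filter(percent_cloud_pixels < 0.20,  # assumed cloud-coverage threshold
         wb_dswe1_pixels >= 10) %>%
  collect()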
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
The objective behind this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps involved:
1) Read the csv file.
2) Data cleaning:
- Variables Country and Status had character data types and had to be converted to factor.
- 2563 missing values were encountered, with the Population variable having the most missing values, i.e., 652.
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped as they did not have much of an effect on the dependent variable, Life Expectancy. These 3 variables were Country, Year & Status. This meant we were now working with 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: We check for outliers using the IQR and find 54 outliers. These outliers are removed before we run the regression analysis once again. Multiple R squared increases from 83% to 86%.
- MULTICOLLINEARITY: We check for multicollinearity using the Variance Inflation Factor (VIF), which flags cases where two or more independent variables show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables with a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and Under five deaths have strong collinearity, so we drop Infant.deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. Variable thinness1.19 is dropped next and we run the regression once more.
- Variable thinness5.9, whose absolute VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
4) Set the seed and split the data into train and test sets. We fit on the train data and get a multiple R squared of 86% and a p-value less than alpha, which means the model is statistically significant. We then use the fitted model to predict on the test data and compute the error metrics, loading library(Metrics) for this purpose.
- In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values are off by 3.2 years compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
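A minimal R sketch of the IQR and VIF steps above (the file name, column names, and exact filtering rule are assumptions, not the original analysis code):

library(car)  # vif()
df <- na.omit(read.csv("life_expectancy.csv"))  # assumed file name
df <- df[, !(names(df) %in% c("Country", "Year", "Status"))]
# IQR rule on the dependent variable
q <- quantile(df$Life.expectancy, c(0.25, 0.75))
iqr <- q[2] - q[1]
df <- df[df$Life.expectancy >= q[1] - 1.5 * iqr &
         df$Life.expectancy <= q[2] + 1.5 * iqr, ]
# Fit and flag predictors with VIF > 5
fit <- lm(Life.expectancy ~ ., data = df)
vif(fit)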
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
11K rows of r/DirtyWritingPrompts, including score (which is upvotes minus downvotes). You probably want to remove the last 1K rows because the data is sorted from most upvoted to least, and a few hundred of the last samples have negative upvotes lmao. We also removed any story under 400 characters.
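A minimal R sketch of that trim (the file name and column layout are assumptions about this dataset):

# Drop the negative-score tail instead of blindly cutting the last 1K rows
dwp <- read.csv("dirty_writing_prompts.csv")  # hypothetical file name
dwp <- dwp[dwp$score >= 0, ]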
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation & mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit, with additional tables containing information on the impact of sampling effort on richness, a rareness table of species per dataset, and two summary tables for bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript but also here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to the respective BBS and BBA observation unit shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between the BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer's Blackbird, Cooper's Hawk, Henslow's Sparrow, Kirtland's Warbler, LeConte's Sparrow, Lincoln's Sparrow, Swainson's Thrush, Wilson's Snipe, and Wilson's Warbler. In addition, naming conventions were matched between the BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill's Flycatcher, and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established "exotic" species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R Foundation for Statistical Computing, 2019), and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R to yield a final dataset with all species observations and environmental variables for each observation unit.
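A hedged sketch of that pivoting step (toy data; the actual column names and state tables are not part of this release):

library(tidyr)
# Toy long-format data: one row per (sampling unit, species) observation
bba_long <- data.frame(
  unit     = c("block1", "block1", "block2"),
  species  = c("Northern Flicker", "Myrtle Warbler", "Northern Flicker"),
  presence = c(1, 1, 1)
)
# Pivot so every row is a sampling unit and each species becomes a 0/1 column
bba_wide <- pivot_wider(bba_long, names_from = species,
                        values_from = presence, values_fill = 0)
# State tables with identical columns can then be stacked, e.g.:
# final <- rbind.data.frame(bba_wide_env, bbs_wide_env)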
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and for BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). The area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and the mean and range of elevation, were extracted for every BBA block and 400m BBS route buffer using the Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
This resource is for historic purposes only and was provided for the GovHack competition (3-5 July 2015). After the event it was discovered that the latitude and longitude columns had been inadvertently inverted. For any project using this data please use the updated version of the resource (link) located here. We have elected not to remove this resource at this time so as to ensure that any GovHack entries using this data are not disadvantaged during the judging process. We intend to remove this version of the data after the GovHack judging has been completed.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir <- ''  # set to the directory holding the downloaded csv files
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
### Load csv -- creates a data.frame where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.
### To look at samples of the data, uncomment this line:
### head(d.train)

### Let's save the Image column as another variable, and remove it from d.train:
### d.train is our dataframe, and we want the column called Image.
### Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL  # removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL  # removes 'Image' from the dataframe
#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result to integer:
# strsplit splits the string,
# unlist simplifies its output to a vector of strings,
# as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate appropriate libraries.
### The tutorial is meant for Linux and OSX, where they use a different library, so:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### convert every image string to an integer vector
im.train <- foreach(im = im.train, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner command for each image and combines
# the results with rbind (combine by rows).
# Note: %do% runs sequentially; %dopar% (with a registered parallel backend)
# is what runs the evaluations in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file
### Can reload them at any time with load('data.Rd')
save(d.train, d.test, im.train, im.test, file = 'data.Rd')
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data = rev(im.train[1, ]), nrow = 96, ncol = 96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).
# To visualize the image we use R's image function:
image(1:96, 1:96, im, col = gray((0:255) / 255))
# Let's color the coordinates for the eyes and nose
points(96 - d.train$nose_tip_x[1], 96 - d.train$nose_tip_y[1], col = "red")
points(96 - d.train$left_eye_center_x[1], 96 - d.train$left_eye_center_y[1], col = "blue")
points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")
# Another good check is to see how variable our data is.
# For example, where are the centers of each nose in the 7049 images? (this takes a while to run):
for (i in 1:nrow(d.train)) {
  points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
}
# There are quite a few outliers -- they could be labeling errors.
# Looking at one extreme example we get this:
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data = rev(im.train[idx, ]), nrow = 96, ncol = 96)
image(1:96, 1:96, im, col = gray((0:255) / 255))
points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")
# In this case there's no labeling error, but it shows that not all faces are centered.
# One of the simplest things to try is to compute the mean of the coordinates of each
# keypoint in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm = T)
# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data = colMeans(d.train, na.rm = T),
            nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get that
# with the help of the reshape2 library:
library(reshape2)
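A hedged sketch of the reshaping step the truncated snippet presumably continued with (the melt arguments and output column names are assumptions based on the competition's submission format):

# One row per (ImageId, keypoint) pair, as the submission format expects
submission <- melt(predictions, id.vars = "ImageId",
                   variable.name = "FeatureName", value.name = "Location")
write.csv(submission, "submission.csv", row.names = FALSE)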
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Along-track temperature, salinity, backscatter, chlorophyll fluorescence, and normalized water-leaving radiance (nLw).
On the bow of the R/V Roger Revelle was a Satlantic SeaWiFS Aircraft Simulator (MicroSAS) system, used to estimate water-leaving radiance from the ship, analogous to the nLw derived by the SeaWiFS and MODIS satellite sensors, but free from atmospheric error (hence, it can provide data below clouds).
The system consisted of a down-looking radiance sensor and a sky-viewing radiance sensor, both mounted on a steerable holder on the bow. A downwelling irradiance sensor was mounted at the top of the ship's meteorological mast, on the bow, far from any potentially shading structures. These data were used to estimate normalized water-leaving radiance as a function of wavelength. The radiance detector was set to view the water at 40deg from nadir as recommended by Mueller et al. [2003b]. The water radiance sensor was able to view over an azimuth range of ~180deg across the ship's heading with no viewing of the ship's wake. The direction of the sensor was adjusted to view the water 90-120deg from the sun's azimuth, to minimize sun glint. This pointing was continually adjusted: the time and the ship's gyro heading were used to calculate the sun's position using an astronomical solar position subroutine interfaced with a stepping motor attached to the radiometer mount (designed and fabricated at Bigelow Laboratory for Ocean Sciences). Protocols for operation and calibration were performed according to Mueller [Mueller et al., 2003a; Mueller et al., 2003b; Mueller et al., 2003c]. Before 1000h and after 1400h, data quality was poorer as the solar zenith angle was too high. Post-cruise, the 10Hz data were filtered to remove as much residual white cap and glint as possible (we accept the lowest 5% of the data). Reflectance plaque measurements were made several times at local apparent noon on sunny days to verify the radiometer calibrations.
Within an hour of local apparent noon each day, a Satlantic OCP sensor was deployed off the stern of the R/V Revelle after the ship was oriented so that the sun was off the stern. The ship would secure the starboard Z-drive and use the port Z-drive and bow thruster to move ahead at about 25cm s-1. The OCP was then trailed aft and brought to the surface ~100m aft of the ship, then allowed to sink to 100m as downwelling spectral irradiance and upwelling spectral radiance were recorded continuously along with temperature and salinity. This procedure ensured there were no ship shadow effects in the radiometry.
Instruments include a WETLabs WETStar fluorometer, a WETLabs ECO Triplet and a SeaBird microTSG.
Radiometry was done using a Satlantic 7 channel microSAS system with Es, Lt and Li sensors.
Chl data are based on intercalibrating discrete surface chlorophyll measurements with the temporally closest fluorescence measurements and applying the regression results to all fluorescence data.
Data have been corrected for instrument biofouling and drift based on weekly pure-water calibrations of the system. Radiometric data have been processed using standard Satlantic processing software and checked with periodic plaque measurements using a 2% Spectralon standard.
Lw is calculated from Lt and Lsky and is "what Lt would be if the sensor were looking straight down". Since our sensors are mounted at 40deg, based on various NASA protocols, we need to do that conversion. nLw adds Es to the mix: Es is used to normalize Lw. nLw is related to Rrs, the remote sensing reflectance.
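For reference, these quantities are connected by the standard ocean-optics relations sketched below (conventional notation; the sea-surface reflectance factor rho and the mean extraterrestrial solar irradiance F0 are assumptions not stated above):

Lw = Lt - rho * Lsky (remove reflected sky light from the total radiance)
Rrs = Lw / Es (remote sensing reflectance)
nLw = Rrs * F0 (normalized water-leaving radiance)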
Techniques used are as described in:
Balch WM, Drapeau DT, Bowler BC, Booth ES, Windecker LA, Ashe A (2008) Space-time variability of carbon standing stocks and fixation rates in the Gulf of Maine, along the GNATS transect between Portland, ME, USA, and Yarmouth, Nova Scotia, Canada. J Plankton Res 30:119-139
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Summary:
Marine geophysical exchange files for R/V Kilo Moana, 2002 to 2018. The collection includes 328 geophysical archive files spanning km0201, the vessel's very first expedition, through km1812, the last survey included in this data synthesis.
Data formats (you will likely require only one of these):
MGD77T (M77T): ASCII - the current standard format for marine geophysical data exchange, tab delimited, low human readability
MGD77: ASCII - legacy format for marine geophysical data exchange (no longer recommended due to truncated data precision and low human readability)
GMT DAT: ASCII - the Generic Mapping Tools format in which these archive files were built, best human readability but largest file size
MGD77+: highly flexible and disk-space-saving binary NetCDF-based format, enables adding additional columns and application of errata-based data correction methods (e.g., Chandler et al., 2012), not human readable
The process by which formats were converted is explained below.
Data Reduction and Explanation:
R/V Kilo Moana routinely acquired bathymetry data using two concurrently operated sonar systems; hence, for this analysis, a best effort was made to extract center beam depth values from the appropriate sonar system. No resampling or decimation of center beam depth data has been performed, with the exception that all depth measurements were required to be temporally separated by at least 1 second. The initial sonar systems were the Kongsberg EM120 for deep water mapping and the EM1002 for shallow water mapping. The vessel's deep sonar system was upgraded to the Kongsberg EM122 in January of 2010 and the shallow system to the EM710 in March 2012.
The vessel deployed a Lacoste and Romberg spring-type gravity meter (S-33) from 2002 until March 2012, when it was replaced with a Bell Labs BGM-3 forced feedback-type gravity meter. Of considerable importance is that gravity tie-in logs were by and large inadequate for the rigorous removal of gravity drift and tares. Hence a best effort has been made to remove gravity meter drift via robust regression against satellite-derived gravity data. Regression slope and intercept are analogous to instrument drift and DC shift; hence their removal markedly improves the agreement between shipboard and satellite gravity anomalies for most surveys. These drift corrections were applied to both the observed gravity and free air anomaly fields. If the corrections are undesired by users, the correction coefficients have been supplied within the metadata headers for all gravity surveys, thereby allowing users to undo these drift corrections.
The L&R gravity meter had a 180 second hardware filter so for this analysis the data were Gaussian filtered another 180 seconds and resampled at 10 seconds. BGM-3 data are not hardware filtered hence a 360 second Gaussian filter was applied for this analysis. BGM-3 gravity anomalies were resampled at 15 second intervals. For both meter types, data gaps exceeding the filter length were not through-interpolated. Eotvos corrections were computed via the standard formula (e.g., Dehlinger, 1978) and were subjected to identical filtering of the respective gravity meter.
The vessel also deployed a Geometrics G-882 cesium vapor magnetometer on several expeditions. A Gaussian filter length of 135 seconds has been applied and resampling was performed at 15 second intervals with the same exception that no interpolation was performed through data gaps exceeding the filter length.
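As a hedged illustration of this style of filtering (a generic Gaussian smoother in R with gap handling; this is not the code used to produce the archives, and the kernel width convention is an assumption):

# Gaussian-smooth series y sampled at times t (seconds); width = full filter
# length in seconds (e.g., 360 for BGM-3 gravity, 135 for magnetics)
gauss_smooth <- function(t, y, width) {
  sigma <- width / 6  # assumed: full width spans ~6 sigma
  sapply(t, function(t0) {
    in_win <- abs(t - t0) <= width / 2 & !is.na(y)
    if (!any(in_win)) return(NA)  # gaps longer than the filter stay unfilled
    w <- exp(-0.5 * ((t[in_win] - t0) / sigma)^2)
    sum(w * y[in_win]) / sum(w)
  })
}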
Archive file production:
At all depth, gravity and magnetic measurement times, vessel GPS navigation was resampled using linear interpolation as most geophysical measurement times did not exactly coincide with GPS position times. The geophysical fields were then merged with resampled vessel navigation and listed sequentially in the GMT DAT format to produce data records.
Archive file header fields were populated with relevant information such as port names, PI names, instrument and data processing details, and others whereas survey geographic and temporal boundary fields were automatically computed from the data records.
Archive file conversion:
Once completed, each marine geophysical data exchange file was converted to the other formats using the Generic Mapping Tools program known as mgd77convert. For example, conversions to the other formats were carried out as follows:
mgd77convert km0201.dat -Ft -Tm # gives mgd77t (m77t file extension)
mgd77convert km0201.dat -Ft -Ta # gives mgd77
mgd77convert km0201.dat -Ft -Tc # gives mgd77+ (nc file extension)
Disclaimers:
These data have not been edited in detail using a visual data editor and data outliers are known to exist. Several hardware malfunctions are known to have occurred during the 2002 to 2018 time frame and these malfunctions are apparent in some of the data sets. No guarantee is made that the data are accurate and they are not meant to be used for vessel navigation. Close scrutiny and further removal of outliers and other artifacts is recommended before making scientific determinations from these data.
The archive file production method employed for this analysis is explained in detail by Hamilton et al (2019).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Additional file 1. The additional file (Additional file 1.zip) is a compressed folder containing five .csv files. Table S1: Targeted metabolite data; Table S2: Microbiomics ASV count table resulting from 16S amplicon sequencing; Table S3: Microbiomics ASV count table resulting from ITS amplicon sequencing; Table S4: Transcriptomics read count table (transposed format); and Table S5: Labels and class information, including the color code of the analyzed samples. Besides being the data source for the present case study, these data tables can be used as a test dataset after removal of the table header (first line). We highly recommend opening the files in a text editor of your choice and removing the headers there; when doing this step in Excel, an error may occur.
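For example, a hedged R alternative to deleting the header by hand (the file name is a placeholder):

# Read a table while skipping its first (header) line
ts1 <- read.csv("Table_S1.csv", skip = 1, header = FALSE)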
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Variation in observed global generic richness over the Phanerozoic must be partly explained by changes in the numbers of fossils and their geographic spread over time. The influence of sampling intensity (i.e., the number of samples) has been well addressed, but the extent to which the geographic distribution of samples might influence recovered biodiversity is comparatively unknown. To investigate this question, we create models of genus richness through time by resampling the same occurrence dataset of modern global biodiversity using spatially explicit sampling intensities defined by the paleo-coordinates of fossil occurrences from successive time intervals. Our steady-state null model explains about half of observed change in uncorrected fossil diversity and a quarter of variation in sampling-standardized diversity estimates. The inclusion in linear models of two additional explanatory variables associated with the spatial array of fossil data (absolute latitudinal range of occurrences, percent of occurrences from shallow environments) and a Cenozoic step increases the accuracy of steady-state models, accounting for 67% of variation in sampling-standardized estimates and more than one third of the variation in first differences. Our results make clear that the spatial distribution of samples is at least as important as numerical sampling intensity in determining the trajectory of recovered fossil biodiversity through time, and caution against overinterpretation of both the variation and the trend that emerge from analyses of global Phanerozoic diversity.
Methods
Fossil data were downloaded from the Paleobiology Database and manually cleaned to remove errors (i.e., non-marine organisms being included in the marine dataset). Modern marine invertebrate data were downloaded from the Ocean Biodiversity Information System using the R API. Further data transformations and statistical analyses were performed on the datasets using the R code provided.
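As a hedged illustration of the modern-data download step (robis is the R client for the OBIS API; the example taxon and fields are placeholders, not the study's actual query):

library(robis)
# Pull modern occurrence records for an example taxon from OBIS
occ <- occurrence("Pectinidae",
                  fields = c("decimalLongitude", "decimalLatitude", "genus"))
head(occ)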