Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
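For illustration, a minimal R sketch of this kind of workflow might look like the following; the simulated data frame, variable names, and outlier rule are placeholders, not objects from the archived scripts.

```r
# Illustrative sketch only; the archived per-dataset scripts are the authoritative code.
library(MASS)   # glm.nb() for negative binomial regression

set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- rnbinom(200, mu = exp(0.5 + 0.8 * dat$x), size = 1.5)   # simulated overdispersed counts

# Fit a Poisson model first and compute the overdispersion statistic
# (Pearson chi-square divided by residual degrees of freedom)
pois_fit <- glm(y ~ x, family = poisson, data = dat)
overdispersion <- sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual
overdispersion    # values well above 1 indicate overdispersion

# Negative binomial regression and summary statistics
nb_fit <- glm.nb(y ~ x, data = dat)
summary(nb_fit)

# One simple outlier rule: drop observations with |standardized residual| > 3
dat_clean <- dat[abs(rstandard(nb_fit)) <= 3, ]
```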
https://spdx.org/licenses/CC0-1.0.html
These data and computer code (written in R, https://www.r-project.org) were created to statistically evaluate a suite of spatiotemporal covariates that could potentially explain pronghorn (Antilocapra americana) mortality risk in the Northern Sagebrush Steppe (NSS) ecosystem (50.0757° N, −108.7526° W). Known-fate data were collected from 170 adult female pronghorn monitored with GPS collars from 2003 to 2011, which were used to construct a time-to-event (TTE) dataset with a daily timescale and an annual recurrent origin of 11 November. Seasonal risk periods (winter, spring, summer, autumn) were defined by median migration dates of collared pronghorn. We linked this TTE dataset with spatiotemporal covariates that were extracted and collated from pronghorn seasonal activity areas (estimated using 95% minimum convex polygons) to form a final dataset. Specifically, average fence and road densities (km/km2), average snow water equivalent (SWE; kg/m2), and maximum decadal normalized difference vegetation index (NDVI) were considered as predictors. We tested for these main effects of spatiotemporal risk covariates as well as the hypotheses that pronghorn mortality risk from roads or fences could be intensified during severe winter weather (i.e., interactions: SWE*road density and SWE*fence density). We also compared an analogous frequentist implementation to estimate model-averaged risk coefficients. Ultimately, the study aimed to develop the first broad-scale, spatially explicit map of predicted annual pronghorn survivorship based on anthropogenic features and environmental gradients to identify areas for conservation and habitat restoration efforts.
Methods: We combined relocations from GPS-collared adult female pronghorn (n = 170) with raster data that described potentially important spatiotemporal risk covariates. We first collated relocation and time-to-event data to remove from the analysis individual pronghorn that had no spatial data available. We then constructed seasonal risk periods based on the median migration dates determined from a previous analysis; thus, we defined 4 seasonal periods as winter (11 November–21 March), spring (22 March–10 April), summer (11 April–30 October), and autumn (31 October–10 November). We used the package 'amt' in Program R to rarefy relocation data to a common 4-hr interval using a 30-min tolerance. We used the package 'adehabitatHR' in Program R to estimate seasonal activity areas using 95% minimum convex polygons. We constructed annual- and season-specific risk covariates by averaging values within individual activity areas. We specifically extracted values for linear features (road and fence densities), a proxy for snow depth (SWE), and a measure of forage productivity (NDVI). We resampled all raster data to a common resolution of 1 km2. Given that fence density models characterized regional-scale variation in fence density (i.e., 1.5 km2), this resolution seemed appropriate for our risk analysis. We fit Bayesian proportional hazards (PH) models using a time-to-event approach to model the effects of spatiotemporal covariates on pronghorn mortality risk. We aimed to develop a model to understand the relative effects of risk covariates for pronghorn in the NSS. The effect of fence or road densities may depend on SWE such that the variables interact in affecting mortality risk; thus, our full candidate model included four main effects and two interaction terms. We used reversible-jump Markov chain Monte Carlo (RJMCMC) to determine relative support for a nested set of Bayesian PH models. This allowed us to conduct Bayesian model selection and averaging in one step using two custom samplers provided for the R package 'nimble'. For brevity, we do not include all of the code, GIS data, etc. used to estimate seasonal activity areas and to extract and collate spatial risk covariates for each individual; rather, we provide the final time-to-event dataset and all code needed to reproduce the risk regression results presented in the manuscript.
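As a rough sketch of the relocation-processing steps described above (not the authors' archived code; the toy data, object names, and coordinate reference system are assumptions), the rarefaction and activity-area steps might look like:

```r
# Illustrative sketch of the track rarefaction and 95% MCP steps; not the archived analysis code.
library(amt)           # track handling and resampling
library(lubridate)     # hours(), minutes()
library(adehabitatHR)  # minimum convex polygons
library(sp)

# Toy relocation table (the real data are GPS fixes for 170 pronghorn)
set.seed(42)
relocs <- data.frame(
  id = rep("PH001", 48),
  x  = cumsum(rnorm(48, 0, 100)) + 500000,
  y  = cumsum(rnorm(48, 0, 100)) + 5600000,
  t  = seq(as.POSIXct("2005-11-11 00:00", tz = "UTC"), by = "2 hours", length.out = 48)
)

trk <- amt::make_track(relocs, x, y, t, id = id, crs = 32613)   # EPSG code is an assumption

# Rarefy to a common 4-hr interval with a 30-min tolerance
trk_4h <- amt::track_resample(trk, rate = hours(4), tolerance = minutes(30))

# 95% minimum convex polygon per individual (seasonal subsetting omitted here)
pts <- sp::SpatialPointsDataFrame(
  coords = as.data.frame(trk_4h)[, c("x_", "y_")],
  data   = data.frame(id = factor(trk_4h$id))
)
mcp95 <- adehabitatHR::mcp(pts, percent = 95)
```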
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least square (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
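For context, the naive "detect-and-forget" workflow that the paper critiques can be written in a few lines of base R. This is a generic sketch with simulated data and one common outlier rule (Cook's distance); it is not the outference interface.

```r
# Naive "detect-and-forget" workflow (the approach shown to invalidate classical inference).
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[c(3, 57)] <- y[c(3, 57)] + 8          # inject two outliers

fit_full <- lm(y ~ x)

# Step 1: flag outliers, here with Cook's distance (one of several common rules)
keep <- cooks.distance(fit_full) <= 4 / n

# Step 2: refit on the remaining data and report p-values/CIs as if nothing happened
fit_clean <- lm(y ~ x, subset = keep)
summary(fit_clean)      # these "classical" intervals ignore the selection step
confint(fit_clean)
```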
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

Files included in this data release:
- "year_byscene=XXXX.zip" - includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" - includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" - This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, compile .parquet files in nested directories using the R arrow package, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" - This crosswalk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" - This crosswalk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" - A map of the Landsat grid tiles labelled by the horizontal-vertical ID.
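A brief sketch of how the nested .parquet directories might be queried with the R arrow package; the directory, tile, and partition names follow the examples above and should be treated as assumptions about your local extracted layout.

```r
# Illustrative use of the R arrow package on an extracted, hive-style directory tree.
library(arrow)
library(dplyr)

# Point open_dataset() at an extracted zip, e.g. the 2023 by-scene data;
# key=value directory names (tile_hv=...) are picked up as partition columns.
ds <- open_dataset("year_byscene=2023")

ds %>%
  filter(tile_hv == "002-001") %>%   # one ARD tile, as in the example path above
  collect() %>%
  head()
```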
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contain bathymetric data from the Namibian continental slope. The data were acquired on R/V Meteor research expedition M76/1 in 2008 and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and especially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam 12 kHz Kongsberg EM120 system. The data were processed using the public software package MB-System. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data. Initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
This dataset contains cleaned GBIF (www.gbif.org) occurrence records and associated climate and environmental data for all arthropod prey of listed species in California drylands, as identified in Lortie et al. (2023): https://besjournals.onlinelibrary.wiley.com/doi/full/10.1002/2688-8319.12251. All arthropod records were downloaded from GBIF (https://doi.org/10.15468/dl.ngym3r) on 14 November 2022. Records were imported into R using the rgbif package and cleaned with the CoordinateCleaner package to remove occurrence data with likely errors. Environmental data include bioclimatic variables from WorldClim (www.worldclim.org), landcover and NDVI data from MODIS and the LPDAAC (https://lpdaac.usgs.gov/), elevation data from the USGS (https://www.sciencebase.gov/catalog/item/542aebf9e4b057766eed286a), and distance to the nearest road from the Census Bureau's TIGER/Line road shapefile (https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html). All environmental data were combined into a stacked raster, and we extracted the environmental variables for each occurrence record from this raster to make the final dataset.
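A sketch of the import, cleaning, and extraction steps described above; it is illustrative rather than the archived script, the download key and raster file name are placeholders, and the cleaning tests shown are a typical CoordinateCleaner selection rather than the exact set used.

```r
# Sketch of the GBIF import and coordinate-cleaning steps (illustrative, not the archived script).
# The download key below is a placeholder; the actual records come from the GBIF download
# cited above (https://doi.org/10.15468/dl.ngym3r).
library(rgbif)
library(CoordinateCleaner)
library(terra)

occ <- occ_download_get("0000000-000000000000000") |>   # placeholder download key
  occ_download_import()

occ_clean <- clean_coordinates(
  x       = occ,
  lon     = "decimalLongitude",
  lat     = "decimalLatitude",
  species = "species",
  tests   = c("capitals", "centroids", "equal", "gbif", "institutions", "zeros")
)
occ_clean <- occ_clean[occ_clean$.summary, ]   # keep records that pass all tests

# Extract environmental covariates from a stacked raster at each occurrence
env_stack <- terra::rast("environmental_stack.tif")   # placeholder stacked raster
env_vals  <- terra::extract(env_stack, occ_clean[, c("decimalLongitude", "decimalLatitude")])
final     <- cbind(occ_clean, env_vals)
```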
This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient located along the lower montane life zone of a hillslope near the Pumphouse location at the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater in combination with data on groundwater chemistry (see related references) can be used to estimate rates of solute export from the hillslope to the floodplain and river. QA/QC analysis of measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated timestamp values, gap filling of missing timestamps and water levels, and removal of abnormal/bad values and outliers from the measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and the development of regular (5-minute, 1-hour, and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques and will be applicable for ecohydrological modeling. The package includes a Readme file, one R code file used to perform QA/QC, a series of 8 data csv files (six QA/QC-ed regular time series datasets of varying intervals (5-min, 1-hr, 1-day) and two files with QA/QC flagging of original data), and three files for the reporting format adoption of this dataset (InstallationMethods, file-level metadata (flmd), and data dictionary (dd) files). QA/QC-ed data herein were derived from the original/raw data publication available at Williams et al., 2020 (DOI: 10.15485/1818367). For more information about running the R code file (10.15485_1866836_QAQC_PLM1_PLM6.R) to reproduce QA/QC output files, see the README (QAQC_PLM_readme.docx). This dataset replaces the previously published raw data time series and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells is available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022). https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales. 2022/09/09 Update: Converted data files using ESS-DIVE's Hydrological Monitoring Reporting Format. With the adoption of this reporting format, three new files (v1_20220909_flmd.csv, v1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at the PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv).
The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description. 2023/01/18 Update: Dataset updated to include additional QA/QC-ed water level data up until 2022-10-12 for ER-PLM1 and 2022-10-13 for ER-PLM6. Reporting-format-specific files (v2_20230118_flmd.csv, v2_20230118_dd.csv, v2_20230118_InstallationMethods.csv) were updated to reflect the additional data. An R code file (QAQC_PLM1_PLM6.R) was added to replace the previously uploaded HTML files and enable execution of the associated code. The R code file (QAQC_PLM1_PLM6.R) and ReadMe file (QAQC_PLM_readme.docx) were revised to clarify where the original data were retrieved from and to remove local file paths.
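The operations described above (duplicate-timestamp flagging, gap filling, out-of-range screening, and regularization) might look roughly like the sketch below. It is illustrative only, not the archived 10.15485_1866836_QAQC_PLM1_PLM6.R script; all file and column names, and the screening thresholds, are placeholders.

```r
# Illustrative QA/QC sketch only; file/column names and thresholds are placeholders.
library(dplyr)

wl <- read.csv("waterlevel_raw.csv", stringsAsFactors = FALSE) %>%
  mutate(datetime = as.POSIXct(datetime, tz = "UTC"))

# Flag duplicated timestamps
wl <- wl %>% mutate(dup_flag = duplicated(datetime))

# Build a regular 5-minute time axis and join, leaving gaps as NA
grid_5min <- data.frame(datetime = seq(min(wl$datetime), max(wl$datetime), by = "5 min"))
wl_5min   <- left_join(grid_5min, distinct(wl, datetime, .keep_all = TRUE), by = "datetime")

# Crude out-of-range screen (thresholds are arbitrary here), then hourly means
wl_5min <- wl_5min %>%
  mutate(water_level_m = ifelse(water_level_m < 0 | water_level_m > 50, NA, water_level_m))

wl_hourly <- wl_5min %>%
  mutate(hour = as.POSIXct(format(datetime, "%Y-%m-%d %H:00:00"), tz = "UTC")) %>%
  group_by(hour) %>%
  summarise(water_level_m = mean(water_level_m, na.rm = TRUE), .groups = "drop")
```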
https://www.bco-dmo.org/dataset/2317/license
Meteorology and sea surface temperature (MET) 1 minute data from eight R/V Oceanus cruises in the Gulf of Maine and Georges Bank area during 1998 access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson acquisition_description=The sea surface temperature as measured by the hull sensor is not shown since the sea surface temperature as measured via the engine inlet (field name is temp_ss1) is more accurate. awards_0_award_nid=54610 awards_0_award_number=unknown GB NSF awards_0_funder_name=National Science Foundation awards_0_funding_acronym=NSF awards_0_funding_source_nid=350 awards_0_program_manager=David L. Garrison awards_0_program_manager_nid=50534 awards_1_award_nid=54626 awards_1_award_number=unknown GB NOAA awards_1_funder_name=National Oceanic and Atmospheric Administration awards_1_funding_acronym=NOAA awards_1_funding_source_nid=352 cdm_data_type=Other comment=Emet 1 minute data starting 1996. rcg 2/10/1998 Remove SST_engine_intake parameter from display 7/8/1998 rcg File: OC317W.DAT Output of OCMETA.FOR(4/14/99) Change SSTEMP to SSTEMP3. 5/24/1999 rcg Conventions=COARDS, CF-1.6, ACDD-1.3 data_source=extract_data_as_tsv version 2.3 19 Dec 2019 defaultDataQuery=&time<now doi=10.1575/1912/bco-dmo.2317.1 Easternmost_Easting=-65.2285 geospatial_lat_max=43.8375 geospatial_lat_min=39.6182 geospatial_lat_units=degrees_north geospatial_lon_max=-65.2285 geospatial_lon_min=-71.0428 geospatial_lon_units=degrees_east infoUrl=https://www.bco-dmo.org/dataset/2317 institution=BCO-DMO instruments_0_acronym=TSG instruments_0_dataset_instrument_description=Thermosalinograph used to obtain a continuous record of sea surface temperature and salinity. instruments_0_dataset_instrument_nid=4226 instruments_0_description=A thermosalinograph (TSG) is used to obtain a continuous record of sea surface temperature and salinity. On many research vessels the TSG is integrated into the ship's underway seawater sampling system and reported with the underway or alongtrack data. instruments_0_instrument_external_identifier=https://vocab.nerc.ac.uk/collection/L05/current/133/ instruments_0_instrument_name=Thermosalinograph instruments_0_instrument_nid=470 instruments_0_supplied_name=Thermosalinograph keywords_vocabulary=GCMD Science Keywords metadata_source=https://www.bco-dmo.org/api/dataset/2317 Northernmost_Northing=43.8375 param_mapping={'2317': {'lat': 'master - latitude', 'lon': 'master - longitude', 'press_bar': 'flag - depth'}} parameter_source=https://www.bco-dmo.org/mapserver/dataset/2317/parameters people_0_affiliation=Woods Hole Oceanographic Institution people_0_affiliation_acronym=WHOI people_0_person_name=Dr Richard Payne people_0_person_nid=50490 people_0_role=Principal Investigator people_0_role_type=originator people_1_affiliation=Woods Hole Oceanographic Institution people_1_affiliation_acronym=WHOI BCO-DMO people_1_person_name=Robert C. Groman people_1_person_nid=50380 people_1_role=BCO-DMO Data Manager people_1_role_type=related project=GB projects_0_acronym=GB projects_0_description=The U.S. GLOBEC Georges Bank Program is a large multi- disciplinary multi-year oceanographic effort. The proximate goal is to understand the population dynamics of key species on the Bank - Cod, Haddock, and two species of zooplankton (Calanus finmarchicus and Pseudocalanus) - in terms of their coupling to the physical environment and in terms of their predators and prey. 
The ultimate goal is to be able to predict changes in the distribution and abundance of these species as a result of changes in their physical and biotic environment as well as to anticipate how their populations might respond to climate change. The effort is substantial, requiring broad-scale surveys of the entire Bank, and process studies which focus both on the links between the target species and their physical environment, and the determination of fundamental aspects of these species' life history (birth rates, growth rates, death rates, etc). Equally important are the modelling efforts that are ongoing which seek to provide realistic predictions of the flow field and which utilize the life history information to produce an integrated view of the dynamics of the populations. The U.S. GLOBEC Georges Bank Executive Committee (EXCO) provides program leadership and effective communication with the funding agencies. projects_0_geolocation=Georges Bank, Gulf of Maine, Northwest Atlantic Ocean projects_0_name=U.S. GLOBEC Georges Bank projects_0_project_nid=2037 projects_0_project_website=http://globec.whoi.edu/globec_program.html projects_0_start_date=1991-01 sourceUrl=(local files) Southernmost_Northing=39.6182 standard_name_vocabulary=CF Standard Name Table v55 subsetVariables=year,depth_w,depth_cs,ed_lw,temp_ss1,temp_ss5,numb_records version=1 Westernmost_Easting=-71.0428 xml_source=osprey2erddap.update_xml() v1.3
The attached file details the workflow for the processing and analysis of active acoustic data (Simrad EK60; 12, 38, 120 and 200 kHz) collected from RSV Aurora Australis during the 2006 BROKE-West voyage. The attached file is in Echoview(R) (https://www.echoview.com/) version 8 format.
The Echoview file is suitable for working with fisheries acoustics (i.e., water column backscatter) data collected using a Simrad EK60, and the file is set up to read 38, 120 and 200 kHz split-beam data. The file has operators to remove acoustic noise, e.g. spikes and dropped pings, and operators for removing surface noise and seabed echoes. Echoes arising from krill are isolated using the 'dB-difference' method recommended by CCAMLR. The Echoview file is set up to export the results of krill echo integration as both intervals and swarms. Full details of the method are available in Jarvis et al. (2010) and the krill swarm methods are described in Bestley et al. (2017).
This dataset is composed of 60 spherical convergence maps (30 per class), separated into a training set of 40 maps and a testing set of 20 maps.
As a preprocessing step, we recommend removing the mean of each map and smoothing the maps with a symmetric Gaussian beam of 3 arcmin (sphtfunc.smoothing in Healpy).
This dataset was created by R. Sgier and T. Kacprzak based on the work in the paper by Sgier et al. (2018), "Fast Generation of Covariance Matrices for Weak Lensing", arxiv.org/abs/1801.05745.
This dataset is used in "DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications" arxiv.org/abs/1810.12186.
This dataset contains polylines depicting non-woodland linear tree and shrub features in Cornwall and much of Devon, derived from lidar data collected by the Tellus South West project. Data from a lidar (light detection and ranging) survey of South West England were used with existing open-source GIS datasets to map non-woodland linear features consisting of woody vegetation. The output dataset is the product of several steps of filtering and masking the lidar data using GIS landscape feature datasets available from the Tellus South West project (digital terrain model (DTM) and digital surface model (DSM)), the Ordnance Survey (OS VectorMap District and OpenMap Local, to remove buildings) and the Forestry Commission (Forestry Commission National Forest Inventory Great Britain 2015, to remove woodland parcels). The dataset was tiled as 20 x 20 km shapefiles, coded by the bottom-left 10 km hectad name. Ground-truthing suggests an accuracy of 73.2% for hedgerow height classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Existing panel data methods remove unobserved individual effects before change point estimation through data transformations such as first-differencing. In this paper, we show that multiple change points can be consistently estimated in short panels via ordinary least squares. Since no data variation is removed before change point estimation, our method has better small-sample properties than first-differencing methods. We also propose two tests that identify whether the change points found by our method originate in the slope parameters or in the covariance of the regressors with individual effects. We illustrate our method by modeling the environmental Kuznets curve and US house price expectations after the financial crisis.
Seven in-stream HOBO pressure transducers and one reference HOBO pressure transducer have been deployed within the Lake Sunapee, NH, USA watershed. Six of the transducers have been in operation since 2010, and an additional transducer was added in 2016. The transducers record data every 15 minutes, and data are downloaded approximately three times per year (early spring, mid-summer, and late fall). Shortly after download, the data are processed to estimate stream depth using HOBOware's Barometric Compensation Assistant and converted to a .csv file in HOBOware. The data have been QA/QC'd to recode obviously errant data to NA using the R programming language. No data transformation has occurred beyond basic QA/QC to remove known data issues and obviously errant data. The barometric pressure data from the reference transducer located on land are also included in this data package.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)

April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as described for LScD Version 2 below.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started

This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC

Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united as one word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted with "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by space.
6. Removing numbers: All digits that are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric strings such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop-word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file "LScD.csv".

The Organisation of the LScD

The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
Word: It contains unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
Number of Documents Containing the Word: In this field, a binary calculation is used: if a word exists in an abstract, then there is a count of 1. If the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s in the entire corpus.
Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code

LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: It contains all abstracts after the pre-processing steps defined in Step 4.
DTM: It is the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.

The code can be used as follows:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package Text Mining in R," Available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
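A condensed R sketch of the kind of pipeline described in Steps 4 and 5, using the 'tm' package mentioned above. The placeholder abstracts are invented, the prefix-uniting and substitution steps are omitted, and the step order is simplified relative to the archived LScD_Creation.R.

```r
# Condensed illustration only; LScD_Creation.R is the authoritative code and its exact
# step order (see Step 4 above) differs slightly from this sketch.
library(tm)
library(slam)

abstracts <- c("The z-score and chi-square tests are well-known.",
               "CO2 and H2O were measured in 2014.")          # placeholder abstracts
corpus <- VCorpus(VectorSource(abstracts))

corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^[:alnum:]-]", " ", x)))  # keep "-" for now
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("-", " ", x)))              # drop remaining "-"
corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\b[0-9]+\\b", " ", x)))   # standalone numbers only
corpus <- tm_map(corpus, removeWords, stopwords("english"))    # 174 English stop words in 'tm'
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)

# Number of documents containing each word, and total appearances in the corpus
n_docs  <- col_sums(weightBin(dtm))
n_total <- col_sums(dtm)
lscd <- data.frame(word = Terms(dtm), n_docs = n_docs, n_total = n_total)
lscd <- lscd[order(-lscd$n_docs), ]
write.csv(lscd, "LScD.csv", row.names = FALSE)
```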
These are easements concerning properties adjoining railways, established in areas defined by the Act of 15 July 1845 on the Railway Police and by Article 6 of the Decree of 30 October 1935, as amended, creating visibility easements on public roads, namely:
— prohibition on erecting any construction, other than a boundary wall, within a distance of two metres from a railway (art. 5 of the Law of 15 July 1845),
— prohibition, without prior authorisation, of excavations in a zone equal in width to the vertical height of a railway embankment of more than three metres, measured from the foot of the slope (art. 6 of the Law of 15 July 1845),
— prohibition on establishing thatch coverings, straw and hay ricks, and any other deposit of flammable materials at a distance of less than 20 metres from a railway served by steam engines, measured from the foot of the slope (art. 7 of the Law of 15 July 1845),
— prohibition on depositing stones or non-flammable objects, without prior prefectural authorisation, less than five metres from a railway (art. 8 of the Law of 15 July 1845),
— visibility easements at the crossing of a public road and a railway (art. 6 of the Decree-Law of 30 October 1935 and art. R. 114-6 of the Highway Code), easements defined by a clearance plan drawn up by the authority managing the highway, which may include, as the case may be, in accordance with Article 2 of the decree:
• the obligation to remove boundary walls or replace them with railings, to remove troublesome plantings, and to lower and keep the ground and any superstructure at a level at most equal to the level determined by the above-mentioned clearance plan,
• the absolute prohibition on building, placing fences, filling, planting and making any installations above the level set by the clearance plan.
Texts in force: Law of 15 July 1845 on the Railway Police — Title I: measures relating to the conservation of the railway (Articles 1 to 11); Highway Code (created by Act No. 89-413 and Decree No. 89-631), in particular the following articles: — L. 123-6 and R. 123-3 relating to alignment on national roads, — L. 114-1 to L. 114-6 relating to visibility easements at grade crossings, — R. 131-1 et seq. and R. 141-1 et seq. for the implementation of clearance plans on departmental or municipal roads.
The linear entities of this dataset relate to the use of certain resources and equipment; they affect land use. Because these easements are collected from third parties, DDT-77 cannot guarantee the completeness and accuracy of the transfer of these easements onto a large-scale map.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new Excel file called "Simcoe_Zooplankton.csv". The ZDATE column in the Excel file needs to be switched from GENERAL to SHORT DATE so that the dates in the ZDATE column read "YYYY/MM/DD". Save as .csv in the appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data that has been subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period and to combine data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" file is then used for the calculation of variability, stabilization, asynchrony, and Shannon diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis" contains the final statistical analyses as well as code to reproduce the original figures. Data and code for the main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).
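A minimal sketch of the import and five-station screening steps described above. The station column name (STN) is an assumption; the archived .R files contain the authoritative subsetting and week-combining logic.

```r
# Minimal import sketch; the full subsetting and weekly aggregation live in the archived .R files.
zoop <- read.csv("Simcoe_Zooplankton.csv", stringsAsFactors = FALSE)

# Ensure ZDATE is an actual date in YYYY/MM/DD form
zoop$ZDATE <- as.Date(zoop$ZDATE, format = "%Y/%m/%d")

# Example of the kind of manual screening described: keep only sampling dates
# with observations from 5 stations (station column name is an assumption)
counts <- aggregate(STN ~ ZDATE, data = zoop, FUN = function(x) length(unique(x)))
keep_dates <- counts$ZDATE[counts$STN == 5]
zoop_5 <- zoop[zoop$ZDATE %in% keep_dates, ]
```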
https://spdx.org/licenses/CC0-1.0.html
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript but also here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
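The pivoting step described here might look roughly like the following sketch; the toy table and column names (block_id, species, breeding_code) are placeholders, not the original field names.

```r
# Rough sketch of the presence/absence coding and pivot step; not the original processing code.
library(dplyr)
library(tidyr)

# Toy long-format table: one row per sampling unit x species with a breeding code
bba_long <- tibble::tribble(
  ~block_id, ~species,           ~breeding_code,
  "MA_001",  "Wood Thrush",       3,
  "MA_001",  "Northern Flicker",  1,
  "MA_002",  "Wood Thrush",       4
)

bba_wide <- bba_long %>%
  mutate(presence = as.integer(breeding_code %in% c(2, 3, 4))) %>%  # codes 2-4 = present
  pivot_wider(id_cols = block_id,
              names_from = species,
              values_from = presence,
              values_fill = 0)

# State tables would then be stacked (e.g. with rbind.data.frame) once the
# environmental variables have been joined, as described above.
bba_wide
```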
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CTD data were acquired when the RMT instrument was in the water.
Data Acquisition:
There is an FSI CTD sensor housed in a fibreglass box that is attached to the top bar of the RMT. The RMT software running in the aft control room establishes a Telnet connection to the aft control terminal server, which connects to the CTD sensor using various hardware connections. Included are the calibration data for the CTD sensor that were used for the duration of the voyage.
The RMT software receives packets of CTD data, and every second the most recent CTD data are written out to a data file. Additional information about the motor is also logged with the CTD data.
Data are only written to the data file when the net is in the water. The net in and out of water status is determined by the conductivity value. The net is deemed to be in the water when the conductivity averaged over a 10 second period is greater than 0. When the average value is less than 0 the net is deemed to be out of the water. New data files were automatically created for each trawl.
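A small sketch of the in-water test described above, assuming a hypothetical data frame of 1-second CTD records; it is illustrative rather than the RMT software's actual logic.

```r
# Sketch of the in-water test: 10-second running mean of conductivity > 0 means "in water".
library(zoo)

# Hypothetical 1-second CTD records; conductivity is ~0 in air, >0 in seawater
ctd <- data.frame(second = 1:30,
                  conductivity = c(rep(0, 12), rep(35, 18)))

ctd$cond_10s <- zoo::rollmeanr(ctd$conductivity, k = 10, fill = NA)
ctd$in_water <- !is.na(ctd$cond_10s) & ctd$cond_10s > 0
```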
Data Processing:
If the net did not open when first attempted then the net was 'jerked' open. This meant the winch operator adjusted the winch control so that it was at maximum speed and then turned it on for a very short time. This had the effect of dropping the net a short distance very quickly. This dislodges the net hook from its cradle and the net opens. The scientist responsible for the trawl would have noted the time in the trawl log book that the winch operator turned on the winch to jerk the net.
The data files will have started the 'net open' counter 10 seconds after the user clicks the 'Net Open' button. If this time did not match the time written in the trawl log book by the scientist, then the net open time in the CSV file was adjusted. The value in the 'Net Open Time' column will increment from the time the net started to open to the time that the net started to close.
The pressure was also plotted to ensure that the time written down in the log book was correct. When the net opens there is a visible change in the CTD pressure value received: the net 'flies' up as the drag in the water increases when the net opens. If the time noted was incorrect then the scientist responsible for the log book, So Kawaguchi, was notified of the problem and the data file was not adjusted.
The original log files that were produced by the RMT software were trimmed to remove any columns that did not pertain to the CTD data. These columns include the motor information and the ITI data. The ITI data give information about the distance from the net to the ship, but the ITI was not working for the duration of the BROKE-West voyage. This trimming was completed using a purpose-built Java application. This Java class is part of the NOODLES source code.
Dataset Format:
The dataset is in a zip format. There is a .CSV file for each trawl, 125 in total. There were 51 Routine trawls and 74 Target Trawls. The file naming convention is as follows:
[Routine/Target]NNN-rmt-2006-MM-DD.csv
Where,
NNN is the trawl number from 001 to 124. MM is the month (01 or 02). DD is the day of the month.
Also included in the zip file are the calibration files for each of the CTD sensors and the current documentation on the RMT software.
Each CSV file contains the following columns:
- Date (UTC)
- Time (UTC)
- Ship Latitude (decimal degrees)
- Ship Longitude (decimal degrees)
- Conductivity (mS/cm)
- Temperature (Deg C)
- Pressure (DBar)
- Salinity (PSU)
- Sound Velocity (m/s)
- Fluorometer (ug/L chlA)
- Net Open Time (mm:ss): if the net is not open this value will be 0, else the number of minutes and seconds since the net opened will be displayed.
When the user clicks the 'Net Open' button there is a delay of 10 seconds before the net starts to open. The value displayed in the 'Net Open Time' column starts incrementing once this 10 seconds delay has passed. Similarly when the user clicks the 'Net Close' button there is a delay of 6 seconds before the net starts to close. Thus the counter stops once this 6 seconds has passed.
Acronyms Used:
CTD: Conductivity, Temperature, Pressure
RMT: Rectangular Midwater Trawl
CSV: Comma separated value
FSI: Falmouth Scientific Inc
ITI: Intelligent Trawl Interface
This work was completed as part of ASAC projects 2655 and 2679 (ASAC_2655, ASAC_2679).
A 150-kHz narrowband RD Instruments Acoustic Doppler Current Profiler (ADCP) internally recorded 34,805 current ensembles over 362 days from an Ice-Ocean Environmental Buoy (IOEB) deployed during the SHEBA project. The IOEB was initially deployed about 50 km from the main camp and drifted from 75.1 N, 141 W to 80.6 N, 160 W between October 1, 1997 and September 30, 1998. The ADCP was located at a depth of 14 m below the ice surface and was configured to record data at 15-minute intervals from 40 8-m-wide bins extending downward to 320 m below the instrument. The retrieved 24 MB of raw data were processed to remove noise, correct for platform drift and geomagnetic declination, remove bottom hits, and output 2-hr average Earth-referenced current profiles along with ancillary data.
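As a generic illustration of two of the processing steps mentioned (rotating velocities from magnetic to true coordinates using the geomagnetic declination, and forming 2-hr averages), not the original processing code; the simulated ensembles and the declination value are placeholders.

```r
# Generic sketch: rotate east/north velocities by geomagnetic declination, then 2-hr averages.
set.seed(2)
adcp <- data.frame(
  time = seq(as.POSIXct("1997-10-01 00:00", tz = "UTC"), by = "15 min", length.out = 96),
  u    = rnorm(96, 0.05, 0.02),   # eastward velocity (m/s), magnetic frame
  v    = rnorm(96, 0.00, 0.02)    # northward velocity (m/s), magnetic frame
)

decl_deg <- 20                    # placeholder declination; it varies along the drift track
theta <- decl_deg * pi / 180
adcp$u_true <-  adcp$u * cos(theta) + adcp$v * sin(theta)
adcp$v_true <- -adcp$u * sin(theta) + adcp$v * cos(theta)

# 2-hour block averages
adcp$block <- cut(adcp$time, breaks = "2 hours")
avg_2hr <- aggregate(cbind(u_true, v_true) ~ block, data = adcp, FUN = mean)
```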
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 14 February 2022.
--- Dataset description provided by original source is as follows ---
I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.
This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.
Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.
--- Original source retains full ownership of the source dataset ---
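A small R sketch of the kind of cleanup described in the quoted text, assuming the column layout of the public Kaggle Pokemon.csv; the file name and columns may differ from the file actually used for the course.

```r
# Illustrative cleanup similar to what is described; column names assume the public
# Kaggle "Pokemon.csv" layout and may differ from the actual file used.
library(dplyr)

pokemon <- read.csv("Pokemon.csv", check.names = FALSE)

pokemon <- pokemon %>%
  rename(Number = `#`, Type1 = `Type 1`, Type2 = `Type 2`,
         SpAtk = `Sp. Atk`, SpDef = `Sp. Def`) %>%
  select(-Total)                    # drop the Total variable

str(pokemon)
```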