52 datasets found
  1. R code

    • figshare.com
    txt
    Updated Jun 5, 2017
    Cite
    Christine Dodge (2017). R code [Dataset]. http://doi.org/10.6084/m9.figshare.5021297.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 5, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Christine Dodge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.

  2. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Oct 2, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset include:

    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile; the deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations taken below the water surface.

    Potential methods for addressing these limitations:

    - Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage. Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10), and filter to waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    Contents of the data release:

    - "year_byscene=XXXX.zip" – temperature summary statistics for individual waterbodies, and for the deepest point (the furthest point from land) within each waterbody, by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data were extracted from the raster images because of clouds obscuring the image, a tile covering mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" – summary statistics for individual waterbodies and the deepest point within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for tiles 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" – this script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" – a cross-walk file identifying the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" – a cross-walk file identifying the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal–vertical ID.
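The suggested filters can be sketched in a few lines of Python (the column names come from the dataset description; the 50% cloud threshold and the example rows are illustrative assumptions, and the dataset's own example script is in R, not Python):

```python
def percent_cloud_pixels(wb_dswe9_pixels, wb_dswe1_pixels):
    """Fraction of cloud pixels over a waterbody, per the formula above."""
    return wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)

def keep_row(row, max_cloud=0.5):
    """Apply the three suggested filters to one summary-statistics row."""
    return (
        percent_cloud_pixels(row["wb_dswe9_pixels"], row["wb_dswe1_pixels"]) <= max_cloud
        and row["wb_dswe1_pixels"] >= 10  # enough water pixels
        and row["dp_dswe"] == 1           # deepest point classified as water
    )

# Hypothetical rows of waterbody summary statistics.
rows = [
    {"wb_dswe9_pixels": 5, "wb_dswe1_pixels": 95, "dp_dswe": 1},   # kept
    {"wb_dswe9_pixels": 80, "wb_dswe1_pixels": 20, "dp_dswe": 1},  # too cloudy
    {"wb_dswe9_pixels": 1, "wb_dswe1_pixels": 4, "dp_dswe": 1},    # too few water pixels
]
filtered = [r for r in rows if keep_row(r)]
```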

  3. Data and R code from: Spatiotemporal risk factors predict landscape-scale survivorship for a northern ungulate

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 31, 2022
    Cite
    Daniel Eacker; Andrew Jakes; Paul Jones (2022). Data and R code from: Spatiotemporal risk factors predict landscape-scale survivorship for a northern ungulate [Dataset]. http://doi.org/10.5061/dryad.pvmcvdnnt
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2022
    Dataset provided by
    Smithsonian's National Zoo and Conservation Biology Institute
    Alberta Wildlife Association
    Taurus Wildlife Consulting
    Authors
    Daniel Eacker; Andrew Jakes; Paul Jones
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    These data and computer code (written in R, https://www.r-project.org) were created to statistically evaluate a suite of spatiotemporal covariates that could potentially explain pronghorn (Antilocapra americana) mortality risk in the Northern Sagebrush Steppe (NSS) ecosystem (50.0757° N, −108.7526° W). Known-fate data were collected from 170 adult female pronghorn monitored with GPS collars from 2003-2011, which were used to construct a time-to-event (TTE) dataset with a daily timescale and an annual recurrent origin of 11 November. Seasonal risk periods (winter, spring, summer, autumn) were defined by median migration dates of collared pronghorn. We linked this TTE dataset with spatiotemporal covariates that were extracted and collated from pronghorn seasonal activity areas (estimated using 95% minimum convex polygons) to form a final dataset. Specifically, average fence and road densities (km/km²), average snow water equivalent (SWE; kg/m²), and maximum decadal normalized difference vegetation index (NDVI) were considered as predictors. We tested for these main effects of spatiotemporal risk covariates as well as the hypotheses that pronghorn mortality risk from roads or fences could be intensified during severe winter weather (i.e., the interactions SWE × road density and SWE × fence density). We also compared an analogous frequentist implementation to estimate model-averaged risk coefficients. Ultimately, the study aimed to develop the first broad-scale, spatially explicit map of predicted annual pronghorn survivorship based on anthropogenic features and environmental gradients, to identify areas for conservation and habitat restoration efforts.

    Methods We combined relocations from GPS-collared adult female pronghorn (n = 170) with raster data that described potentially important spatiotemporal risk covariates. We first collated relocation and time-to-event data to remove from the analysis individual pronghorn that had no spatial data available. We then constructed seasonal risk periods based on the median migration dates determined from a previous analysis; thus, we defined 4 seasonal periods as winter (11 November–21 March), spring (22 March–10 April), summer (11 April–30 October), and autumn (31 October–10 November). We used the package 'amt' in Program R to rarify relocation data to a common 4-hr interval using a 30-min tolerance. We used the package 'adehabitatHR' in Program R to estimate seasonal activity areas using 95% minimum convex polygons. We constructed annual- and seasonal-specific risk covariates by averaging values within individual activity areas. We specifically extracted values for linear features (road and fence densities), a proxy for snow depth (SWE), and a measure of forage productivity (NDVI). We resampled all raster data to a common resolution of 1 km². Given that fence density models characterized regional-scale variation in fence density (i.e., 1.5 km²), this resolution seemed appropriate for our risk analysis. We fit Bayesian proportional hazards (PH) models using a time-to-event approach to model the effects of spatiotemporal covariates on pronghorn mortality risk. We aimed to develop a model to understand the relative effects of risk covariates for pronghorn in the NSS. The effect of fence or road densities may depend on SWE such that the variables interact in affecting mortality risk; thus, our full candidate model included four main effects and two interaction terms. We used reversible-jump Markov chain Monte Carlo (RJMCMC) to determine relative support for a nested set of Bayesian PH models. This allowed us to conduct Bayesian model selection and averaging in one step using two custom samplers provided for the R package 'nimble'. For brevity, rather than include all of the code, GIS layers, etc. used to estimate seasonal activity areas and extract and collate spatial risk covariates for each individual, we provide the final time-to-event dataset and all code needed to reproduce the risk regression results presented in the manuscript.

  4. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.

  5. Air-conditioner Location Running Hours Data for GovHack 2015

    • researchdata.edu.au
    Updated Jul 9, 2014
    + more versions
    Cite
    Department of Industry, Science and Resources (DISR) (2014). Air-conditioner Location Running Hours Data for GovHack 2015 [Dataset]. https://researchdata.edu.au/air-conditioner-location-govhack-2015/2979421
    Explore at:
    Dataset updated
    Jul 9, 2014
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Department of Industry, Science and Resources (DISR)
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This resource is for historic purposes only and was provided for the GovHack competition (3-5 July 2015). After the event it was discovered that the latitude and longitude columns had been inadvertently inverted. For any project using this data, please use the updated version of the resource (link). We have elected not to remove this resource at this time so as to ensure that any GovHack entries using this data are not disadvantaged during the judging process. We intend to remove this version of the data after the GovHack judging has been completed.

  6. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant – a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for item sets they are most likely to purchase. I was given a retailer's dataset in which the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting item sets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most often used when you plan to discover associations between different objects in a set, such as frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support/P(mouse) = 0.08/0.10 = 0.80; lift = confidence/P(mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
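The support, confidence, and lift arithmetic can be verified in a few lines of Python (numbers taken from the example: 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both):

```python
n = 100          # total customers
n_mouse = 10     # bought a computer mouse
n_mat = 9        # bought a mouse mat
n_both = 8       # bought both

support = n_both / n                  # P(mouse & mat) = 0.08
confidence = support / (n_mouse / n)  # support / P(mouse) = 0.80
lift = confidence / (n_mat / n)       # confidence / P(mat), about 8.9
```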

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522,065
    • Number of attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Below, I briefly describe each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next, we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice appear in ...
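The transformation described here, grouping items by invoice and counting how often item sets co-occur, can be sketched in pure Python (the post itself uses the R arules package; the invoice numbers and items below are made-up stand-ins):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactions: BillNo -> items bought on that invoice.
transactions = {
    "536365": {"bread", "milk", "butter"},
    "536366": {"bread", "butter"},
    "536367": {"milk", "jam"},
    "536368": {"bread", "milk", "butter", "jam"},
}

def support_counts(transactions, size):
    """Count how many invoices contain each itemset of the given size."""
    counts = defaultdict(int)
    for items in transactions.values():
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    return counts

n = len(transactions)
pairs = support_counts(transactions, 2)
# Keep pairs meeting a minimum support threshold (assumption: 50%).
frequent = {p: c / n for p, c in pairs.items() if c / n >= 0.5}
```

The arules package performs this grouping (plus rule generation with confidence and lift) directly on the transaction object.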

  7. Data from: Data on invasive corallimorphs Palmyra

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 1, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data on invasive corallimorphs Palmyra [Dataset]. https://catalog.data.gov/dataset/data-on-invasive-corallimorphs-palmyra
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Invasive marine species are well documented but options to manage them are limited. At Palmyra Atoll National Wildlife Refuge (Central North Pacific), native invasive corallimorpharians, Rhodactis howesii, have smothered live native corals since 2007. Laboratory and field trials were conducted evaluating two control methods to remove R. howesii overgrowing the benthos at Palmyra Atoll (Palmyra): 1) toothpaste mixed with chlorine, citric acid, or sodium hydroxide (NaOH), and 2) hot water. Paste mixed with NaOH had the most efficacious kill in mesocosm trials and resulted in >90% kill over a 98 m² area three days after treatment. Hot water at 82 °C was most effective in mesocosms; in the field, hot water was less effective than paste but still resulted in a kill of ca. 75% over 100 m² three days after treatment. Costs of paste and heat (excluding capital equipment and the costs of regulatory approval should this method be deployed at large scale) were $70/m² and $59/m², respectively. Invasive R. howesii currently occupy ca. 5,800,000 m² of reef at Palmyra, with ca. 276,000 m² comprising heavily infested areas. Several potential management strategies are discussed based on costs of treatment, area covered, and the biology of the invasion. The methods described here expand the set of tools available to manage invasive species in complex marine habitats.

  8. R

    R K.v2i.coco Remove_hashmarkerli Dataset

    • universe.roboflow.com
    zip
    Updated Aug 26, 2025
    Cite
    afb (2025). R K.v2i.coco Remove_hashmarkerli Dataset [Dataset]. https://universe.roboflow.com/afb-nnhqq/r-k.v2i.coco-remove_hashmarkerli-bjhdy/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    afb
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Variables measured
    YARD12
    Description

    R K.v2i.coco Remove_hashmarkerli

    ## Overview
    
    R K.v2i.coco Remove_hashmarkerli is a dataset for computer vision tasks - it contains YARD12 annotations for 263 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).
    
  9. Wenau, Stefan, Spieß, Volkhard, Zabel, Matthias (2021). Dataset: Multibeam bathymetry processed data (EM 120 echosounder dataset compilation) of RV METEOR & RV MARIA S. MERIAN during cruise M76/1 & MSM19/1c, Namibian continental slope

    • service.tib.eu
    Updated Nov 29, 2024
    + more versions
    Cite
    Wenau, Stefan; Spieß, Volkhard; Zabel, Matthias (2021). Multibeam bathymetry processed data (EM 120 echosounder dataset compilation) of RV METEOR & RV MARIA S. MERIAN during cruise M76/1 & MSM19/1c, Namibian continental slope [Dataset]. https://doi.org/10.1594/PANGAEA.932434 (via https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-932434)
    Explore at:
    Dataset updated
    Nov 29, 2024
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data contain bathymetric data from the Namibian continental slope, acquired on R/V Meteor research expedition M76/1 in 2008 and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and especially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam, 12 kHz Kongsberg EM120 system and processed using the open-source software package MB-System. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data, and initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
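The Gaussian weighted mean gridding step can be illustrated with a minimal sketch: each grid node takes the average of nearby soundings, weighted by a Gaussian of their distance to the node. The soundings, node position, and sigma below are made up for illustration; the actual dataset was gridded with the processing software named above.

```python
import math

def gaussian_weighted_mean(points, node_x, node_y, sigma):
    """Weighted mean depth at a grid node; weights fall off with distance."""
    num = den = 0.0
    for x, y, depth in points:
        d2 = (x - node_x) ** 2 + (y - node_y) ** 2
        w = math.exp(-d2 / (2 * sigma ** 2))
        num += w * depth
        den += w
    return num / den

# Made-up soundings (x, y, depth in m) near a grid node at (0, 0), sigma = 10 m.
soundings = [(2.0, 1.0, -105.0), (8.0, -3.0, -110.0), (-5.0, 4.0, -100.0)]
node_depth = gaussian_weighted_mean(soundings, 0.0, 0.0, 10.0)
```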

  10. Underway Data from R/V Melville, R/V Roger Revelle cruises MV1101, RR1202 in the Southern Ocean (30-60S); 2011-2012 (Great Calcite Belt project)

    • search.dataone.org
    • bco-dmo.org
    • +2more
    Updated Dec 5, 2021
    Cite
    William M. Balch (2021). Underway Data from R/V Melville, R/V Roger Revelle cruises MV1101, RR1202 in the Southern Ocean (30-60S); 2011-2012 (Great Calcite Belt project) [Dataset]. https://search.dataone.org/view/http%3A%2F%2Flod.bco-dmo.org%2Fid%2Fdataset%2F560142
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Biological and Chemical Oceanography Data Management Office (BCO-DMO)
    Authors
    William M. Balch
    Area covered
    Southern Ocean
    Description

    Along-track temperature, salinity, backscatter, chlorophyll fluorescence, and normalized water-leaving radiance (nLw).

    On the bow of the vessel was a Satlantic SeaWiFS Aircraft Simulator (MicroSAS) system, used to estimate water-leaving radiance from the ship, analogous to the nLw derived by the SeaWiFS and MODIS satellite sensors but free from atmospheric error (hence, it can provide data below clouds).

    The system consisted of a down-looking radiance sensor and a sky-viewing radiance sensor, both mounted on a steerable holder on the bow. A downwelling irradiance sensor was mounted at the top of the ship's meteorological mast, on the bow, far from any potentially shading structures. These data were used to estimate normalized water-leaving radiance as a function of wavelength. The radiance detector was set to view the water at 40° from nadir, as recommended by Mueller et al. [2003b]. The water radiance sensor was able to view over an azimuth range of ~180° across the ship's heading with no viewing of the ship's wake. The direction of the sensor was adjusted to view the water 90-120° from the sun's azimuth, to minimize sun glint. This was continually adjusted as the time and ship's gyro heading were used to calculate the sun's position using an astronomical solar position subroutine interfaced with a stepping motor attached to the radiometer mount (designed and fabricated at Bigelow Laboratory for Ocean Sciences). Protocols for operation and calibration were performed according to Mueller [Mueller et al., 2003a; Mueller et al., 2003b; Mueller et al., 2003c]. Before 1000h and after 1400h, data quality was poorer as the solar elevation was too low. Post-cruise, the 10Hz data were filtered to remove as much residual whitecap and glint as possible (we accept the lowest 5% of the data). Reflectance plaque measurements were made several times at local apparent noon on sunny days to verify the radiometer calibrations.

    Within an hour of local apparent noon each day, a Satlantic OCP sensor was deployed off the stern of the vessel after the ship oriented so that the sun was off the stern. The ship would secure the starboard Z-drive, and use port Z-drive and bow thruster to move the ship ahead at about 25cm s-1. The OCP was then trailed aft and brought to the surface ~100m aft of the ship, then allowed to sink to 100m as downwelling spectral irradiance and upwelling spectral radiance were recorded continuously along with temperature and salinity. This procedure ensured there were no ship shadow effects in the radiometry.

    Instruments include a WETLabs wetstar fluorometer, a WETLabs ECOTriplet and a SeaBird microTSG.
    Radiometry was done using a Satlantic 7 channel microSAS system with Es, Lt and Li sensors.

    Chl data are based on intercalibrating discrete surface chlorophyll measurements with the temporally closest fluorescence measurement and applying the regression results to all fluorescence data.

    Data have been corrected for instrument biofouling and drift based on weekly pure-water calibrations of the system. Radiometric data have been processed using standard Satlantic processing software and checked with periodic plaque measurements using a 2% Spectralon standard.

    Lw is calculated from Lt and Lsky and is "what Lt would be if the sensor were looking straight down". Since our sensors are mounted at 40°, based on various NASA protocols, we need to do that conversion.

    nLw adds Es to the mix: Es is used to normalize Lw. nLw is related to Rrs, the remote-sensing reflectance.
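As a numerical sketch of how these quantities relate: a common ocean-optics convention derives Lw from Lt and Lsky via a surface-reflectance factor, then normalizes by Es. All values below (including the reflectance factor rho and the solar irradiance f0) are illustrative assumptions, not the cruise's actual calibration or processing chain.

```python
# Hypothetical single-wavelength values.
lt = 1.20      # total radiance seen by the sea-viewing sensor
lsky = 2.50    # sky radiance from the sky-viewing sensor
rho = 0.028    # assumed sea-surface reflectance factor
es = 120.0     # measured downwelling irradiance
f0 = 185.0     # assumed mean extraterrestrial solar irradiance at this band

lw = lt - rho * lsky   # remove surface-reflected skylight from Lt
rrs = lw / es          # remote-sensing reflectance
nlw = rrs * f0         # normalized water-leaving radiance
```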

    Techniques used are as described in:
    Balch WM, Drapeau DT, Bowler BC, Booth ES, Windecker LA, Ashe A (2008) Space–time variability of carbon standing stocks and fixation rates in the Gulf of Maine, along the GNATS transect between Portland, ME, USA, and Yarmouth, Nova Scotia, Canada.
    J Plankton Res 30:119–139

  11. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of each word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from the Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of the LSC is described in the README file for the LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
    1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose their actual meaning. Uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added the commonly used prefixes 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning when the character "-" is removed. Some examples of such words are "z-test", "well-known" and "chi-square"; these were substituted by "ztest", "wellknown" and "chisquare". Such words were identified by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining "-" characters are replaced by a space.
    6. Removing numbers: All digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: Contains the unique words from the corpus. All words are lowercase and in their stem forms. The field is sorted by the number of documents that contain each word, in descending order.
    Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, it is counted once; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these counts over the entire corpus.
    Number of Appearances in Corpus: Contains how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
    Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from the LSC as defined in the previous section.

    The code can be used as follows:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory for the output files.
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
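    The Step 4 pre-processing described above can be sketched in a few lines of base R. This is a simplified stand-in, not the published LScD_Creation.R code: prefix uniting, the substitution list and stemming are omitted (the published pipeline also uses the 'tm' package's 174 stop words), and the stop word list below is a tiny illustrative subset.

```r
# Simplified stand-in for the Step 4 pipeline (base R only; no 'tm', no stemmer).
preprocess_abstract <- function(text,
                                stopwords = c("i", "the", "a", "an", "of", "in")) {
  text  <- gsub("[^[:alnum:]-]", " ", text)  # 1. punctuation/special chars -> space, keep "-"
  text  <- tolower(text)                     # 2. lowercase
  text  <- gsub("-", " ", text)              # 5. remove remaining "-" (steps 3-4 skipped here)
  text  <- gsub("\\b[0-9]+\\b", " ", text)   # 6. drop standalone numbers, keep "co2", "21st"
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]
  words[!words %in% stopwords]               # 8. stop word removal
}

tokens <- preprocess_abstract("The z-score of CO2 samples: 21 runs of pre-processing.")
```

    Counting the number of documents containing each word then reduces to tabulating the unique tokens of each abstract across the corpus.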

  12. Data from: A Parameterized Complexity View on Collapsing k-Cores

    • resodate.org
    Updated Dec 6, 2021
    Cite
    Junjie Luo; Hendrik Molter; Ondřej Suchý (2021). A Parameterized Complexity View on Collapsing k-Cores [Dataset]. http://doi.org/10.14279/depositonce-12762
    Explore at:
    Dataset updated
    Dec 6, 2021
    Dataset provided by
    DepositOnce
    Technische Universität Berlin
    Authors
    Junjie Luo; Hendrik Molter; Ondřej Suchý
    Description

    We study the NP-hard graph problem COLLAPSED K-CORE where, given an undirected graph G and integers b, x, and k, we are asked to remove b vertices such that the k-core of the remaining graph, that is, the (uniquely determined) largest induced subgraph with minimum degree k, has size at most x. COLLAPSED K-CORE was introduced by Zhang et al. (2017) and is motivated by the study of engagement behavior of users in a social network and by measuring the resilience of a network against user drop-outs. COLLAPSED K-CORE is a generalization of R-DEGENERATE VERTEX DELETION (which is known to be NP-hard for all r ≥ 0) where, given an undirected graph G and integers b and r, we are asked to remove b vertices such that the remaining graph is r-degenerate, that is, every subgraph of it has minimum degree at most r. We investigate the parameterized complexity of COLLAPSED K-CORE with respect to the parameters b, x, and k, and several structural parameters of the input graph. We reveal a dichotomy in the computational complexity of COLLAPSED K-CORE for k ≤ 2 and k ≥ 3. For the latter case it is known that for all x ≥ 0 COLLAPSED K-CORE is W[P]-hard when parameterized by b. For k ≤ 2 we show that COLLAPSED K-CORE is W[1]-hard when parameterized by b and in FPT when parameterized by (b + x). Furthermore, we outline that COLLAPSED K-CORE is in FPT when parameterized by the treewidth of the input graph and presumably does not admit a polynomial kernel when parameterized by the vertex cover number of the input graph.

  13. Supplementary data and code 1 for "Significant shifts in latitudinal optima of North American birds" (PNAS)

    • figshare.com
    zip
    Updated Apr 1, 2024
    Cite
    Paulo Mateus Martins (2024). Supplementary data and code 1 for "Significant shifts in latitudinal optima of North American birds" (PNAS) [Dataset]. http://doi.org/10.6084/m9.figshare.24881544.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Paulo Mateus Martins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Significant shifts in latitudinal optima of North American birds (PNAS)
    Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett

    Overview

    This file contains the raw 2022 release of the North American Breeding Bird Survey (BBS) dataset (Ziolkowski Jr et al. 2022), the filtered version used in our paper, and the code that generated it. We also include code for using BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset, and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.

    Reference

    D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5

    Detailed file description

    info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.

    dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). It holds data frames created from the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (actual counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the "code1_build_long_wide_datasets" R script.

    code1_build_long_wide_datasets: This code filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).

    dat2_filtered_data_long: The filtered raw dataset in long form. This dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.

    dat3_filtered_data_wide: The filtered raw dataset in its widest form, subject to the same further filtering as dat2_filtered_data_long; the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.

    code2_sampling coverage: This code determines how much of a bird distribution is covered by the BBS sampling extent (refer to Dataset S1). It is important to note that this script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share. The shapefiles can be requested directly at https://datazone.birdlife.org/species/requestdis
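    As an illustration of the long-to-wide reshaping performed by code1_build_long_wide_datasets, the sketch below pivots a toy count table with base R's reshape(); the column names and values are hypothetical, not the dataset's actual fields.

```r
# Toy long table: one row per route x year x species (illustrative columns).
long <- data.frame(
  route   = c("R1", "R1", "R2", "R2"),
  year    = c(2021, 2021, 2021, 2021),
  species = c("Wood Thrush", "Veery", "Wood Thrush", "Ovenbird"),
  count   = c(3, 1, 2, 5)
)

# Wide form: one row per route x year, one "count.<species>" column per species.
wide <- reshape(long, idvar = c("route", "year"),
                timevar = "species", direction = "wide")
wide[is.na(wide)] <- 0  # a species never recorded on a route becomes a zero count
```

    The same reshape can be undone with direction = "long", which is handy for checking that no counts were lost in the pivot.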

  14. Data and R code from: Global Phanerozoic biodiversity, can variation be explained by spatial sampling intensity

    • data.niaid.nih.gov
    zip
    Updated Jul 27, 2024
    Cite
    Daniel Phillipi (2024). Data and R code from: Global Phanerozoic biodiversity, can variation be explained by spatial sampling intensity [Dataset]. http://doi.org/10.5061/dryad.2280gb621
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Syracuse University
    Authors
    Daniel Phillipi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Variation in observed global generic richness over the Phanerozoic must be partly explained by changes in the numbers of fossils and their geographic spread over time. The influence of sampling intensity (i.e., the number of samples) has been well addressed, but the extent to which the geographic distribution of samples might influence recovered biodiversity is comparatively unknown. To investigate this question, we create models of genus richness through time by resampling the same occurrence dataset of modern global biodiversity using spatially explicit sampling intensities defined by the paleo-coordinates of fossil occurrences from successive time intervals. Our steady-state null model explains about half of observed change in uncorrected fossil diversity and a quarter of variation in sampling-standardized diversity estimates. The inclusion in linear models of two additional explanatory variables associated with the spatial array of fossil data (absolute latitudinal range of occurrences, percent of occurrences from shallow environments) and a Cenozoic step increases the accuracy of steady-state models, accounting for 67% of variation in sampling-standardized estimates and more than one third of the variation in first differences. Our results make clear that the spatial distribution of samples is at least as important as numerical sampling intensity in determining the trajectory of recovered fossil biodiversity through time, and caution against overinterpretation of both the variation and the trend that emerge from analyses of global Phanerozoic diversity.

    Methods

    Fossil data were downloaded from the Paleobiology Database and manually cleaned to remove errors (i.e., non-marine organisms included in the marine dataset). Modern marine invertebrate data were downloaded from the Ocean Biodiversity Information System using the R API. Further data transformations and statistical analyses were performed on the datasets using the R code provided.
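    The spatially explicit resampling idea can be sketched as follows; the data structures and numbers here are assumed for illustration and do not come from the archived code. Modern occurrences are drawn cell by cell, with the number of draws per grid cell set by the fossil sampling intensity in that cell for one time interval.

```r
set.seed(1)  # reproducible toy example

# Stand-in for the modern occurrence dataset: grid cell and genus per record.
modern <- data.frame(cell  = sample(c("A", "B", "C"), 300, replace = TRUE),
                     genus = sample(paste0("g", 1:40), 300, replace = TRUE))

# Fossil occurrences per cell in one time interval (the spatial sampling intensity).
fossil_intensity <- c(A = 20, B = 5)

# Resample modern records cell by cell at the fossil intensity.
draws <- do.call(rbind, lapply(names(fossil_intensity), function(cl) {
  pool <- modern[modern$cell == cl, ]
  pool[sample(nrow(pool), fossil_intensity[[cl]], replace = TRUE), ]
}))

richness <- length(unique(draws$genus))  # null-model genus richness for the interval
```

    Repeating this draw for each time interval yields a richness curve driven purely by where and how heavily the fossil record samples, with the underlying biota held constant.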

  15. Data Mining Project - Boston

    • kaggle.com
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 25, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for how you do this:

    This loads the file into R

    df <- read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black <- subset(df, name == 'Black')

    This next portion of code saves your subset so it can be loaded back into R later. Write the data frame to a CSV file on your computer:

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not familiar with your working directory, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  16. (Other) Suburb/Locality Boundaries - Geoscape Administrative Boundaries

    • researchdata.edu.au
    • data.gov.au
    Updated Sep 9, 2014
    Cite
    Department of Industry, Science and Resources (DISR) (2014). (Other) Suburb/Locality Boundaries - Geoscape Administrative Boundaries [Dataset]. https://researchdata.edu.au/other-suburblocality-boundaries-administrative-boundaries/2976691
    Explore at:
    Dataset updated
    Sep 9, 2014
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Department of Industry, Science and Resources (DISR)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    The digital Suburb/Locality Boundaries and their legal identifiers have been derived from the cadastre data from each state and territory jurisdiction and are available below.

    Suburb/Locality Boundaries are part of Geoscape Administrative Boundaries, which is built and maintained by Geoscape Australia using authoritative government data. Further information about contributors to Administrative Boundaries is available here.

    The full Administrative Boundaries dataset comprises seven Geoscape products:

    * Localities
    * Local Government Areas (LGAs)
    * Wards
    * Australian Bureau of Statistics (ABS) Boundaries
    * Electoral Boundaries
    * State Boundaries
    * Town Points

    Updated versions of Administrative Boundaries are published on a quarterly basis. Users have the option to download datasets with feature coordinates referencing either the GDA94 or GDA2020 datum.

    There were no updates in the May 2025 release.

    Notable changes in the August 2021 release:

    * The Localities, Local Government Areas and Wards products have been redesigned to provide a new flattened data model, offering a simpler, easier-to-use structure. This will also:
      - change the composition of identifiers in these products.
      - provide state identifiers as an abbreviation (e.g. NSW) rather than a code.
      - remove the static SA "Hundreds" data from Localities.
    * More information on the changes to Localities, Local Government Areas and Wards is available here.
    * The Australian Bureau of Statistics (ABS) Boundaries will be updated to include the 2021 Australian Statistical Geography Standard (ASGS).
    * Further information on the August changes to Geoscape datasets is available here.

    Further information on Administrative Boundaries, including FAQs on the data, is available here through Geoscape Australia's network of partners. They provide a range of commercial products based on Administrative Boundaries, including software solutions, consultancy and support.

    Note: On 1 October 2020, PSMA Australia Limited began trading as Geoscape Australia.

    License Information

    The Australian Government has negotiated the release of Administrative Boundaries to the whole economy under an open CC BY 4.0 licence.

    Users must only use the data in ways that are consistent with the Australian Privacy Principles issued under the Privacy Act 1988 (Cth).

    Users must also note the following attribution requirements:

    Preferred attribution for the Licensed Material:

    Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).

    Preferred attribution for Adapted Material:

    Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).

  17. Effect of data source on estimates of regional bird richness in northeastern United States

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Explore at:
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Columbia University
    University of Michigan
    New York State Department of Environmental Conservation
    Agricultural Research Service
    Massachusetts Audubon Society
    Gettysburg College
    Hebrew University of Jerusalem
    University of Vermont
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. 
Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
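    The name-matching step described above can be sketched as a lookup table applied before the two datasets are bound together; the helper below is illustrative, not the authors' code, and lists only a few of the lumping rules.

```r
# Map variant names onto one canonical name (a subset of the rules above).
lump <- c("Alder Flycatcher"       = "Traill's Flycatcher",
          "Willow Flycatcher"      = "Traill's Flycatcher",
          "Yellow-shafted Flicker" = "Northern Flicker")

harmonize <- function(sp) unname(ifelse(sp %in% names(lump), lump[sp], sp))

# Harmonize each source, then merge as in the final step.
bba  <- data.frame(source = "BBA", species = harmonize(c("Alder Flycatcher", "Ovenbird")))
bbs  <- data.frame(source = "BBS", species = harmonize("Willow Flycatcher"))
both <- rbind.data.frame(bba, bbs)
```

    Keeping the mapping in one table makes the lumping decisions auditable and easy to extend when new naming mismatches turn up.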

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.

  18. Marine geophysical data exchange files for R/V Kilo Moana: 2002 to 2018

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 27, 2021
    Cite
    Hamilton, Michael (2021). Marine geophysical data exchange files for R/V Kilo Moana: 2002 to 2018 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4699568
    Explore at:
    Dataset updated
    Apr 27, 2021
    Dataset authored and provided by
    Hamilton, Michael
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary:

    Marine geophysical exchange files for R/V Kilo Moana: 2002 to 2018 includes 328 geophysical archive files spanning km0201, the vessel's very first expedition, through km1812, the last survey included in this data synthesis.

    Data formats (you will likely require only one of these):

    MGD77T (M77T): ASCII - the current standard format for marine geophysical data exchange, tab delimited, low human readability

    MGD77: ASCII - legacy format for marine geophysical data exchange (no longer recommended due to truncated data precision and low human readability)

    GMT DAT: ASCII - the Generic Mapping Tools format in which these archive files were built, best human readability but largest file size

    MGD77+: highly flexible and disk space saving binary NetCDF-based format, enables adding additional columns and application of errata-based data correction methods (i.e., Chandler et al, 2012), not human readable

    The process by which formats were converted is explained below.

    Data Reduction and Explanation:

    R/V Kilo Moana routinely acquired bathymetry data using two concurrently operated sonar systems hence, for this analysis, a best effort was made to extract center beam depth values from the appropriate sonar system. No resampling or decimation of center beam depth data has been performed with the exception that all depth measurements were required to be temporally separated by at least 1 second. The initial sonar systems were the Kongsberg EM120 for deep and EM1002 for shallow water mapping. The vessel's deep sonar system was upgraded to Kongsberg EM122 in January of 2010 and the shallow system to EM710 in March 2012.

    The vessel deployed a Lacoste and Romberg spring-type gravity meter (S-33) from 2002 until March 2012, when it was replaced with a Bell Labs BGM-3 forced feedback-type gravity meter. Importantly, gravity tie-in logs were by and large inadequate for the rigorous removal of gravity drift and tares. Hence a best effort has been made to remove gravity meter drift via robust regression to satellite-derived gravity data. Regression slope and intercept are analogous to instrument drift and DC shift; hence their removal markedly improves the agreement between shipboard and satellite gravity anomalies for most surveys. These drift corrections were applied to both the observed gravity and free-air anomaly fields. For users who prefer the uncorrected data, the correction coefficients are supplied within the metadata headers of all gravity surveys, allowing these drift corrections to be undone.
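A minimal sketch of this drift correction on synthetic values, with ordinary least squares standing in for the robust regression actually used (function name and numbers are illustrative assumptions):

```python
import numpy as np

def remove_gravity_drift(t, ship_faa, sat_faa):
    """Fit a line to (ship - satellite) free-air anomaly versus time and
    subtract it; the slope plays the role of instrument drift and the
    intercept the DC shift."""
    resid = ship_faa - sat_faa
    slope, intercept = np.polyfit(t, resid, 1)
    corrected = ship_faa - (slope * t + intercept)
    return corrected, slope, intercept

# Synthetic survey: flat 10 mGal satellite anomaly; shipboard data carry
# a 0.5 mGal/hr drift plus a 3 mGal DC shift.
t = np.linspace(0.0, 10.0, 50)   # hours
sat = np.full_like(t, 10.0)      # mGal
ship = sat + 0.5 * t + 3.0
corr, slope, intercept = remove_gravity_drift(t, ship, sat)
print(round(float(slope), 3), round(float(intercept), 3))  # 0.5 3.0
```

Keeping the fitted slope and intercept, as the archive headers do, is what lets a user reverse the correction later.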

    The L&R gravity meter had a 180 second hardware filter, so for this analysis the data were Gaussian filtered a further 180 seconds and resampled at 10 second intervals. BGM-3 data are not hardware filtered, hence a 360 second Gaussian filter was applied for this analysis, and BGM-3 gravity anomalies were resampled at 15 second intervals. For both meter types, no interpolation was performed through data gaps exceeding the filter length. Eotvos corrections were computed via the standard formula (e.g., Dehlinger, 1978) and subjected to the same filtering as the respective gravity meter's data.
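A simplified sketch of the filter-and-resample step, assuming a uniformly sampled 1 Hz series; the sigma choice (filter length / 6) and the omission of gap handling are simplifications of this sketch, not statements about the actual reduction:

```python
import numpy as np

def gaussian_filter_resample(values, filter_len_s, resample_s, dt_s=1.0):
    """Smooth a uniformly sampled (dt_s spacing) series with a Gaussian
    window filter_len_s seconds wide, then take every resample_s-th
    second. Gap handling (no interpolation across gaps longer than the
    filter) is omitted for brevity."""
    half = int(filter_len_s / (2 * dt_s))
    x = np.arange(-half, half + 1) * dt_s
    sigma = filter_len_s / 6.0          # ~99.7% of the weight in-window
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()              # unit gain at DC
    smoothed = np.convolve(values, kernel, mode="same")
    return smoothed[:: int(resample_s / dt_s)]

# A constant series passes through unchanged away from the edges.
out = gaussian_filter_resample(np.full(600, 9.81), 180.0, 10.0)
print(out.shape)  # (60,)
```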

    The vessel also deployed a Geometrics G-882 cesium vapor magnetometer on several expeditions. A Gaussian filter length of 135 seconds was applied and the data were resampled at 15 second intervals, again with no interpolation through data gaps exceeding the filter length.

    Archive file production:

    At all depth, gravity and magnetic measurement times, vessel GPS navigation was resampled using linear interpolation, as most geophysical measurement times did not exactly coincide with GPS position times. The geophysical fields were then merged with the resampled vessel navigation and listed sequentially in the GMT DAT format to produce data records.
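The navigation resampling step amounts to one-dimensional linear interpolation at the measurement times (a sketch; the function and variable names are assumptions, and np.interp does not handle longitude wraparound near +/-180 degrees, which a real reduction must):

```python
import numpy as np

def navigate_at(obs_times, gps_times, gps_lat, gps_lon):
    """Linearly interpolate GPS positions to geophysical measurement
    times; gps_times must be increasing."""
    lat = np.interp(obs_times, gps_times, gps_lat)
    lon = np.interp(obs_times, gps_times, gps_lon)
    return lat, lon

# A fix halfway between two GPS points lands halfway in lat and lon.
gps_t = np.array([0.0, 60.0])
lat, lon = navigate_at(np.array([30.0]), gps_t,
                       np.array([21.0, 21.1]), np.array([-158.0, -158.2]))
print(round(float(lat[0]), 2), round(float(lon[0]), 2))  # 21.05 -158.1
```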

    Archive file header fields were populated with relevant information such as port names, PI names, instrument and data processing details, and others whereas survey geographic and temporal boundary fields were automatically computed from the data records.

    Archive file conversion:

    Once completed, each marine geophysical data exchange file was converted to the other formats using the Generic Mapping Tools program known as mgd77convert. For example, conversions to the other formats were carried out as follows:

    mgd77convert km0201.dat -Ft -Tm # gives mgd77t (m77t file extension)

    mgd77convert km0201.dat -Ft -Ta # gives mgd77

    mgd77convert km0201.dat -Ft -Tc # gives mgd77+ (nc file extension)

    Disclaimers:

    These data have not been edited in detail using a visual data editor and data outliers are known to exist. Several hardware malfunctions are known to have occurred during the 2002 to 2018 time frame and these malfunctions are apparent in some of the data sets. No guarantee is made that the data are accurate and they are not meant to be used for vessel navigation. Close scrutiny and further removal of outliers and other artifacts is recommended before making scientific determinations from these data.

    The archive file production method employed for this analysis is explained in detail by Hamilton et al (2019).

  19. Data from: Automatic Definition of Robust Microbiome Sub-states in...

    • zenodo.org
    txt, zip
    Updated Jan 24, 2020
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson (2020). Data from: Automatic Definition of Robust Microbiome Sub-states in Longitudinal Data [Dataset]. http://doi.org/10.5281/zenodo.167376
    Explore at:
    zip, txtAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Output files produced by applying our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to several previously published microbiome datasets.

    Prefixes:

    Suffixes:

    • _All: all taxa

    • _Dominant: only the 1% most abundant taxa

    • _NonDominant: remaining taxa after removing the dominant taxa above

    • _GenusAll: taxa aggregated at genus level

    • _GenusDominant: taxa aggregated at genus level, then only the 1% most abundant taxa retained

    • _GenusNonDominant: taxa aggregated at genus level, then the 1% most abundant taxa removed

    Each folder contains 3 output files related to the same input dataset:
    - data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
    - definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
    - sampleId-cluster_pairs_XXX.txt: text file with two comma-separated columns: sampleID,clusterID
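The sampleId-cluster pairs file is trivial to consume downstream; a minimal reader might look like this (the helper name is illustrative, not part of the published software):

```python
import csv
import io

def read_sample_clusters(handle):
    """Parse a two-column sampleID,clusterID file into a dict mapping
    sample IDs to cluster labels (kept as strings)."""
    return {row[0]: row[1] for row in csv.reader(handle) if row}

example = io.StringIO("S001,1\nS002,3\nS003,1\n")
print(read_sample_clusters(example))  # {'S001': '1', 'S002': '3', 'S003': '1'}
```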

    Abstract of the associated paper:

    The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has yet emerged. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, with respect to the conditions of the study. The robustness of sub-state identification is established by combining diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets so the procedure can be adjusted to the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies of microbiome dynamics.

  20. Woody habitat corridor data in South West England

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +1more
    zip
    Updated Mar 21, 2017
    + more versions
    R.K. Broughton; F. Gerard; R. Haslam; A.S. Howard (2017). Woody habitat corridor data in South West England [Dataset]. http://doi.org/10.5285/4b5680d9-fdbc-40c0-96a1-4c022185303f
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    R.K. Broughton; F. Gerard; R. Haslam; A.S. Howard
    Time period covered
    Jul 1, 2013 - Aug 31, 2013
    Area covered
    Description

    This dataset contains polylines depicting non-woodland linear tree and shrub features in Cornwall and much of Devon, derived from lidar data collected by the Tellus South West project. Data from a lidar (light detection and ranging) survey of South West England was used with existing open source GIS datasets to map non-woodland linear features consisting of woody vegetation. The output dataset is the product of several steps of filtering and masking the lidar data using GIS landscape feature datasets available from the Tellus South West project (digital terrain model (DTM) and digital surface model (DSM)), the Ordnance Survey (OS VectorMap District and OpenMap Local, to remove buildings) and the Forestry Commission (Forestry Commission National Forest Inventory Great Britain 2015, to remove woodland parcels). The dataset was tiled as 20 x 20 km shapefiles, coded by the bottom-left 10 km hectad name. Ground-truthing suggests an accuracy of 73.2% for hedgerow height classes.
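The filtering chain described above (canopy height from DSM minus DTM, then masking out buildings and woodland parcels) can be sketched on raster arrays as follows; the 1 m height threshold and the array layout are illustrative assumptions, not the published parameters:

```python
import numpy as np

def woody_feature_mask(dsm, dtm, building_mask, woodland_mask, min_height=1.0):
    """Derive a canopy height model (DSM - DTM), then keep cells that
    are at least min_height tall and fall outside the building and
    woodland masks."""
    chm = dsm - dtm
    return (chm >= min_height) & ~building_mask & ~woodland_mask

# 2x2 toy rasters: one tall hedgerow cell, one low cell, one building,
# one woodland parcel.
dsm = np.array([[5.0, 2.0], [9.0, 3.0]])
dtm = np.array([[1.0, 1.8], [2.0, 1.0]])
buildings = np.array([[False, False], [True, False]])
woodland = np.array([[False, False], [False, True]])
mask = woody_feature_mask(dsm, dtm, buildings, woodland)
print(mask)
# [[ True False]
#  [False False]]
```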
