This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files; a short illustrative sketch is also given at the end of this description.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see rows with multiple_tiles = "yes" in site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

Files in this data release:
- "year_byscene=XXXX.zip" – Temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the _byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data were extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset is: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – Summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset is: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, compile the nested .parquet files using the R arrow package, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This crosswalk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This crosswalk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD tile grid. It also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – A map of the Landsat grid tiles labelled by the horizontal-vertical ID.
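Below is a minimal, illustrative R sketch (not a copy of example_script_for_using_parquet.R) showing how the nested .parquet directories could be opened with the arrow package and screened using the cloud and water-pixel filters suggested above; the 50% cloud threshold is an arbitrary example value.

```r
library(arrow)
library(dplyr)

# Open the unzipped year_byscene=2023 directory; the nested tile_hv=... partitions
# are discovered automatically as a partition column.
byscene <- open_dataset("year_byscene=2023")

clean <- byscene |>
  mutate(percent_cloud_pixels = wb_dswe9_pixels /
           (wb_dswe9_pixels + wb_dswe1_pixels)) |>
  filter(percent_cloud_pixels < 0.5,  # example cloud-cover cutoff; choose your own
         wb_dswe1_pixels >= 10,       # drop waterbodies with few water pixels
         dp_dswe == 1) |>             # keep rows where the deepest point is water
  collect()
```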
https://doi.org/10.5061/dryad.brv15dvh0
On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
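A minimal R sketch for summarizing these data follows; the file name and the name of the response column are placeholders, since only the 0/1 response coding and the step columns are specified above.

```r
library(dplyr)

# Hypothetical file and response column names; responses are coded 0 = "SET", 1 = "SAT".
trials <- read.csv("set_sat_responses.csv")

# Proportion of "SAT" responses at each duration x spectral-quality step.
trials |>
  group_by(DurationStep, SpectralStep) |>
  summarise(prop_sat = mean(Response), .groups = "drop")
```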
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CTD data were acquired when the RMT instrument was in the water.
Data Acquisition:
There is an FSI CTD sensor housed in a fibreglass box attached to the top bar of the RMT. The RMT software running in the aft control room establishes a Telnet connection to the aft control terminal server, which connects to the CTD sensor through various hardware connections. Included are the calibration data for the CTD sensor that were used for the duration of the voyage.
The RMT software receives packets of CTD data, and every second the most recent CTD data are written out to a data file. Additional information about the motor is also logged with the CTD data.
Data are only written to the data file when the net is in the water. The net in and out of water status is determined by the conductivity value. The net is deemed to be in the water when the conductivity averaged over a 10 second period is greater than 0. When the average value is less than 0 the net is deemed to be out of the water. New data files were automatically created for each trawl.
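As a rough illustration of that rule (this is not the RMT software itself), the in-water flag could be reproduced from a 1 Hz conductivity series as follows, assuming the readings are stored in a numeric vector named conductivity:

```r
library(zoo)

# 10-second running mean of conductivity; the net is deemed in the water
# whenever the averaged value is greater than 0.
in_water <- rollmean(conductivity, k = 10, fill = NA, align = "right") > 0
```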
Data Processing:
If the net did not open when first attempted then the net was 'jerked' open. This meant the winch operator adjusted the winch control so that it was at maximum speed and then turned it on for a very short time. This had the effect of dropping the net a short distance very quickly. This dislodges the net hook from its cradle and the net opens. The scientist responsible for the trawl would have noted the time in the trawl log book that the winch operator turned on the winch to jerk the net.
The data files will have started the 'net open' counter 10 seconds after the user clicks the 'Net Open' button. If this time did not match the time written in the trawl log book by the scientist, then the net open time in the CSV file was adjusted. The value in the 'Net Open Time' column will increment from the time the net started to open to the time that the net started to close.
The pressure was also plotted to ensure that the time written down in the log book was correct. When the net opens there is a visible change in the CTD pressure value received: the net 'flies' up as the drag in the water increases when the net opens. If the time noted was incorrect then the scientist responsible for the log book, So Kawaguchi, was notified of the problem and the data file was not adjusted.
The original log files produced by the RMT software were trimmed to remove any columns that did not pertain to the CTD data. These columns include the motor information and the ITI data. The ITI data give information about the distance from the net to the ship, but the system was not working for the duration of the BROKE-West voyage. This trimming was completed using a purpose-built Java application. This Java class is part of the NOODLES source code.
Dataset Format:
The dataset is in a zip format. There is a .CSV file for each trawl, 125 in total. There were 51 Routine trawls and 74 Target Trawls. The file naming convention is as follows:
[Routine/Target]NNN-rmt-2006-MM-DD.csv
Where,
NNN is the trawl number, from 001 to 124. MM is the month (01 or 02). DD is the day of the month.
Also included in the zip file are the calibration files for each of the CTD sensors and the current documentation on the RMT software.
Each CSV file contains the following columns:
- Date (UTC)
- Time (UTC)
- Ship Latitude (decimal degrees)
- Ship Longitude (decimal degrees)
- Conductivity (mS/cm)
- Temperature (Deg C)
- Pressure (DBar)
- Salinity (PSU)
- Sound Velocity (m/s)
- Fluorometer (ug/L chlA)
- Net Open Time (mm:ss): if the net is not open this value will be 0; otherwise, the number of minutes and seconds since the net opened is displayed.
When the user clicks the 'Net Open' button there is a delay of 10 seconds before the net starts to open. The value displayed in the 'Net Open Time' column starts incrementing once this 10-second delay has passed. Similarly, when the user clicks the 'Net Close' button there is a delay of 6 seconds before the net starts to close, and the counter stops once this 6-second delay has passed.
Acronyms Used:
CTD: Conductivity, Temperature, Depth; RMT: Rectangular Midwater Trawl; CSV: Comma Separated Value; FSI: Falmouth Scientific Inc; ITI: Intelligent Trawl Interface
This work was completed as part of ASAC projects 2655 and 2679 (ASAC_2655, ASAC_2679).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank Databank using the following approach on July 23, 2022.
Database: "World Development Indicators" Country: 266 (all available) Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total" Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021 Layout: Custom -> Time: Column, Country: Row, Series: Column Download options: Excel
Preprocessing
With LibreOffice:
- remove non-country entries (rows after Zimbabwe);
- shorten column names for easier processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note the use of '_', not '-', for R).
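For anyone preferring to do the same cleanup programmatically, a hedged R sketch is shown below; the input file name and the exact exported header strings are assumptions, so the patterns may need adjusting to match the actual Excel/CSV export.

```r
# Read the exported table without mangling column names.
wdi <- read.csv("wdi_export.csv", check.names = FALSE)

# Shorten the identifier columns.
names(wdi)[names(wdi) == "Country Name"] <- "Country"
names(wdi)[names(wdi) == "Country Code"] <- "Code"

# Collapse headers like "1990 ... GNI ..." to GNI_1990 (underscore, not hyphen, for R);
# analogous substitutions would be used for the CO2, GDP, and Population columns.
names(wdi) <- sub("^([0-9]{4}).*GNI.*$", "GNI_\\1", names(wdi))
```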
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes intermediate data from RiboBase that generates translation efficiency (TE). The code to generate the files can be found at https://github.com/CenikLab/TE_model.
We uploaded demo HeLa .ribo files, but due to the large storage requirements of the full dataset, I recommend contacting Dr. Can Cenik directly to request access to the complete version of RiboBase if you need the original data.
The detailed explanation for each file:
human_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in human.
human_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in human.
human_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in human.
human_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in human.
human_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in human.
human_TE_rho.rda: TE proportional similarity data as genes by genes matrix in human.
mouse_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in mouse.
mouse_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in mouse.
mouse_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in mouse.
mouse_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in mouse.
mouse_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in mouse.
mouse_TE_rho.rda: TE proportional similarity data as genes by genes matrix in mouse.
All of the data passed quality control. There are 1,054 human samples and 835 mouse samples, each meeting the following criteria:
* coverage > 0.1 X
* CDS percentage > 70%
* R2 between RNA and RIBO >= 0.188 (remove outliers)
All ribosome profiling data here are non-deduplicated and winsorized, paired with RNA-seq data that are deduplicated and not winsorized (although the files are named "flatten", this refers only to the naming format).
#### Code
To read the .rda files in R, use load("rdaname.rda").
If you need to calculate proportional similarity from clr data:
library(propr)
human_TE_homo_rho <- propr:::lr2rho(as.matrix(clr_data))
rownames(human_TE_homo_rho) <- colnames(human_TE_homo_rho) <- rownames(clr_data)
https://www.bco-dmo.org/dataset/660543/license
Water column data from CTD casts along the East Siberian Arctic Shelf on R/V Oden during 2011 (ESAS Water Column Methane project). Acquisition methods are described in Orcutt, B. et al. 2005.
Core sectioning, porewater collection and analysis
At each sampling site, sediment sub-samples were collected for porewater analyses and, at selected depths, for microbial rate assays (AOM, anaerobic oxidation of methane; methanogenesis (MOG) from bicarbonate and acetate). Sediment was expelled from the core liner using a hydraulic extruder under anoxic conditions. The depth intervals for extrusion varied. At each depth interval, a sub-sample was collected into a cut-off syringe for dissolved methane concentration quantification. Another 5 mL sub-sample was collected into a pre-weighed and pre-combusted glass vial for determination of porosity (determined by the change in weight after drying at 80 degrees Celsius to a constant weight). The remaining material was used for porewater extraction. Sample fixation and analyses for dissolved constituents followed the methods of Joye et al. (2010).
Microbial Activity Measurements
To determine AOM and MOG rates, 8 to 12 sub-samples (5 cm3) were collected from a core by manual insertion of a glass tube. For AOM, 100 uL of dissolved 14CH4 tracer (about 2,000,000 DPM as gas) was injected into each core. Samples were incubated for 36 to 48 hours at in situ temperature. Following incubation, samples were transferred to 20 mL glass vials containing 2 mL of 2M NaOH (which served to arrest biological activity and fix 14CO2 as 14C-HCO3-). Each vial was sealed with a teflon-lined screw cap, vortexed to mix the sample and base, and immediately frozen. Time zero samples were fixed immediately after radiotracer injection. The specific activity of the tracer substrate (14CH4) was determined by injecting 50 uL directly into scintillation cocktail (Scintiverse BD) followed by liquid scintillation counting. The accumulation of 14C product (14CO2) was determined by acid digestion following the method of Joye et al. (2010). The AOM rate was calculated using equation 1:
AOM Rate = [CH4] x alphaCH4/t x (a-14CO2/a-14CH4)    (Eq. 1)
Here, the AOM rate is expressed as nmol CH4 oxidized per cm3 sediment per day (nmol cm-3 d-1), [CH4] is the methane concentration (uM), alphaCH4 is the isotope fractionation factor for AOM (1.06; Alperin and Reeburgh, 1988), t is the incubation time (d), a-14CO2 is the activity of the product pool, and a-14CH4 is the activity of the substrate pool. If the methane concentration was not available, the turnover time of the 14CH4 tracer is presented.
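For illustration only, Eq. 1 can be evaluated directly as in the R sketch below; the input values are placeholders, not measurements from this dataset (since 1 uM equals 1 nmol cm-3, the result is already in nmol cm-3 d-1).

```r
# Eq. 1: AOM rate in nmol CH4 cm^-3 d^-1; all inputs are placeholder values.
aom_rate <- function(ch4_uM, t_days, a_14co2_dpm, a_14ch4_dpm, alpha = 1.06) {
  ch4_uM * alpha / t_days * (a_14co2_dpm / a_14ch4_dpm)
}

aom_rate(ch4_uM = 5, t_days = 2, a_14co2_dpm = 1500, a_14ch4_dpm = 2e6)
```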
Rates of bicarbonate-based methanogenesis and acetoclastic methanogenesis were determined by incubating samples in gas-tight, closed-tube vessels without headspace, to prevent the loss of gaseous 14CH4 product during sample manipulation. These sample tubes were sealed using custom-designed plungers (black Hungate stoppers with the lip removed, containing a plastic "tail" that was run through the stopper) inserted at the base of the tube; the sediment was then pushed via the plunger to the top of the tube until a small amount protruded through the tube opening. A butyl rubber septum was then eased into the tube opening to displace sediment in contact with the atmosphere and close the tube, which was then sealed with an open-top screw cap. The rubber materials used in these assays were boiled in 1N NaOH for 1 hour, followed by several rinses in boiling milliQ water, to leach potentially toxic substances.
A volume of radiotracer solution (100 uL of 14C-HCO3- tracer (~1 x 10^7 dpm in slightly alkaline milliQ water) or 1,2-14C-CH3COO- tracer (~5 x 10^7 dpm in slightly alkaline milliQ water)) was injected into each sample. Samples were incubated as described above and then 2 mL of 2N NaOH was injected through the top stopper into each sample to terminate biological activity (time zero samples were fixed prior to tracer injection). Samples were mixed to evenly distribute NaOH through the sample. Production of 14CH4 was quantified by stripping methane from the tubes with an air carrier, converting the 14CH4 to 14CO2 in a combustion furnace, and subsequently trapping the 14CO2 in NaOH as carbonate (Cragg et al., 1990; Crill and Martens, 1986). Activity of 14CO2 was measured subsequently by liquid scintillation counting.
The Bi-MOG and Ac-MOG rates were calculated using equations 2 and 3, respectively:
Bi-MOG Rate = [HCO3-] x alphaHCO3/t x (a-14CH4/a-H14CO3-)    (Eq. 2)
Ac-MOG Rate = [CH3COO-] x alphaCH3COO-/t x (a-14CH4/a-14CH314COO-)    (Eq. 3)
Both rates are expressed as nmol HCO3- or CH3COO-, respectively, reduced cm-3 d-1; alphaHCO3 and alphaCH3COO- are the isotope fractionation factors for MOG (assumed to be 1.06). [HCO3-] and [CH3COO-] are the porewater bicarbonate (mM) and acetate (uM) concentrations, respectively, t is the incubation time (d), a-14CH4 is the activity of the product pool, and a-H14CO3- and a-14CH314COO- are the activities of the substrate pools. If samples for substrate concentration determination were not available, the substrate turnover constant is presented instead of the rate.
For water column methane oxidation rate assays, triplicate 20 mL samples of live water (in addition to one 20 mL sample which was killed with ethanol (750 uL of pure EtOH) before tracer addition) were transferred from the CTD into serum vials. Samples were amended with 2 x 10^6 DPM of 3H-labeled methane tracer and incubated for 24 to 72 hours (linearity of activity was tested and confirmed). After incubation, samples were fixed with ethanol, as above, and a sub-sample was collected to determine total sample activity (3H-methane + 3H-water). Next, the sample was purged with nitrogen to remove the 3H-methane tracer, and a sub-sample was amended with scintillation fluid and counted on a shipboard scintillation counter to determine the activity of tracer in the product of 3H-methane oxidation, 3H-water. The methane oxidation rate was calculated as:
MOX Rate = [methane concentration in nM] x alphaCH4/t x (a-3H-H2O/a-3H-CH4)    (Eq. 4)
A dataset within the Harmonized Database of Western U.S. Water Rights (HarDWR). For a detailed description of the database, please see the meta-record v2.0.

Changelog

v2.0
- Recalculated based on data sourced from WestDAAT.
- Changed from using a Site ID column to identify unique records to using a combination of Site ID and Allocation ID.
- Removed the Water Management Area (WMA) column from the harmonized records. The replacement is a separate file which stores the relationship between allocations and WMAs. This allows allocations to contribute water right amounts to multiple WMAs during the subsequent cumulative process.
- Added a column describing a water right's legal status.
- Added "Unspecified" as a water source category.
- Added an acre-foot (AF) column.
- Added a column for the classification of the right's owner.

v1.02
- Added a .RData file to the dataset as a convenience for anyone exploring our code. This is an internal file, and the one referenced in analysis scripts, as the data are already in R data objects.

v1.01
- Updated the names of each file with an ID number of fewer than 3 digits to include leading 0s.

v1.0
- Initial public release.

Description

Here we present an updated database of Western U.S. water right records. This database provides consistent unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of seven broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of inter-sectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, Water Balance Model (WBM), with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S. West is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. West.

We produced the water rights database presented here in four main steps: (1) data collection, (2) data quality control, (3) data harmonization, and (4) generation of cumulative water rights curves. Each of steps (1)-(3) had to be completed in order to produce (4), the final product that was used in the modeling exercise in Grogan et al. (in review). All data in each step are associated with a spatial unit called a Water Management Area (WMA), which is the unit of water right administration used by the state from which the right came. Steps (2) and (3) required us to make assumptions and interpretations, and to remove records from the raw data collection. We describe each of these assumptions and interpretations below so that other researchers can choose to implement alternative assumptions and interpretations as fit their research aims.

Motivation for Changing Data Sources

The most significant change has been a switch from collecting the raw water rights directly from each state to using the water rights records presented in WestDAAT, a product of the Water Data Exchange (WaDE) Program under the Western States Water Council (WSWC). One of the main reasons for this is that each state of interest is a member of the WSWC, meaning that WaDE is partially funded by these states, as well as many universities.
As WestDAAT is also a database with consistent categorization, it has allowed us to spend less time on data collection and quality control and more time on answering research questions. This has included records from water right sources we had previously not known about when creating v1.0 of this database. The only major downside to utilizing the WestDAAT records as our raw data is that further updates are tied to when WestDAAT is updated, as some states update their public water right records daily. However, as our focus is on cumulative water amounts at the regional scale, it is unlikely that most record updates would have a significant effect on our results.

The structure of WestDAAT led to several important changes to how HarDWR is formatted. The most significant change is that WaDE has calculated a field known as SiteUUID, which is a unique identifier for the Point of Diversion (POD), or where the water is drawn from. This is separate from AllocationNativeID, which is the identifier for the allocation of water, or the amount of water associated with the water right. It should be noted that it is possible for a single site to have multiple allocations associated with it, and for an allocation to be extracted from multiple sites. The site-allocation structure has allowed us to adopt a more consistent, and hopefully more realistic, approach to organizing the water right records than we had with HarDWR v1.0. This was incredibly helpful, as the raw data from many states had multiple water uses within a single field within a single row, and it was not always clear whether the first water use was the most important or simply first alphabetically. WestDAAT has already addressed this data quality issue. Furthermore, with v1.0, when there were multiple records with the same water right ID, we selected the largest volume or flow amount and disregarded the rest. As WestDAAT was already a common structure for disparate data formats, we were better able to identify sites with multiple allocations and, perhaps more importantly, allocations with multiple sites. This is particularly helpful when an allocation has sites which cross WMA boundaries: instead of assigning the full water amount to a single WMA, we are now able to divide the amount of water between the relevant WMAs.

As it is now possible to identify allocations with water used in multiple WMAs, it is no longer practical to store this information within a single column. Instead, the stAllocationToWMATab.csv file was created, which is an allocation-by-WMA matrix containing the percent Place of Use area overlap with each WMA. We then use this percentage to divide the allocation's flow amount between the given WMAs during the cumulation process, to provide more realistic totals of water use in each area. However, not every state provides areas of water use, so, as in HarDWR v1.0, a hierarchical decision tree was used to assign each allocation to a WMA. First, if a WMA could be identified based on the allocation ID, then that WMA was used; typically, when available, this applied to the entire state and no further steps were needed. Second was the spatial analysis of Place of Use to WMAs. Third was a spatial analysis of the POD locations to WMAs, with the assumption that an allocation's POD is within the WMA it should belong to; if an allocation still had multiple WMAs based on its POD locations, then the allocation's flow amount was divided equally between all WMAs. The fourth, and final, process was to include water allocations which spatially fell outside of the state WMA boundaries. This could be due to several reasons, such as coordinate errors or imprecision in the POD location, imprecision in the WMA boundaries, or rights attached to features, such as a reservoir, which cross state boundaries. To include these records, we decided that any POD within one kilometer of the state's edge would be assigned to the nearest WMA.

Other Changes WestDAAT Has Allowed

In addition to a more nuanced and consistent method of assigning water rights data to WMAs, there are other benefits gained from using the WestDAAT dataset. Among them is a consistent categorization of a water right's legal status. In HarDWR v1.0, legal status was effectively ignored, which led to many valid concerns about the quality of the database related to the amounts of water the rights allowed to be claimed. The main issue was that rights with legal statuses such as "application withdrawn", "non-active", or "cancelled" were included within HarDWR v1.0. These, and other water right statuses which were deemed to not be in use, have been removed from this version of the database. Another major change has been the addition of the "Unspecified" water source category. This is water that can come from either surface water or groundwater, or the source of which is unknown. The addition of this source category brings the total number of categories to three. Due to reviewer feedback, we added the acre-foot (AF) column and the ownerClassification column so that the data may be more applicable to a wider audience.

File Descriptions

The dataset is a series of files organized by state sub-directories. In addition, each file begins with the state's name, in case the file is separated from its sub-directory for some reason. After the state name is text which describes the contents of the file. Each file is described in detail below. Note that st is a placeholder for the state's name.

stFullRecords_HarmonizedRights.csv: A file of the complete water records for each state. The column headers for this type of file are:
- state - The name of the state to which the allocations belong.
- FIPS - The two-digit numeric state ID code.
- siteID - The site location ID for POD locations. A site may have multiple allocations, which are the actual amounts of water which can be drawn. In a simplified hypothetical, a farmstead may have an allocation for "irrigation" and an allocation for "domestic" water use, but the water is drawn from the same pumping equipment. It should be noted that many of the site IDs appear to have been added by WaDE, and therefore may not be recognized by a given state's water rights database.
- allocationID - The allocation ID for the water right. For most states this is the water right ID, and what is
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.
Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).
At-a-station Hydraulic Geometry
We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:
\(w={aQ^b}\)
\(d=cQ^f\)
\(v=kQ^m\)
where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:
Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
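A minimal R sketch of this derivation for a single gage is given below (not the authors' code); the data frame and column names (Q, w, A, v) are assumptions for illustration.

```r
# Fit the Leopold & Maddock power laws as linear regressions in log space.
fit_ahg <- function(meas) {
  meas$d <- meas$A / meas$w                  # depth from measured area and width
  fit_w <- lm(log(w) ~ log(Q), data = meas)  # log w = log a + b * log Q
  fit_d <- lm(log(d) ~ log(Q), data = meas)  # log d = log c + f * log Q
  fit_v <- lm(log(v) ~ log(Q), data = meas)  # log v = log k + m * log Q
  c(a = exp(unname(coef(fit_w)[1])), b = unname(coef(fit_w)[2]),
    c = exp(unname(coef(fit_d)[1])), f = unname(coef(fit_d)[2]),
    k = exp(unname(coef(fit_v)[1])), m = unname(coef(fit_v)[2]))
}
```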
At-a-station Manning's n
We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by
\(n = \frac{R^{2/3} S^{1/2}}{v}\)
where v is velocity, R is hydraulic radius and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of \(10^{-5}\) m/m: no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th-percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10, which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, each intended to approximately represent median flow conditions.
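A small R sketch of this calculation is shown below; the numeric inputs are placeholders, and hydraulic radius is approximated by depth as described above.

```r
# Manning's n from depth (m), velocity (m/s), and slope (m/m), with R ~ depth.
manning_n <- function(depth_m, velocity_ms, slope) {
  depth_m^(2 / 3) * sqrt(slope) / velocity_ms
}

manning_n(depth_m = 1.2, velocity_ms = 0.8, slope = 5e-4)  # placeholder values
```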
Daily discharge percentiles
We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:
We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.
HG Regionalization
We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:
\(\log(Q_i) = \beta_1 \log(DA) + \beta_0\)
where \(Q_i\) is streamflow at percentile i, DA is drainage area, and \(\beta_1\) and \(\beta_0\) are regression parameters. We report \(\beta_1\), \(\beta_0\), and the r2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).
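A minimal R sketch of one such regression (for a single HUC4 and a single flow percentile) follows; the data frame and column names (DA, Q50) are assumptions.

```r
# Regress log streamflow on log drainage area across the gages in one HUC4.
fit <- lm(log(Q50) ~ log(DA), data = gages)

beta1 <- unname(coef(fit)["log(DA)"])       # slope
beta0 <- unname(coef(fit)["(Intercept)"])   # intercept
r2    <- summary(fit)$r.squared
```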
Dataset description
We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.
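In R, the leading zeros can be preserved by forcing the ID columns to character on import; the file and column names below are illustrative and should be matched to the actual header names in the csv.

```r
# Read gage and HUC4 IDs as character so leading zeros are not dropped.
hyg <- read.csv("HyG.csv",
                colClasses = c(gage_id = "character", huc4_id = "character"))
```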
Errata
In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.
A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.
In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to people's needs once the situation improves, to stand out from other service providers, and to make huge profits.
They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:
Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.
You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.
In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step as the R-squared score on the test set holds as a benchmark for your model.
https://creativecommons.org/publicdomain/zero/1.0/
This data was obtained from the Maricopa County Assessor under the search "Fast Food". The query has approximately 1,342 results, with only 1,000 returned due to MCA data policies.
Because some Subdivision Name values possess unescaped commas that interfered with Pandas' ability to properly align the columns, I performed some manual cleaning in LibreOffice.
Aside from a handful of Null values, the data is fairly clean and requires little from Pandas.
Here are the sums and percentage of NULLS in the dataframe.
Interestingly, there are 17 NULLs that do not have any physical addresses. This amounts to 1.7% of values for the Address, City, and Zip columns, and the missing values all fall in the same rows.
I have looked into a couple of these on the Maricopa County Assessor's GIS Portal, and they do not appear to have any assigned physical addresses. This is a good avenue of exploration for EDA. Possibly an error that could be corrected, or some obscure legal reason, but interesting nonetheless.
Additionally, there are 391 NULLs in Subdivision Name, accounting for 39.1%. This is a feature that I am interested in exploring to determine whether there are any predominant groups. It could also generate a list of entities that can be searched later to see if the dataset can be enriched beyond its initial 1,000-record limit.
There are 348 NULLs in the MCR column. This is the definition according to the MCA Glossary: MCR (Maricopa County Recorder number) - often associated with recorded plat maps. This seems to be an uninteresting nominal value, so I will drop this column.
While Property Type and Rental have no NULLs, 100% of those values are Fast Food Restaurant and N (for No), respectively; they therefore offer no useful information and will be dropped.
I will leave the S/T/R column; although it also seems to contain uninteresting nominal values, I am curious whether there are predominant groups, and since it also has no NULLs, it might be useful for further data enrichment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**************** NTU Dataset ReadMe file *******************

Please consider the latest version.

The attached files contain our data collected inside the Nanyang Technological University campus for pedestrian intention prediction. The dataset is particularly designed to capture spontaneous vehicle influences on pedestrian crossing/not-crossing intention. We utilize this dataset in our paper "Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields", submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

The dataset consists of 35 crossing and 35 stopping* (not-crossing) scenarios. The image sequences are in the 'Image_sequences' folder. The 'stopping_instants.csv' and 'crossing_instants.csv' files provide the stopping and crossing instants respectively, used for labeling the data and providing ground truth for evaluation. Camera1 and Camera2 images are synchronized. Two cameras were used to capture the whole scene of interest.

We provide pedestrian and vehicle bounding boxes obtained from [1]. Occlusions and mis-detections are linearly interpolated. All necessary detections are stored in the 'Object_detector_pedestrians_vehicles' folder. Each column within the csv files ('car_bndbox_..') corresponds to a unique tracked car within each image sequence. Each of the pedestrian csv files ('ped_bndbox_..') contains only one column, as we consider each pedestrian in the scene separately.

Additional details:
* [xmin xmax ymin ymax] = left right top down
* Dataset frequency: 15 fps
* Camera parameters (in pixels): f = 1135, principal point = (960, 540)

Additionally, we provide semantic segmentation output [2] and our depth parameters. As the data were collected in two phases, there are two files in each folder, highlighting the sequences in each phase. Crossing sequences 1-28 and stopping sequences 1-24 were collected in Phase 1, while crossing sequences 29-35 and stopping sequences 25-35 were collected in Phase 2. We obtained the optical flow from [3]. Our model (FLDCRF and LSTM) codes are available in the 'Models' folder.

If you use our dataset in your research, please cite our paper: "S. Neogi, M. Hoy, W. Chaoqun, J. Dauwels, 'Context Based Pedestrian Intention Prediction Using Factored Latent Dynamic Conditional Random Fields', IEEE SSCI-2017."

Please email us if you have any questions:
1. Satyajit Neogi, PhD Student, Nanyang Technological University - satyajit001@e.ntu.edu.sg
2. Justin Dauwels, Associate Professor, Nanyang Technological University - jdauwels@ntu.edu.sg

Our other group members include:
3. Dr. Michael Hoy - mch.hoy@gmail.com
4. Dr. Kang Dang - kangdang@gmail.com
5. Ms. Lakshmi Prasanna Kachireddy
6. Mr. Mok Bo Chuan Lance
7. Dr. Hang Yu - fhlyhv@gmail.com

References:
1. S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
2. A. Kendall, V. Badrinarayanan, R. Cipolla, "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding", BMVC 2017.
3. C. Liu, "Beyond Pixels: Exploring New Representations and Applications for Motion Analysis", Doctoral Thesis, Massachusetts Institute of Technology, May 2009.

* Please note, we had to remove sequence Stopping-33 for privacy reasons.