Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate overdispersion statistic, generate summary statistics, remove outliers
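A minimal sketch of this kind of analysis in R (not the authors' code; the data and variable names are illustrative), using MASS::glm.nb for the negative binomial regression and a Pearson-based overdispersion statistic:

# Negative binomial regression with an overdispersion check (illustrative).
library(MASS)

set.seed(1)
df <- data.frame(counts = rnbinom(100, mu = 5, size = 1.2),
                 group  = gl(2, 50))

# Fit the negative binomial model and summarise it
fit <- glm.nb(counts ~ group, data = df)
summary(fit)

# Overdispersion statistic: sum of squared Pearson residuals divided by the
# residual degrees of freedom (values well above 1 suggest overdispersion)
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)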
This dataset tracks the updates made on the dataset "MeSH 2023 Update - Delete Report" as a repository for previous versions of the data and metadata.
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. Deepest point values are extracted and reported only for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing these limitations (see the example R code after the file list):
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

Files included in this data release:
- "year_byscene=XXXX.zip" includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX datasets are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" is a crosswalk file identifying the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" is a crosswalk file identifying the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. It also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" is a map of the Landsat grid tiles labeled by the horizontal-vertical ID.
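A minimal sketch, in R, of applying the filters suggested above with the arrow package (the path and thresholds are illustrative; the column names follow the description in this release):

# Open an unzipped year_byscene directory of nested .parquet files and
# apply the cloud- and water-pixel filters described above (illustrative).
library(arrow)
library(dplyr)

lst <- open_dataset("year_byscene=2023")

filtered <- lst |>
  mutate(percent_cloud_pixels = wb_dswe9_pixels /
           (wb_dswe9_pixels + wb_dswe1_pixels)) |>
  filter(percent_cloud_pixels < 0.1,  # keep scenes with <10% cloud over the waterbody
         wb_dswe1_pixels >= 10,       # require at least 10 water pixels
         dp_dswe == 1) |>             # deepest point classified as water
  collect()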
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See below for details of the included files.
delly_vanc.vcf.gz # Raw output of Delly
b.vanc.fully.filtered.100k.plus.recode.vcf.gz # Output of freebayes, filtered using VCFtools v0.1.13 (Danecek et al. 2011) with the following flags: --remove-indels --min-alleles 2 --max-alleles 2 --minQ 20 --minDP 4 --max-missing 0.75
b.vanc.fully.filtered.100k.plus.recode.maf05.recode.ANN.vcf.gz #Fully filtered variant file (see manuscript for details) with annotation information
b.vanc.fully.filtered.100k.plus.recode.maf05.recode.impute.vcf.gz #Fully filtered variant file (see manuscript for details) after imputation with beagle
Trim_N_QC.sh #Trim raw sequencing data and run FastQC to evaluate the trimmed data
BWA_PICARD_vanc1.sh #Example of a script used to align sequence data to the reference genome using BWA; also uses Picard tools to sort, deduplicate, and index BAM files
P_call_test-2-vanc.sh #First part of pipeline for calling SNPs with freebayes (calls freebayes-parallel-part1_vanc.sh)
freebayes-parallel-part1_vanc.sh #see above
Filter_vanc.sh #Create list of SVs to filter from DELLY output
filter_delly.sh #filter based on generated list of SVs
delly_vanc.sh #call SVs using DELLY
bcf2vcf.sh # convert bcf from DELLY to vcf format
freebayes-parallel-part2.sh #Second part of freebayes pipeline
merge_vanc_vars.sh #Second part of freebayes pipeline (calls freebayes-parallel-part2.sh)
site_depth_vanc.sh #Gets site depth per SNP
remove_highdepth_vanc.sh #removes SNPs above depth threshold
hardy_vanc.sh #calculates HWE per SNP
remove_hwe_vanc.sh #removes SNPs based on HWE threshold
filter_vcf_size.sh #Removes SNPs on scaffolds less than 100Kb in size
filter_vcf_maf05.sh #filters SNPs based on 5% MAF filter
beagle.sh #imputes using beagle
LEA_con.R #converts vcf file into LFMM and geno format (see the sketch after this file list)
Snpeff_ANN.sh # annotate vcf file using SnpEff
plink_for_sambaR.sh # convert vcf file into format ready for use in sambaR
LD_test.sh #example of script used to calculate LD per scaffold
vcf_stats.sh #Gets various stats from final filtered vcf
get_pi_diversity.sh #gets per population nucleotide diversity
sambaR.R #Runs SambaR
lfmm2_analysis.R #Code for running analysis on output of LFMM2 and generating graphs
Max_ent_map.R #Generates maxent map
RDA_script.R #Code for RDA analysis of structural variants
snprelate_script.R #runs SNPRelate and makes graphs of Fst and pi along scaffolds of interest
repeat_correctedfst.R #Analysis for correlation between repeat density and Fst
LD_script.R #analysis of linkage
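As context for LEA_con.R above, a loose sketch of a VCF-to-LFMM/geno conversion using the LEA package (this is not the repository's script; the input file name is a placeholder for a decompressed, filtered VCF such as the imputed file listed above):

# Convert a VCF into .lfmm and .geno formats with the LEA package (illustrative).
library(LEA)

vcf_file <- "input.vcf"          # placeholder: a decompressed, filtered VCF
lfmm_file <- vcf2lfmm(vcf_file)  # writes an .lfmm file for LFMM analyses
geno_file <- vcf2geno(vcf_file)  # writes a .geno file for snmf/PCA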
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
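A loose sketch of the two-stage screening in R (the authors supply Matlab code; the published method scales thresholds by significance levels of the t-statistic, whereas fixed multipliers stand in here, and cycles is an assumed matrix with one gait cycle per row and one normalised time point per column):

# Two-stage outlier screening (illustrative re-implementation).
detect_outlier_cycles <- function(cycles, mad_scale = 3, win = 1, sd_scale = 3) {
  # Stage 1: one-dimensional (spatial) outliers at each time point,
  # flagged with the median absolute deviation
  stage1 <- apply(cycles, 2, function(x) abs(x - median(x)) > mad_scale * mad(x))
  bad1 <- rowSums(stage1) > 0

  # Stage 2: two-dimensional (spatial-temporal) outliers, using the
  # standard deviation over a moving window of neighbouring time points
  keep <- cycles[!bad1, , drop = FALSE]
  flags <- matrix(FALSE, nrow(keep), ncol(keep))
  for (j in seq_len(ncol(keep))) {
    idx <- max(1, j - win):min(ncol(keep), j + win)
    vals <- keep[, idx, drop = FALSE]
    flags[, j] <- abs(keep[, j] - mean(vals)) > sd_scale * sd(vals)
  }
  list(stage1_removed = which(bad1),
       stage2_flagged = which(rowSums(flags) > 0))
}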
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains open data and code to replicate the analysis in the manuscript "High-resolution mapping of wood burning appliance hotspots using Energy Performance Certificates: A case study of England and Wales".
To recreate the analysis on your local device, please carry out the following steps:
1. Clone the GitHub repository (available at: https://github.com/UCL-Wellcome-Trust-Air-Pollution/EPC_mapping_project_code) to your local device, or download the codebase from the 'Code.tar' folder and unzip it in your project directory. Please ensure you use the directory containing the R Project as your root directory.
2. Download the 'Data.tar' file and unzip it in the R Project directory. The data should be in a folder called 'Data' in the root directory. All non-EPC data is provided under the UK Open Government Licence version 3.0. EPC data is provided under licence from DLUHC: https://epc.opendatacommunities.org/docs/copyright.
3. Download the main EPC data to your local device and unzip it (see below for detailed instructions on how to do this). For Windows users, the 'Scripts' folder of the repository contains a .bat file which can be used to unzip the data; note that this requires 7-Zip to be installed and added to the system path. Otherwise, the .tar file can be unzipped manually.
4. Run the 'run.R' file in the 'Scripts' folder of the directory. You may need to change the 'path_data_epc_folders' variable to the path of the unzipped EPC data folders on your local device (see step 3). The full pipeline should now run.
5. Once you have run the pipeline for the first time, you should see a file called 'data_epc_raw.parquet' in the 'Data/raw/epc_data' folder. Once you have verified this is the case, you can safely delete the original unzipped EPC data folder, since it is very large (>40 GB). If you run the pipeline again, you will be notified that the raw EPC data .parquet file already exists and given the option to skip the merging of raw data files.
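A minimal sketch of the skip-if-exists behaviour described in step 5 (illustrative only; see run.R in the repository for the actual logic):

# Skip re-merging the raw EPC files if the combined .parquet already exists.
path_parquet <- file.path("Data", "raw", "epc_data", "data_epc_raw.parquet")

if (file.exists(path_parquet)) {
  message("data_epc_raw.parquet already exists; skipping merge of raw EPC data files.")
} else {
  # merge the unzipped EPC data folders into a single .parquet file here
}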
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PA: physical activity. Here we show only the first-interview data for variables used as time-fixed in the model (height, education, and smoking, following the change suggested by IDA), and we remove the observations that are missing by design.