100+ datasets found
  1. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz).
    More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
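
    As an illustration of the first step described above (decompressing the consensus archives and combining them into one sUMI and one dUMI multi-FASTA), the minimal base-R sketch below shows the general idea. It is not the authors' Sequence_Analysis.Rmd; the folder layout and file-name patterns inside the archives are assumptions for illustration only.

      # Minimal sketch: unpack consensus_*.tar.gz and concatenate the contained
      # FASTA files into one sUMI and one dUMI multi-FASTA (paths illustrative).
      archives <- list.files("Pipeline_Outputs",
                             pattern = "^consensus_.*\\.tar\\.gz$", full.names = TRUE)
      outdir <- "consensus_unpacked"
      dir.create(outdir, showWarnings = FALSE)
      for (a in archives) untar(a, exdir = outdir)

      combine_fasta <- function(pattern, outfile) {
        files <- list.files(outdir, pattern = pattern, recursive = TRUE, full.names = TRUE)
        writeLines(unlist(lapply(files, readLines)), outfile)
      }
      combine_fasta("sUMI.*\\.fasta$", "all_sUMI.fasta")  # all sUMI consensus sequences
      combine_fasta("dUMI.*\\.fasta$", "all_dUMI.fasta")  # all rank 1 dUMI consensus sequences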

  2. Meta-Analysis and modeling of vegetated filter removal of sediment using...

    • catalog.data.gov
    Updated Nov 22, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Meta-Analysis and modeling of vegetated filter removal of sediment using global dataset [Dataset]. https://catalog.data.gov/dataset/meta-analysis-and-modeling-of-vegetated-filter-removal-of-sediment-using-global-dataset
    Dataset updated
    Nov 22, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data on vegetated filter strips, sediment loading into and out of riparian corridors/buffers (VFS), removal efficiency of sediment, meta-analysis of removal efficiencies, dimensional analysis of predictor variables, and regression modeling of VFS removal efficiencies. This dataset is associated with the following publication: Ramesh, R., L. Kalin, M. Hantush, and A. Chaudhary. A secondary assessment of sediment trapping effectiveness by vegetated buffers. ECOLOGICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 159: 106094, (2021).

  3. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detail levels)
    - at least 16Gb of RAM (64Gb preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder.
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  4. Supplementary data and code 1 for "Significant shifts in latitudinal optima...

    • figshare.com
    zip
    Updated Apr 1, 2024
    Cite
    Paulo Mateus Martins (2024). Supplementary data and code 1 for "Significant shifts in latitudinal optima of North American birds" (PNAS) [Dataset]. http://doi.org/10.6084/m9.figshare.24881544.v1
    Available download formats: zip
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Paulo Mateus Martins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Significant shifts in latitudinal optima of North American birds (PNAS)
    Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett

    Overview
    This file contains the raw 2022 release of the North American breeding bird survey dataset (Ziolkowski Jr et al. 2022), as well as the filtered version used in our paper and the code that generated it. We also included code for using BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.

    Reference
    D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5

    Detailed file description
    info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.
    dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). This object contains data frames created with the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (actual counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the "code1_build_long_wide_datasets" R script.
    code1_build_long_wide_datasets: This code filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).
    dat2_filtered_data_long: The filtered raw dataset in long form. This dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
    dat3_filtered_data_wide: The filtered raw dataset in its widest form. This dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
    code2_sampling coverage: This code determines how much of a bird distribution is covered by the BBS sampling extent (refer to Dataset S1). It is important to note that this script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share. The shapefiles can be requested directly at https://datazone.birdlife.org/species/requestdis
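
    To illustrate the subsetting step described above (keeping only the species listed in dataset S1 or S3), a minimal sketch in R is shown below. The object and file names are illustrative and assume dat2_filtered_data_long has been loaded and that the S1 table has a Species column.

      # Minimal sketch: restrict the long-format BBS data to the analyzed species.
      library(dplyr)
      load("dat2_filtered_data_long.RData")                  # provides dat2_filtered_data_long (assumed)
      analyzed_species <- read.csv("dataset_S1.csv")$Species  # species list from Dataset S1 (assumed file)
      dat_analysis <- dat2_filtered_data_long %>%
        filter(Species %in% analyzed_species)                # keep only species used in the paper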

  5. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 13, 2025
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations with this dataset include:
    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations with this dataset:
    - Identifying and removing unrealistic temperature estimates:
      - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles:
      - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    All zip files within this data release contain nested directories using .parquet files to store the data. The example_script_for_using_parquet.R contains example code for using the R arrow package to open and query the nested .parquet files.
    - "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files with the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX output is used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
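
    A minimal sketch of the kind of query described above, using the R arrow package on the nested .parquet partitions and the cloud-percentage formula given in the description. It assumes the contents of year_byscene=2023.zip have been extracted to a local folder named year_byscene=2023; the filtering thresholds are illustrative, not prescribed by the data release.

      # Minimal sketch: open the partitioned dataset and drop unreliable records.
      library(arrow)
      library(dplyr)
      ds <- open_dataset("year_byscene=2023")   # reads the nested tile_hv=... partitions
      clean <- ds %>%
        mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
        filter(percent_cloud_pixels < 0.5,      # drop heavily clouded scenes (threshold illustrative)
               wb_dswe1_pixels >= 10,           # require enough water pixels
               dp_dswe == 1) %>%                # deepest point classified as water
        collect()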

  6. Machine learning pipeline to train toxicity prediction model of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates for all compounds based on their smiles the molecular descriptors
    - chemopy-1.1 # used python package for descriptor calculation as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to calculate statistics to calculate Z-scores as used by the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

    05_Regression # Performing regression. Preparation of data by removing outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross validation and tuning of hyperparameters.

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output
    # early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
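
    As an illustration of the random forest regression step described above, a minimal R sketch is shown below. It is not the repository's Rforest.R or Rforest_CV.R; the learning-table path and the Zscore response column name are assumptions for illustration, and the hyperparameter values stand in for the ones tuned by cross validation.

      # Minimal sketch: fit a random forest regression on the Z-score learning table.
      library(randomForest)
      learn <- read.csv("04_ZScores/datastore/learning_data.csv")  # Z-scores plus molecular descriptors (assumed file)
      set.seed(42)
      rf <- randomForest(Zscore ~ ., data = learn,
                         ntree = 500,                                  # number of trees (tuned hyperparameter)
                         mtry  = max(1, floor((ncol(learn) - 1) / 3)), # variables tried at each split
                         importance = TRUE)
      varImpPlot(rf)                                                   # importance of regression variables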

  7. Cyclistic

    • kaggle.com
    zip
    Updated May 12, 2022
    Cite
    Salam Ibrahim (2022). Cyclistic [Dataset]. https://www.kaggle.com/datasets/salamibrahim/cyclistic
    Available download formats: zip (209748131 bytes)
    Dataset updated
    May 12, 2022
    Authors
    Salam Ibrahim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    **Introduction** This case study will be based on Cyclistic, a bike-sharing company in Chicago. I will perform the tasks of a junior data analyst to answer business questions. I will do this by following a process that includes the following phases: ask, prepare, process, analyze, share and act.

    Background Cyclistic is a bike-sharing company that operates 5,828 bikes across 692 docking stations. The company has been around since 2016 and separates itself from the competition by offering a variety of bike services, including assistive options. Lily Moreno is the director of the marketing team and will be the person receiving the insights from this analysis.

    Case Study and business task Lily Moreno's perspective on how to generate more income by marketing Cyclistic's services correctly includes converting casual riders (one-day passes and/or pay-per-ride customers) into annual riders with a membership. Annual riders are more profitable than casual riders according to the finance analysts. She would rather see a campaign converting casual riders into annual riders than campaigns targeting new customers. So her strategy as the manager of the marketing team is simply to maximize the number of annual riders by converting casual riders.

    In order to make a data-driven decision, Moreno needs the following insights:
    - A better understanding of how casual riders and annual riders differ
    - Why a casual rider would become an annual one
    - How digital media can affect the marketing tactics

    Moreno has directed me to the first question - how do casual riders and annual riders differ?

    Stakeholders: Lily Moreno (manager of the marketing team), the Cyclistic marketing team, and the executive team.

    Data sources and organization Data used in this report is made available and licensed by Motivate International Inc. Personal data is hidden to protect personal information. The data used are from the past 12 months (01/04/2021 – 31/03/2022) of the bike share dataset.

    By merging all 12 monthly bike share files provided, an extensive dataset of 5,400,000 rows was returned and included in this analysis.

    Data security and limitations: Personal information is secured and hidden to prevent unlawful use. Original files are backed up in folders and subfolders.

    Tools and documentation of cleaning process The tools used for data verification and data cleaning are Microsoft Excel and R programming. The original files made accessible by Motivate International Inc. are backed up in their original format and in separate files.

    Microsoft Excel is used to generally look through the dataset and get an overview of the content. I performed simple checks of the data by filtering, sorting, formatting and standardizing the data to make it easily mergeable. In Excel, I also changed data types to the right format, removed unnecessary data if it was incomplete or incorrect, created new columns to subtract and reformat existing columns, and deleted empty cells. These tasks are easily done in spreadsheets and provide an initial cleaning pass over the data.

    R will be used to perform queries of bigger datasets such as this one. R will also be used to create visualizations to answer the question at hand.

    Limitations Microsoft Excel has a limit of 1,048,576 rows, while the 12 months of data combined are over 5,500,000 rows. When combining the 12 months of data into one table/sheet, Excel is no longer efficient, so I switched over to R programming.
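
    A minimal sketch of the merging step described above, reading all 12 monthly CSV files into one table in R. The folder name and the started_at/ended_at column names are assumptions based on typical bike-share exports, not taken from this dataset's documentation.

      # Minimal sketch: combine the 12 monthly trip files into one data frame.
      library(readr)
      library(dplyr)
      files <- list.files("trip_data", pattern = "\\.csv$", full.names = TRUE)
      trips <- files %>%
        lapply(read_csv) %>%
        bind_rows()                                  # ~5.4 million rows, beyond Excel's row limit
      trips <- trips %>%
        mutate(ride_length = ended_at - started_at)  # example derived column (column names assumed)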

  8. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • dataone.org
    • +2 more
    zip
    Updated May 30, 2023
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Hospital for Sick Children
    Universidade de São Paulo
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
    Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson's correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the out-of-band probes' empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
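
    As an illustration of the replicate comparison metric described above (absolute beta-value difference between replicated samples), a minimal base-R sketch is shown below. The object and file names are illustrative assumptions: `betas` is a probes-by-samples matrix of normalized beta values and `pairs` maps each replicate to its partner.

      # Minimal sketch: per-probe mean absolute beta-value difference across replicate pairs.
      betas <- readRDS("betas_sesame2.rds")        # normalized beta values (assumed file)
      pairs <- read.csv("replicate_pairs.csv")     # columns: sample1, sample2 (assumed)
      abs_diff <- sapply(seq_len(nrow(pairs)), function(i) {
        abs(betas[, pairs$sample1[i]] - betas[, pairs$sample2[i]])
      })
      mean_abs_diff <- rowMeans(abs_diff)          # one value per probe, averaged over pairs
      summary(mean_abs_diff)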

  9. Junco-Cowbird Dataset/R Code

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 16, 2022
    Cite
    Hauber, Mark E.; Turner, Abbigail; Reichard, Dustin G (2022). Junco-Cowbird Dataset/R Code [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000304934
    Dataset updated
    Sep 16, 2022
    Authors
    Hauber, Mark E.; Turner, Abbigail; Reichard, Dustin G
    Description

    This dataset encompasses data on junco nest survival, cowbird parasitism, and the removal of cowbird eggs from junco nests. Also included is the R code used in our analysis.

  10. Additional file 1 of Correction: Holomics ‑ a user‑friendly R shiny...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Aug 15, 2024
    Cite
    Katharina Munk; Daria Ilina; Lisa Ziemba; Günter Brader; Eva M. Molin (2024). Additional file 1 of Correction: Holomics ‑ a user‑friendly R shiny application for multi‑omics data integration and analysis [Dataset]. http://doi.org/10.6084/m9.figshare.26758517.v1
    Available download formats: zip
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Katharina Munk; Daria Ilina; Lisa Ziemba; Günter Brader; Eva M. Molin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. The additional file (Additional file 1.zip) is a compressed folder containing four .csv files. Table S1: Targeted metabolite data, Table S2: Microbiomics ASV count table resulting from 16S amplicon sequencing, Table S3: Microbiomics ASV count table resulting from ITS amplicon sequencing, Table S4: Transcriptomics read count table (transposed format), and Table S5: Labels and class information including the color code of the analyzed samples. Besides being the data source for the present case study, these data tables can be used as test datasets after removal of the table header (first line). We highly recommend opening the files in a text editor of your choice and removing the headers there; when doing this step in Excel, an error may occur.
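
    As an alternative to editing the files by hand, a minimal R sketch of the header-removal step described above is shown below; the file name is illustrative.

      # Minimal sketch: skip the extra header line when reading one of the tables.
      metabolites <- read.csv("Table S1.csv", skip = 1)  # file name assumed for illustration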

  11. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)
    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
    Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as described for LScD Version 2 below.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started
    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
    LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.
    LScD is an ordered list of words from texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC
    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and names (also the position) of fields, should be taken into account to apply our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-process abstracts of the LSC.
    1. Removing punctuation and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid considering the same words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united as one word. The list of prefixes united for this research is listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted to "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling of abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining characters "-" are replaced by space.
    6. Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop words removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a' etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.
    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file "LScD.csv".

    The Organisation of the LScD
    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.
    Number of Documents Containing the Word: In this context, a binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s in the entire corpus.
    Number of Appearances in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code
    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
    Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: It contains all abstracts after the pre-processing steps defined in step 4.
    DTM: It is the Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.
    The code can be used by:
    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
    2. Open the LScD_Creation.R script
    3. Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
    4. Run the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package Text Mining in R," accessible online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
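
    A minimal sketch of the core pre-processing steps described above (lowercasing, punctuation removal, stop word removal and stemming) using the tm package is shown below. It is not the authors' LScD_Creation.R, it omits the prefix-uniting and word-substitution steps, and the two example abstracts are invented for illustration.

      # Minimal sketch: simplified text pre-processing with tm.
      library(tm)
      abstracts <- c("Pre-processing of z-scores in 21st century corpora.",
                     "A well-known chi-square test for CO2 data.")
      corp <- VCorpus(VectorSource(abstracts))
      corp <- tm_map(corp, content_transformer(tolower))                                 # lowercase
      corp <- tm_map(corp, content_transformer(function(x) gsub("[^a-z0-9-]", " ", x)))  # drop punctuation, keep "-"
      corp <- tm_map(corp, content_transformer(function(x) gsub("-", " ", x)))           # then remove "-"
      corp <- tm_map(corp, removeWords, stopwords("english"))                            # tm's English stop word list
      corp <- tm_map(corp, stemDocument)                                                 # Porter stemming
      dtm <- DocumentTermMatrix(corp)                                                    # document-term matrix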

  12. Occurrence Record Dataset from "Depth Matters for Marine Biodiversity"

    • datasetcatalog.nlm.nih.gov
    Updated Aug 14, 2024
    Cite
    Rahbek, Carsten; Owens, Hannah (2024). Occurrence Record Dataset from "Depth Matters for Marine Biodiversity" [Dataset]. http://doi.org/10.5281/zenodo.13318673
    Dataset updated
    Aug 14, 2024
    Authors
    Rahbek, Carsten; Owens, Hannah
    Description

    This is the final occurrence record dataset produced for the manuscript "Depth Matters for Marine Biodiversity". Detailed methods for the creation of the dataset, below, have been excerpted from Appendix I: Extended Methods. Detailed citations for the occurrence datasets from which these data were derived can also be found in Appendix I of the manuscript.

    We assembled a list of all recognized species of fishes from the orders Scombriformes (sensu Betancur-R et al., 2017), Gadiformes, and Beloniformes by accessing FishBase (Boettiger et al., 2012; Froese & Pauly, 2017) and the Ocean Biodiversity Information System (OBIS; OBIS, 2022; Provoost & Bosch, 2019) through queries in R (R Core Team, 2021). Species were considered Atlantic if their FishBase distribution or occurrence records on OBIS included any area within the Atlantic or Mediterranean major fishing regions as defined by the Food and Agriculture Organization of the United Nations (FAO Regions 21, 27, 31, 34, 37, 41, 47, and 48; FAO, 2020). The database query script can be found on the project code repository (https://github.com/hannahlowens/3DFishRichness/blob/main/1_OccurrenceSearch.R). We then curated the list of names to resolve discrepancies in taxonomy and known distributions through comparison with the Eschmeyer Catalog of Fishes (Eschmeyer & Fricke, 2015), accessed in September of 2020, as our ultimate taxonomic authority. The resulting list of species was then mapped onto the Global Biodiversity Information Facility's backbone taxonomy (Chamberlain et al., 2021; GBIF, 2020a) to ensure taxonomic concurrence across databases (Appendix I Table 1). The final taxonomic list was used to download occurrence records from OBIS (OBIS, 2022) and GBIF (GBIF, 2020b) in R through robis and occCite (Chamberlain et al., 2020; Provoost & Bosch, 2019; Owens et al., 2021). Once the resulting data were mapped and curated to remove records with putatively spurious coordinates, under-sampled regions and species were augmented with data from publicly available digital museum collection databases not served through OBIS or GBIF, as well as a literature search. For each species, duplicate points were removed from two- and three-dimensional species occurrence datasets separately, and inaccurate depth records were removed from 3D datasets. Inaccuracy was determined based on extreme statistical outliers (values greater than 2 or less than -2 when occurrence depths were centered and scaled), depth ranges that exceeded bathymetry at occurrence coordinates, and occurrences far outside known depth ranges compared to information from FishBase, Eschmeyer's Catalog of Fishes, and congeneric depth ranges in the dataset. Finally, for datasets with more than 20 points remaining after cleaning, occurrence data were downsampled to the resolution of the environmental data; that is, to 1 point per 1 degree grid cell in the 2D dataset, and to one point per depth slice per 1 degree grid cell in the 3D dataset. Counts of raw and cleaned records for each species can be found in Appendix 1 Table 1.

    References:
    Betancur-R, R., Wiley, E. O., Arratia, G., Acero, A., Bailly, N., Miya, M., Lecointre, G., & Ortí, G. (2017). Phylogenetic classification of bony fishes. BMC Evolutionary Biology, 17(1), 162. https://doi.org/10.1186/s12862-017-0958-3
    Boettiger, C., Lang, D. T., & Wainwright, P. C. (2012). rfishbase: exploring, manipulating and visualizing FishBase data from R. Journal of Fish Biology, 81(6), 2030–2039. https://doi.org/10.1111/j.1095-8649.2012.03464.x
    Chamberlain, S., Barve, V., McGlinn, D., Oldoni, D., Desmet, P., Geffert, L., & Ram, K. (2021). rgbif: Interface to the Global Biodiversity Information Facility API. https://CRAN.R-project.org/package=rgbif
    Eschmeyer, W. N., & Fricke, R. (2015). Taxonomic checklist of fish species listed in the CITES Appendices and EC Regulation 338/97 (Elasmobranchii, Actinopteri, Coelacanthi, and Dipneusti, except the genus Hippocampus). Catalog of Fishes, Electronic Version. Accessed September, 2020. https://www.calacademy.org/scientists/projects/eschmeyers-catalog-of-fishes
    FAO. (2020). FAO Major Fishing Areas. United Nations Fisheries and Aquaculture Division. https://www.fao.org/fishery/en/collection/area
    Froese, R., & Pauly, D. (2017). FishBase. Accessed September, 2022. www.fishbase.org
    GBIF.org. (2020a). GBIF Backbone Taxonomy. Accessed September, 2020. GBIF.org
    GBIF.org. (2020b). GBIF Occurrence Download. Accessed November, 2020. https://doi.org/10.15468
    OBIS. (2020). Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. Accessed November, 2020. www.obis.org
    Owens, H. L., Merow, C., Maitner, B. S., Kass, J. M., Barve, V., & Guralnick, R. P. (2021). occCite: Tools for querying and managing large biodiversity occurrence datasets. Ecography, 44(8), 1228–1235. https://doi.org/10.1111/ecog.05618
    Provoost, P., & Bosch, S. (2019). robis: R Client to access data from the OBIS API. https://cran.r-project.org/package=robis
    R Core Team. (2021). R: A Language and Environment for Statistical Computing. https://www.R-project.org/
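
    As an illustration of the depth-outlier screen described above (flagging records whose centered and scaled depth is an extreme statistical outlier, |z| > 2), a minimal base-R sketch is shown below. The data frame and column names are illustrative assumptions, not the manuscript's objects.

      # Minimal sketch: remove extreme depth outliers and duplicate records for one species.
      occ <- read.csv("species_occurrences_3d.csv")   # one species' 3D occurrence records (assumed file)
      z <- as.numeric(scale(occ$depth))               # center and scale the recorded depths
      occ_clean <- occ[!is.na(z) & abs(z) <= 2, ]     # drop extreme statistical outliers
      occ_clean <- unique(occ_clean)                  # remove duplicate points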

  13. R K.v2i.coco Remove_hashmarkerli Dataset

    • universe.roboflow.com
    zip
    Updated Aug 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    afb (2025). R K.v2i.coco Remove_hashmarkerli Dataset [Dataset]. https://universe.roboflow.com/afb-nnhqq/r-k.v2i.coco-remove_hashmarkerli-bjhdy/model/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    afb
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Variables measured
    YARD12
    Description

    R K.v2i.coco Remove_hashmarkerli

    ## Overview
    
    R K.v2i.coco Remove_hashmarkerli is a dataset for computer vision tasks - it contains YARD12 annotations for 263 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
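As an illustration, a minimal R sketch for inspecting a COCO export of this dataset; the `_annotations.coco.json` file name follows Roboflow's usual COCO export layout and is an assumption, not something documented here:

```r
# Hypothetical sketch: inspect a Roboflow COCO export in R. The file name
# "train/_annotations.coco.json" is assumed from Roboflow's typical export
# layout, not taken from this dataset's documentation.
library(jsonlite)

coco <- fromJSON("train/_annotations.coco.json")

# Number of images and annotations in the split
nrow(coco$images)
nrow(coco$annotations)

# Annotation counts per category (e.g. the YARD12 class)
cat_names <- setNames(coco$categories$name, coco$categories$id)
table(cat_names[as.character(coco$annotations$category_id)])
```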
    
      ## License
    
  This dataset is available under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).
    
  14. Spatio-temporal dynamics of attacks around deaths of wolves: A statistical...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Feb 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Grente, Oksana; Opitz, Thomas; Duchamp, Christophe; Drouet-Hoguet, Nolwenn; Chamaillé-Jammes, Simon; Gimenez, Olivier (2025). Spatio-temporal dynamics of attacks around deaths of wolves: A statistical assessment of lethal control efficiency in France [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12772867
    Explore at:
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement
    Centre National de la Recherche Scientifique
    French Biodiversity Agency
    Authors
    Grente, Oksana; Opitz, Thomas; Duchamp, Christophe; Drouet-Hoguet, Nolwenn; Chamaillé-Jammes, Simon; Gimenez, Olivier
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    This repository contains the supplementary materials (Supplementary_figures.docx, Supplementary_tables.docx) of the manuscript: "Spatio-temporal dynamics of attacks around deaths of wolves: A statistical assessment of lethal control efficiency in France". This repository also provides the R codes and datasets necessary to run the analyses described in the manuscript.

    The R datasets with suffix "_a" have anonymous spatial coordinates to respect confidentiality. Therefore, the preliminary preparation of the data is not provided in the public codes. These datasets, all geolocated and necessary to the analyses, are:

    Attack_sf_a.RData: 19,302 analyzed wolf attacks on sheep

    ID: unique ID of the attack

    DATE: date of the attack

    PASTURE: the related pasture ID from "Pasture_sf_a" where the attack is located

    STATUS: column resulting from the preparation and the attribution of attacks to pastures (part 2.2.4 of the manuscript); not shown here to respect confidentiality

    Pasture_sf_a.RData: 4987 analyzed pastures grazed by sheep

    ID: unique ID of the pasture

    CODE: Official code in the pastoral census

    FLOCK_SIZE: maximum annual number of sheep grazing in the pasture

    USED_MONTHS: months for which the pasture is grazed by sheep

    Removal_sf_a.RData: 232 analyzed single wolf removal or groups of wolf removals

    ID: unique ID of the removal

OVERLAP: whether the record is a single removal ("non-interacting" in the manuscript; "NO" here) or part of a group ("interacting" in the manuscript; here "SIMULTANEOUS" for removals occurring during the same operation, or "NON-SIMULTANEOUS" otherwise).

    DATE_MIN: date of the single removal or date of the first removal of a group

    DATE_MAX: date of the single removal or date of the last removal of a group

    CLASS: administrative type of the removal according to definitions from 2.1 part of the manuscript

    SEX: sex or sexes of the removed wolves if known

    AGE: class age of the removed wolves if known

BREEDER: breeding status of the removed female wolves, "Yes" for a female breeder, "No" for a female non-breeder. Males are "No" by default when necropsied; NA indicates dead individuals that were not found and therefore could not be necropsied.

    SEASON: season of the removal, as defined in part 2.3.4 of the manuscript

    MASSIF: mountain range attributed to the removal, as defined in part 2.3.4 of the manuscript

    Area_to_exclude_sf_a.RData: one row for each mountain range, corresponding to the area where removal controls of the mountain range could not be sampled, as defined in part 2.3.6 of the manuscript

    These datasets were used to run the following analyses codes:

    Code 1 : The file Kernel_wolf_culling_attacks_p.R contains the before-after analyses.

We start by delimiting the spatio-temporal buffer for each row of the "Removal_sf_a.RData" dataset (a minimal sketch of this step is given after the Code 1 steps below).

    We identify the attacks from "Attack_sf_a.RData" within each buffer, giving the data frame "Buffer_df" (one row per attack)

    We select the pastures from "Pasture_sf_a.RData" within each buffer, giving the data frame "Buffer_sf" (one row per removal)

    We calculate the spatial correction

    We spatially slice each buffer into 200 rings, giving the data frame "Ring_sf" (one row per ring)

    We add the total pastoral area of the ring of the attack ("SPATIAL_WEIGHT"), for each attack of each buffer, within Buffer_df ("Buffer_df.RData")

    We calculate the pastoral correction

We create the pastoral matrix for each removal, giving a matrix of 200 rows (one for each ring) and 180 columns (one for each day, 90 days before the removal date and 90 days after the removal date), with the total pastoral area in use by sheep for each corresponding cell of the matrix (one element per removal, "Pastoral_matrix_lt.RData")

    We simulate, for each removal, the random distribution of the attacks from "Buffer_df.RData" according to "Pastoral_matrix_lt.RData". The process is done 100 times (one element per simulation, "Buffer_simulation_lt.RData").

    We estimate the attack intensities

    We classified the removals into 20 subsets, according to part 2.3.4 of the manuscript ("Variables_lt.RData") (one element per subset)

    We perform, for each subset, the kernel estimations with the observed attacks ("Kernel_lt.RData"), with the simulated attacks ("Kernel_simulation_lt.RData") and we correct the first kernel computations with the second ("Kernel_controlled_lt.RData") (one element per subset).

    We calculate the trend of attack intensities, for each subset, that compares the total attack intensity before and after the removals (part 2.3.5 of the manuscript), giving "Trends_intensities_df.RData". (one row per subset)

We calculate the trend of attack intensities, for each subset, along the spatial axis, three times, once for each time analysis scale. This gives "Shift_df" (one row per ring and per time analysis scale).
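A minimal sketch of the buffer-selection step referenced above, assuming sf objects named after the .RData files, Date-class date columns, and an illustrative 10 km radius (the published Kernel_wolf_culling_attacks_p.R defines the actual buffer dimensions):

```r
# Illustrative sketch of the Code 1 buffer step: select the attacks that fall
# within a spatio-temporal buffer around each removal. The 10 km radius, the
# object names, and the Date-class columns are assumptions for illustration.
library(sf)
library(dplyr)

load("Attack_sf_a.RData")   # assumed to contain an sf object Attack_sf_a
load("Removal_sf_a.RData")  # assumed to contain an sf object Removal_sf_a

attacks_in_buffer <- function(removal, attacks, radius_m = 10000, window_days = 90) {
  # Spatial condition: attacks within radius_m of the removal location
  # (distances in metres; assumes a projected CRS or s2 geographic coordinates)
  in_space <- st_is_within_distance(attacks, removal, dist = radius_m, sparse = FALSE)[, 1]
  # Temporal condition: within window_days before DATE_MIN or after DATE_MAX
  in_time <- attacks$DATE >= (removal$DATE_MIN - window_days) &
             attacks$DATE <= (removal$DATE_MAX + window_days)
  attacks[in_space & in_time, ]
}

# One row per (removal, attack) pair, analogous to "Buffer_df"
Buffer_df <- bind_rows(lapply(seq_len(nrow(Removal_sf_a)), function(i) {
  hits <- attacks_in_buffer(Removal_sf_a[i, ], Attack_sf_a)
  if (nrow(hits) == 0) return(NULL)
  cbind(REMOVAL_ID = Removal_sf_a$ID[i], st_drop_geometry(hits))
}))
```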

    Code 2 : The file Control_removals_p.R contains the control-impact analyses.

It starts with the simulation of 100 removal control sets ("Control_sf_lt_a.RData") from the real set of removals ("Removal_sf_a.RData"), which is done with the function "Control_fn" (l. 92).

    The rest of the analyses follows the same process as in the first code "Kernel_wolf_culling_attacks_p.R", in order to apply the before-after analyses to each control set. All objects have the same structure as before, except that they are now a list, with one resulting element per control set. These objects have "control" in their names (not to be confused with "controlled" which refers to the pastoral correction already applied in the first code).

    The code is also applied again, from l. 92 to l. 433, this time for the real set of removals (l. 121) - with "Simulated = FALSE" (l. 119). We could not simply use the results from the first code because the set of removals is restricted to removals attributed to mountain ranges only. There are 2 resulting objects: "Kernel_real_lt.RData" (observed real trends) and "Kernel_controlled_real_lt.RData" (real trends corrected for pastoral use).

    The part of the code from line 439 to 524 relates to the calculations of the trends (for the real set and the control sets), as in the first code, giving "Trends_intensities_real_df.RData" and "Trends_intensities_control_lt.RData".

The part of the code from line 530 to 588 relates to the calculation of the 95% confidence intervals and the means of the intensity trends for each subset, based on the results of the 100 control sets (Trends_intensities_mean_control_df.RData, Trends_intensities_CImin_control_df.RData and Trends_intensities_CImax_control_df.RData). These are used to test the significance of the real trends. This comparison is done immediately afterwards (l. 595-627) and gives the data frame "Trends_comparison_df.RData"; a minimal sketch of the comparison follows.
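A minimal sketch of that comparison, with illustrative object and column names rather than the ones used in Control_removals_p.R:

```r
# Illustrative sketch of the Code 2 significance test: compare the real trend
# of each removal subset with the 2.5% and 97.5% quantiles of the 100 control
# trends. Object and column names are assumptions, not those of the real code.
library(dplyr)

# Toy inputs with the assumed structure (replace with the real objects):
set.seed(1)
trends_control <- data.frame(
  SUBSET = rep(paste0("subset_", 1:2), each = 100),
  TREND  = rnorm(200)
)
trends_real <- data.frame(SUBSET = paste0("subset_", 1:2), TREND = c(-2.5, 0.3))

control_summary <- trends_control %>%
  group_by(SUBSET) %>%
  summarise(
    MEAN_CONTROL = mean(TREND),
    CI_MIN = quantile(TREND, 0.025),
    CI_MAX = quantile(TREND, 0.975),
    .groups = "drop"
  )

Trends_comparison_df <- trends_real %>%
  left_join(control_summary, by = "SUBSET") %>%
  mutate(SIGNIFICANT = TREND < CI_MIN | TREND > CI_MAX)
```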

    Code 3 : The file Figures.R produces part of the figures from the manuscript:

    "Dataset map": figure 1

    "Buffer": figure 2 (then pasted in powerpoint)

    "Kernel construction": figure 5 (then pasted in powerpoint)

    "Trend distributions": figure 7

    "Kernels": part of figures 10 and S2

    "Attack shifts": figure 9 and S1

    "Significant": figure 8

  15. Refined 3D Hepatic Vessel Skeleton Dataset from the Medical Segmentation...

    • investigacion.usc.gal
    Updated 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
Nodar-Corral, Lois; Fdez-Gonzalez, Marcos; Fdez-Vidal, Xose R.; Otero Freiría, Roque; Velo Outumuro, Noelia; Comesaña Figueroa, Enrique (2025). Refined 3D Hepatic Vessel Skeleton Dataset from the Medical Segmentation Decathlon (Task08_HepaticVessel) [Dataset]. https://investigacion.usc.gal/documentos/688b602417bb6239d2d48e9f
    Explore at:
    Dataset updated
    2025
    Authors
Nodar-Corral, Lois; Fdez-Gonzalez, Marcos; Fdez-Vidal, Xose R.; Otero Freiría, Roque; Velo Outumuro, Noelia; Comesaña Figueroa, Enrique
    Description

    Hepatic Vessel Skeletonization Dataset (Task08_HepaticVessel)

    Overview

This dataset is derived from the hepatic vessel task of the Medical Segmentation Decathlon (MSD Task 8). It comprises modified vessel segmentations and vessel skeletons that were initially generated via automatic 3D thinning and subsequently refined through manual revision. Both the skeletons and the labels have been refined to provide a high-quality ground truth for evaluating skeletonization algorithms.

    Dataset Description

    Modifications from Original MSD Task 8

    Label modifications: Using 3D Slicer, vessel segmentations were refined to remove vessels not located within the liver parenchyma, large segmentations of the inferior vena cava and aorta that were inconsistent across the dataset, and anatomical structures not relevant for hepatic vessel skeletonization analysis.

    Skeleton revision: The manual revision process addressed broken and missing branches, incorrect or ambiguous vessel representations, and redundant skeleton points generated by the automatic thinning algorithm.

    Dataset Characteristics

    The dataset covers various anatomical aspects including vessel representation up to the third level of ramification, anatomically diverse hepatic vessel structures, and consistent spatial resolution and coordinate systems.

    Level of ramification is defined as the level of branching in a vessel tree where branches extend at least three levels deep from the main trunk following anatomical hierarchy.

    File Structure

Task08_HepaticVessel/
├── 0_README.md                    # This documentation file
├── labelsTr/                      # Modified vessel segmentations (NIfTI format)
│   ├── hepaticvessel_001_mod.nii.gz
│   ├── hepaticvessel_002_mod.nii.gz
│   ├── hepaticvessel_004_mod.nii.gz
│   ├── hepaticvessel_005_mod.nii.gz
│   ├── hepaticvessel_007_mod.nii.gz
│   ├── hepaticvessel_008_mod.nii.gz
│   ├── hepaticvessel_010_mod.nii.gz
│   ├── hepaticvessel_011_mod.nii.gz
│   ├── hepaticvessel_013_mod.nii.gz
│   ├── hepaticvessel_016_mod.nii.gz
│   └── hepaticvessel_018_mod.nii.gz
└── skeletons/                     # Manually revised skeletons (JSON format)
    ├── hepaticvessel_001_LNC.json
    ├── hepaticvessel_001_NVO.json
    ├── hepaticvessel_001_ROF.json
    ├── hepaticvessel_002_LNC.json
    ├── hepaticvessel_002_NVO.json
    ├── hepaticvessel_004_LNC.json
    ├── hepaticvessel_004_NVO.json
    ├── hepaticvessel_005_LNC.json
    ├── hepaticvessel_005_NVO.json
    ├── hepaticvessel_007_LNC.json
    ├── hepaticvessel_007_NVO.json
    ├── hepaticvessel_008_NVO.json
    ├── hepaticvessel_010_NVO.json
    ├── hepaticvessel_011_NVO.json
    ├── hepaticvessel_013_NVO.json
    ├── hepaticvessel_016_NVO.json
    └── hepaticvessel_018_LNC.json

    Naming Convention

    Vessel Segmentations (labelsTr/)

Pattern: hepaticvessel_[3-digit-number]_mod.nii.gz
Example: hepaticvessel_008_mod.nii.gz

    Skeletons (skeletons/)

Pattern: hepaticvessel_[3-digit-number]_[ANNOTATOR_INITIALS].json
Example: hepaticvessel_008_LNC.json

    Annotator Codes

LNC: Lois Nodar Corral
NVO: Noelia Velo Outumuro
ROF: Roque Otero Freiría

    Data Format

    Vessel Segmentations

Format: NIfTI (.nii.gz)
Values: binary masks (0 = background, 1 = vessel)
Coordinate system: RAS (Right-Anterior-Superior)
Spatial resolution: variable (inherited from the original MSD dataset)

    Skeletons

Format: JSON arrays containing 3D coordinates
Coordinate system: voxel coordinates (matching the corresponding segmentation)
Structure (JSON):
[
  [x1, y1, z1],
  [x2, y2, z2],
  [x3, y3, z3],
  ...
]

    Each coordinate triplet [x, y, z] represents a voxel position in the 3D volume where the skeleton passes through.
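For reference, a minimal R sketch of how a skeleton and its corresponding segmentation might be loaded and cross-checked, assuming the jsonlite and oro.nifti packages and 0-based voxel indices in the JSON (the indexing convention should be verified against the actual data):

```r
# Minimal sketch (assumptions: jsonlite and oro.nifti are installed, and the
# skeleton JSON stores 0-based voxel indices; verify the indexing convention
# before relying on this check).
library(jsonlite)
library(oro.nifti)

skel <- fromJSON("skeletons/hepaticvessel_008_NVO.json")    # N x 3 matrix of [x, y, z]
mask <- readNIfTI("labelsTr/hepaticvessel_008_mod.nii.gz")  # binary vessel mask
arr  <- as(mask, "array")                                   # underlying voxel array

# Convert assumed 0-based voxel coordinates to R's 1-based array indices
idx <- skel + 1

# Fraction of skeleton points that fall inside the vessel mask
inside <- arr[cbind(idx[, 1], idx[, 2], idx[, 3])] == 1
mean(inside)
```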

    Dataset Statistics

• Total cases: 11 vessel segmentations
• Total skeletons: 17 manually revised skeletons
• Multiple annotations: some cases have multiple annotator versions for validation
• Coverage: cases 001, 002, 004, 005, 007, 008, 010, 011, 013, 016, 018

    File Correspondence

| Case | Segmentation File | Available Skeletons |
|------|-------------------|---------------------|
| 001 | hepaticvessel_001_mod.nii.gz | LNC, NVO, ROF |
| 002 | hepaticvessel_002_mod.nii.gz | LNC, NVO |
| 004 | hepaticvessel_004_mod.nii.gz | LNC, NVO |
| 005 | hepaticvessel_005_mod.nii.gz | LNC, NVO |
| 007 | hepaticvessel_007_mod.nii.gz | LNC, NVO |
| 008 | hepaticvessel_008_mod.nii.gz | NVO |
| 010 | hepaticvessel_010_mod.nii.gz | NVO |
| 011 | hepaticvessel_011_mod.nii.gz | NVO |
| 013 | hepaticvessel_013_mod.nii.gz | NVO |
| 016 | hepaticvessel_016_mod.nii.gz | NVO |
| 018 | hepaticvessel_018_mod.nii.gz | LNC |

    Usage Guidelines

1. Correspondence: each skeleton file corresponds to a vessel segmentation with the same case number
2. Coordinate system: skeleton coordinates are in the voxel space of the corresponding NIfTI volume
3. Validation: cases with multiple annotators can be used for inter-observer variability studies
4. Citation: please cite the original Medical Segmentation Decathlon when using this dataset

    Software Tools

    The manual revision was performed using custom Python tools for skeleton visualization and editing. The annotation software used for skeleton revision is available in a separate GitHub repository (link to be provided upon publication). This tool provides an interactive 3D visualization environment that allows for precise skeleton editing, branch correction, and quality validation.

The Python-based annotation tool, available at https://github.com/Removirt/skeleton-viewer, features interactive 3D visualization of vessel segmentations and skeletons, point-by-point skeleton editing capabilities, branch connection and disconnection tools, real-time validation of topological correctness, and multi-platform compatibility (Windows, macOS, Linux).

    Dataset Access

    This dataset is publicly available through Zenodo. The complete dataset including all vessel segmentations, manually revised skeletons, and documentation can be downloaded from: https://doi.org/10.5281/zenodo.15729285

    Citation

This dataset:
Nodar-Corral, L., Fdez-Gonzalez, M., Fdez-Vidal, X. R., Otero Freiría, R., Velo Outumuro, N., & Comesaña Figueroa, E. (2025). Refined 3D Hepatic Vessel Skeleton Dataset from the Medical Segmentation Decathlon (Task08_HepaticVessel) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15729285

Original Medical Segmentation Decathlon:
Simpson, A. L., Antonelli, M., Bakas, S., et al. (2019). A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063.

BibTeX Format

```bibtex
@dataset{hepatic_vessel_skeletons_2025,
  author    = {Nodar-Corral, L. and Fdez-Gonzalez, M. and Fdez-Vidal, X. R. and Otero Freiría, R. and Velo Outumuro, N. and Comesaña Figueroa, E.},
  title     = {Refined 3D Hepatic Vessel Skeleton Dataset from the Medical Segmentation Decathlon (Task08_HepaticVessel)},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15729285},
  url       = {https://doi.org/10.5281/zenodo.15729285}
}

@article{simpson2019large,
  title   = {A large annotated medical image dataset for the development and evaluation of segmentation algorithms},
  author  = {Simpson, Amber L and Antonelli, Michela and Bakas, Spyridon and others},
  journal = {arXiv preprint arXiv:1902.09063},
  year    = {2019}
}
```

    Contact

For questions about this dataset or the annotation methodology, please contact the first author at lois.nodar.corral@usc.es or loisnodar@gmail.com, or any of the other authors through their ORCID correspondence details.

  16. Simple download service (Atom) of the dataset: Generators linked to category...

    • data.europa.eu
    unknown
    Updated Apr 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2022). Simple download service (Atom) of the dataset: Generators linked to category T1 servitude in Seine-et-Marne [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-c38194ef-3a0a-4bc2-b133-903295ec4cc7
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Apr 22, 2022
    Description

These are easements concerning properties bordering railways, established in areas defined by the Act of 15 July 1845 on the Police of Railways and by Article 6 of the Decree of 30 October 1935, as amended, creating visibility easements on public roads, namely:

- prohibition on erecting any construction, other than a fence wall, within a distance of two metres from a railway (art. 5 of the Law of 15 July 1845);
- prohibition, without prior authorisation, on excavations within a zone equal in width to the vertical height of a railway embankment of more than three metres, measured from the foot of the slope (art. 6 of the Law of 15 July 1845);
- prohibition on establishing thatch coverings, stacks of straw and hay, and any other deposit of flammable materials at a distance of less than 20 metres from a railway served by steam engines, measured from the foot of the slope (art. 7 of the Law of 15 July 1845);
- prohibition on depositing stones or non-flammable objects, without prior prefectural authorisation, less than five metres from a railway (art. 8 of the Law of 15 July 1845);
- visibility easements at the crossing of a public road and a railway (art. 6 of the Decree-Law of 30 October 1935 and art. R. 114-6 of the Highway Code), defined by a clearance plan drawn up by the authority managing the highway, which may include, as the case may be, in accordance with Article 2 of the decree:
  • the obligation to remove fence walls or replace them with railings, to remove obstructive plantings, and to lower and keep the ground and any superstructure at a level at most equal to the level determined by the above-mentioned clearance plan;
  • the absolute prohibition on building, placing fences, filling, planting or making any installation above the level set by the clearance plan.

Texts in force: Law of 15 July 1845 on the Railway Police, Title I: measures relating to the conservation of railways (Articles 1 to 11); Highway Code (created by Act No. 89-413 and Decree No. 89-631), in particular the following articles:

- L. 123-6 and R. 123-3 relating to alignment on national roads;
- L. 114-1 to L. 114-6 relating to visibility easements at grade crossings;
- R. 131-1 et seq. and R. 141-1 et seq. for the implementation of clearance plans on departmental or municipal roads.

The linear entities of this dataset relate to the use of certain resources and equipment; they affect land use. As the easements are collected from third parties, DDT-77 cannot guarantee the completeness and accuracy of the transfer of these easements onto a large-scale map.

  17. Data from: SGS-LTER Bouteloua gracilis Removal Experiment Vegetation Density...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). SGS-LTER Bouteloua gracilis Removal Experiment Vegetation Density Data on the Central Plains Experimental Range, Nunn, Colorado, USA 1997-2008 [Dataset]. https://catalog.data.gov/dataset/sgs-lter-bouteloua-gracilis-removal-experiment-vegetation-density-data-on-the-central-1997-e461a
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Area covered
    Colorado, United States, Nunn
    Description

This data package was produced by researchers working on the Shortgrass Steppe Long Term Ecological Research (SGS-LTER) Project, administered at Colorado State University. Long-term datasets and background information (proposals, reports, photographs, etc.) on the SGS-LTER project are contained in a comprehensive project collection within the Digital Collections of Colorado (http://digitool.library.colostate.edu/R/?func=collections&collection_id=3429). The data table and associated metadata document, which is generated in Ecological Metadata Language, may be available through other repositories serving the ecological research community and represent components of the larger SGS-LTER project collection. Six sites approximately 6 km apart were selected at the Central Plains Experimental Range in 1997. Within each site, there was a pair of adjacent ungrazed and moderately summer grazed (40-60% removal of annual aboveground production by cattle) locations. Grazed locations had been grazed from 1939 to present and ungrazed locations had been protected from 1991 to present by the establishment of exclosures. Within grazed and ungrazed locations, all tillers and root crowns of B. gracilis were removed from two treatment plots (3 m x 3 m), with all other vegetation undisturbed. Two control plots were established adjacent to the treatment plots. Plant density was measured annually by species in a fixed 1 m x 1 m quadrat in the center of treatment and control plots. For clonal species, an individual plant was defined as a group of tillers connected by a crown (Coffin & Lauenroth 1988, Fair et al. 1999). Seedlings were counted as separate individuals. In the same quadrat, basal cover by species, bare soil, and litter were estimated annually using a point frame. A total of 40 points were read from four locations halfway between the center point and corners of the 1 m x 1 m quadrat. Density was measured from 1998 to 2005 and cover from 1997 to 2006. All measurements were taken in late June/early July. Resources in this dataset:
Resource Title: Website Pointer to html file. File Name: Web Page. URL: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sgs&identifier=702 (webpage with information and links to data files for download)
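As a worked illustration of the sampling design described above, a hypothetical R sketch that converts point-frame hits (40 points per quadrat) to percent basal cover and sums counts to density per square metre; all column names and species codes are placeholders, not those of the SGS-LTER files:

```r
# Hypothetical sketch: summarise point-frame basal cover and plant density for
# data of this general shape. Column names (YEAR, PLOT, SPECIES, HITS, COUNT)
# and species codes are illustrative only.
library(dplyr)

point_frame <- data.frame(
  YEAR = 1997, PLOT = "control_1", SPECIES = c("BOGR", "BUDA"), HITS = c(6, 2)
)
density_counts <- data.frame(
  YEAR = 1998, PLOT = "control_1", SPECIES = c("BOGR", "BUDA"), COUNT = c(14, 3)
)

# Basal cover: 40 point-frame readings per 1 m x 1 m quadrat
cover_pct <- point_frame %>%
  group_by(YEAR, PLOT, SPECIES) %>%
  summarise(COVER_PCT = 100 * sum(HITS) / 40, .groups = "drop")

# Density: individuals counted in a 1 m x 1 m quadrat (already per m^2)
density_m2 <- density_counts %>%
  group_by(YEAR, PLOT, SPECIES) %>%
  summarise(DENSITY_M2 = sum(COUNT), .groups = "drop")
```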

  18. Data from: Male responses to sperm competition risk when rivals vary in...

    • researchdata.edu.au
    • datadryad.org
    Updated 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DRYAD
    The University of Western Australia
    Authors
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
    Description

Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

Files included:

• Lymbery et al 2018 Full dataset (Lymbery et al Full Dataset.xlsx): contains all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
• Lymbery et al 2018 Reduced dataset 1 (Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx): contains the data used in the attached manuscript following the removal of three outliers for the purposes of data distribution, as described in the associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
• Lymbery et al 2018 Reduced dataset 2 (Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx): contains the data used in the statistical analyses for the associated manuscript after the removal of all outliers stated in the manuscript and associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
• Lymbery et al 2018 R Script: contains all the R code used for statistical analysis in this manuscript, with annotations to aid interpretation.
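A minimal R sketch for loading one of the Dryad spreadsheets described above; the sheet names are not documented here, so they are discovered rather than assumed:

```r
# Minimal sketch for loading the full dataset and its legend sheet. Sheet
# names are not documented in this record, so excel_sheets() is used to
# discover them instead of hard-coding assumptions.
library(readxl)

path   <- "Lymbery et al Full Dataset.xlsx"
sheets <- excel_sheets(path)   # expected: one data sheet, one legend sheet

full_data <- read_excel(path, sheet = sheets[1])
legend    <- read_excel(path, sheet = sheets[2])

str(full_data)
```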

  19. Woody habitat corridor data in South West England

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +1more
    zip
    Updated Mar 21, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    R.K. Broughton; F. Gerard; R. Haslam; A.S. Howard (2017). Woody habitat corridor data in South West England [Dataset]. http://doi.org/10.5285/4b5680d9-fdbc-40c0-96a1-4c022185303f
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    R.K. Broughton; F. Gerard; R. Haslam; A.S. Howard
    License

    https://eidc.ac.uk/licences/ogl/plainhttps://eidc.ac.uk/licences/ogl/plain

    Time period covered
    Jul 1, 2013 - Aug 31, 2013
    Area covered
    Description

    This dataset contains polylines depicting non-woodland linear tree and shrub features in Cornwall and much of Devon, derived from lidar data collected by the Tellus South West project. Data from a lidar (light detection and ranging) survey of South West England was used with existing open source GIS datasets to map non-woodland linear features consisting of woody vegetation. The output dataset is the product of several steps of filtering and masking the lidar data using GIS landscape feature datasets available from the Tellus South West project (digital terrain model (DTM) and digital surface model (DSM)), the Ordnance Survey (OS VectorMap District and OpenMap Local, to remove buildings) and the Forestry Commission (Forestry Commission National Forest Inventory Great Britain 2015, to remove woodland parcels). The dataset was tiled as 20 x 20 km shapefiles, coded by the bottom-left 10 km hectad name. Ground-truthing suggests an accuracy of 73.2% for hedgerow height classes.
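The general approach described above can be sketched in R with the terra package; the file names and the 1 m height threshold are placeholders, not the actual Tellus South West, Ordnance Survey, or Forestry Commission inputs:

```r
# Illustrative sketch only: derive a canopy height model from a DSM and DTM,
# then mask out buildings and woodland parcels. File names and the 1 m
# threshold are assumptions; layers are assumed to share the same CRS/extent.
library(terra)

dsm <- rast("tellus_dsm_1m.tif")
dtm <- rast("tellus_dtm_1m.tif")

# Canopy height model: surface minus terrain
chm <- dsm - dtm

# Remove buildings (OS polygons) and woodland parcels (NFI polygons)
buildings <- vect("os_openmap_local_buildings.shp")
woodland  <- vect("nfi_gb_2015_woodland.shp")

chm <- mask(chm, buildings, inverse = TRUE)
chm <- mask(chm, woodland,  inverse = TRUE)

# Keep woody features taller than an assumed 1 m threshold
woody <- chm > 1
```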

  20. Data from: Automatic Definition of Robust Microbiome Sub-states in...

    • zenodo.org
    txt, zip
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
Beatriz García-Jiménez; Mark D. Wilkinson (2020). Data from: Automatic Definition of Robust Microbiome Sub-states in Longitudinal Data [Dataset]. http://doi.org/10.5281/zenodo.167376
    Explore at:
    zip, txtAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Beatriz García-Jiménez; Mark D. Wilkinson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Output files from applying our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to several previously published microbiome datasets.

    Prefixes:

    Suffixes:

    • _All: all taxa

    • _Dominant: only 1% most abundant taxa

    • _NonDominant: remaining taxa after removing above dominant taxa

    • _GenusAll: taxa aggregated at genus level

• _GenusDominant: taxa aggregated at genus level and then restricted to only the 1% most abundant taxa

• _GenusNonDominant: taxa aggregated at genus level and then with the 1% most abundant taxa removed

    Each folder contains 3 output files related to the same input dataset:
    - data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
    - definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
    - sampleId-cluster_pairs_XXX.txt: text file. Two columns, comma separated file: sampleID,clusterID
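A minimal sketch of loading one output folder in R; the "V1_All" prefix/suffix combination, the object names inside the .RData file, and the presence of a header row in the pairs file are assumptions for illustration:

```r
# Minimal sketch (assumes the phyloseq package is installed; the "V1_All"
# file suffix is a hypothetical placeholder for the XXX part of the names).
library(phyloseq)

load("data.normAndDist_definitiveClustering_V1_All.RData")
ls()  # inspect the loaded objects: a phyloseq object and a distance matrix

# Cluster assignments can also be read directly from the pairs file
# (assumes a header row with columns sampleID, clusterID; adjust if absent).
pairs <- read.csv("sampleId-cluster_pairs_V1_All.txt")
table(pairs$clusterID)
```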

    Abstract of the associated paper:

The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have scarcely been studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has yet been agreed upon. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, with respect to the conditions of the study. The robustness of sub-state identification is established by combining diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets so the procedure can be adjusted to the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies of microbiome dynamics.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies

Related Article
Explore at:
zipAvailable download formats
Dataset updated
Dec 7, 2023
Dataset provided by
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
HIV Prevention Trials Networkhttp://www.hptn.org/
HIV Vaccine Trials Networkhttp://www.hvtn.org/
PEPFAR
Authors
Dylan Westfall; Mullins James
License

https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

Description

Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). 
More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
