17 datasets found

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Dec 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Prevention Trials Networkhttp://www.hptn.org/
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
HIV Vaccine Trials Networkhttp://www.hvtn.org/
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.
Z
Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...
data.niaid.nih.gov
Updated Jun 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Price, Juan José; Henningsen, Arne (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
Explore at:
Dataset updated
Jun 17, 2023
Dataset provided by
University of Copenhagen
Universidad Adolfo Ibáñez
Authors
Price, Juan José; Henningsen, Arne
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), sfaR (version 1.0.0), stargazer (version 5.2.3), and "xtable" (version 1.8.4) that are available at CRAN. We created the R package "micEconDistRay" that provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. Also this R package is available at CRAN (https://cran.r-project.org/package=micEconDistRay).

This replication package contains the following files and folders:

README This file

MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

museum: Name of the museum.

type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

munic: Municipality, in which the museum is located.

yr: Year of the observation.

units: Number of visit sites.

resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

vis: Number of (physical) visitors.

aarc: Number of articles published (archeology).

ach: Number of articles published (cultural history).

aah: Number of articles published (art history).

anh: Number of articles published (natural history).

exh: Number of temporary exhibitions.

edu: Number of primary school classes on educational visits to the museum.

ev: Number of events other than exhibitions.

ftesc: Scientific labor (full-time equivalents).

ftensc: Non-scientific labor (full-time equivalents).

expProperty: Running and maintenance costs [1,000 DKK].

expCons: Conservation expenditure [1,000 DKK].

ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.

DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.

IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance functions with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance functions, imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.

allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

/tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

/figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
H
Area Resource File (ARF)
dataverse.harvard.edu
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8NMSFV
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the area resource file (arf) with r the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health services and resources administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an as cii file. this new github repository contains two scripts: 2011-2012 arf - download.R download the zipped area resource file directly onto your local computer load the entire table into a temporary sql database save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta). 2011-2012 arf - analysis examples.R limit the arf to the variables necessary for your analysis sum up a few county-level statistics merge the arf onto other data sets, using both fips and ssa county codes create a sweet county-level map click here to view these two scripts for mo re detail about the area resource file (arf), visit: the arf home page the hrsa data warehouse notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data. confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
d
Census block internal point coordinates and weights formatted specifically...
catalog.data.gov
s.cnmilf.com
Updated Sep 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OP,ORPM (2023). Census block internal point coordinates and weights formatted specifically for use in R code of the Environmental Justice Analysis Multisite (EJAM) tool, USA, 2020, EPA, EPA AO OP ORPM [Dataset]. https://catalog.data.gov/dataset/census-block-internal-point-coordinates-and-weights-formatted-specifically-for-use-in-r-co
Explore at:
Dataset updated
Sep 8, 2023
Dataset provided by
OP,ORPM
Description
This is Census 2020 block data specifically formatted for use by the Environmental Protection Agency (EPA) in-development Environmental Justice Analysis Multisite (EJAM) tool, which uses R code to find which block centroids are within X miles of each specified point (e.g., regulated facility), and to find those distances. The datasets have latitude and longitude of each block's internal point, as provided by Census Bureau, and the FIPS code of the block and its parent block group. The datasets also include a weight for each block, representing this block's Census 2020 population count as a fraction of the count for the parent block group overall, for use in estimating how much of a given block group is within X miles of a specified point or inside a polygon of interest. The datasets also have an effective radius of each block, which is what the radius would be in miles if the block covered the same area in square miles but were circular. The datasets also have coordinates in units that facilitate building a quadtree index of locations. They are in R data.table format, saved as .rda or .arrow files to be read by R code.
Storage and Transit Time Data and Code
zenodo.org
zip
Updated Nov 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14171251
Dataset updated
Nov 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrew Felton; Andrew Felton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Andrew J. Felton
Date: 11/15/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated throughout the peer review process.

#Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

#Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a role:

"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

"02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.

"04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
f
Table of rcprd functions.
plos.figshare.com
xls
Updated Aug 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Pate; Rosa Parisi; Evangelos Kontopantelis; Matthew Sperrin (2025). Table of rcprd functions. [Dataset]. http://doi.org/10.1371/journal.pone.0327229.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0327229.t001
Dataset updated
Aug 19, 2025
Dataset provided by
PLOS ONE
Authors
Alexander Pate; Rosa Parisi; Evangelos Kontopantelis; Matthew Sperrin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Clinical Practice Research Datalink (CPRD) is a large and widely used resource of electronic health records from the UK, linking primary care data to hospital data, death registration data, cancer registry data, deprivation data and mental health services data. Extraction and management of CPRD data is a computationally demanding process and requires a significant amount of work, in particular when using R. The rcprd package simplifies the process of extracting and processing CPRD data in order to build datasets ready for statistical analysis. Raw CPRD data is provided in thousands of.txt files, making querying this data cumbersome and inefficient. rcprd saves the relevant information into an SQLite database stored on the hard drive which can then be queried efficiently to extract required information about individuals. rcprd follows a four-stage process: 1) Definition of a cohort, 2) Read in medical/prescription data and save into an SQLite database, 3) Query this SQLite database for specific codes and tests to create variables for each individual in the cohort, 4) Combine extracted variables into a dataset ready for statistical analysis. Functions are available to extract common variable types (e.g., history of a condition, or time until an event occurs, relative to an index date), and more general functions for database queries, allowing users to define their own variables for extraction. The entire process can be done from within R, with no knowledge of SQL required. This manuscript showcases the functionality of rcprd by running through an example using simulated CPRD Aurum data. rcprd will reduce the duplication of time and effort among those using CPRD data for research, allowing more time to be focused on other aspects of research projects.
Kickastarter Campaigns
kaggle.com
zip
Updated Jan 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
Explore at:
zip(2233314 bytes)Available download formats
Dataset updated
Jan 25, 2024
Authors
Alessio Cantara
Description
Welcome to my Kickstarter case study! In this project I’m trying to understand what the success’s factors for a Kickstarter campaign are, analyzing an available public dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

ASK

Different questions will guide my analysis: 1. Is the campaign duration influencing the success of the project? 2. Is it the chosen funding budget? 3. Which category of campaign is the most likely to be successful?

PREPARE

I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month and all the data are divided into CSV files. Each table contains: - backers_count : number of people that contributed to the campaign - blurb : a captivating text description of the project - category : the label categorizing the campaign (technology, art, etc) - country - created_at : day and time of campaign creation - deadline : day and time of campaign max end - goal : amount to be collected - launched_at : date and time of campaign launch - name : name of campaign - pledged : amount of money collected - state : success or failure of the campaign

Each month scraping produce a huge amount of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I’ve downloaded zipped files which once unzipped contained respectively: 7 CSVs (November 2023), 8 CSVs (December 2023), 8 CSVs (January 2024). Each month was divided into a specific folder.

Having a first look at the spreadsheets, it’s clear that there is some need for cleaning and modification: for example, dates and times are shown in Unix code, there are multiple columns that are not helpful for the scope of my analysis, currencies need to be uniformed (some are US$, some GB£, etc). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

PROCESS

I decided to use R to clean and process the data. For each month I started setting a new working environment in its own folder. After loading the necessary libraries: R library(tidyverse) library(lubridate) library(ggplot2) library(dplyr) library(tidyr) I scripted a general R code that searches for CSVs files in the folder, open them as separate variable and into a single data frame:

csv_files <- list.files(pattern = "\\.csv$") data_frames <- list() for (file in csv_files) { variable_name <- sub("\\.csv$", "", file) assign(variable_name, read.csv(file)) data_frames[[variable_name]] <- get(variable_name) }

Next, I converted some columns in numeric values because I was running into types error when trying to merge all the CSVs into a single comprehensive file.

data_frames <- lapply(data_frames, function(df) { df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount) return(df) }) data_frames <- lapply(data_frames, function(df) { df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate) return(df) }) data_frames <- lapply(data_frames, function(df) { df$usd_pledged <- as.numeric(df$usd_pledged) return(df) })

In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

all_nov_2023 = bind_rows(data_frames) all_dec_2023 = bind_rows(data_frames) all_jan_2024 = bind_rows(data_frames)`

After merging I converted the UNIX code datestamp into a readable datetime for the columns “created”, “launched”, “deadline” and deleted all the columns that had these data set to 0. I also filtered the values into the “slug” columns to show only the category of the campaign, without unnecessary information for the scope of my analysis. The final table was then saved.

filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>% filter(created_at != 0 & deadline != 0 & launched_at != 0) %>% mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>% mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>% mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>% select(-category, -deadline, -launched_at, -created_at) %>% relocate(created, launched, setted_deadline, .before = goal) write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)

The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2...
zenodo.org
application/gzip
Updated Jul 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi; João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi (2024). Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python [Dataset]. http://doi.org/10.5281/zenodo.11584961
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11584961
Dataset updated
Jul 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi; João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Replication Package

This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

Requirements

We recommend the following requirements to replicate our study:

Internet access

At least 100GB of space

Docker installed

Git installed

Package Structure

We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

data-analysis, an R-based Container we used to run our data analysis.

data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.

database, a Postgres Container we used to store clients' data, obtainer from Grotov et al.

storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared in both containers.

docker-compose.yml, the Docker file that configures all containers used in the package.

In the remainder of this document, we describe how to set up each container properly.

Using VSCode to Setup the Package

We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both data-analysis and data-collection containers. This way you can directly access and run each container inside it without any specific configuration.

You first need to set up the containers

$ cd /replication/package/folder $ docker-compose build $ docker-compose up # Wait docker creating and running all containers

Then, you can open them in Visual Studio Code:

Open VSCode in project root folder

Access the command palette and select "Dev Container: Reopen in Container"

Select either Data Collection or Data Analysis.

Start working

If you want/need a more customized organization, the remainder of this file describes it in detail.

Longest Road: Manual Package Setup

Database Setup

The database container will automatically restore the dump in dump_matroskin.tar in its first launch. To set up and run the container, you should:

Build an image:

$ cd ./database $ docker build --tag 'dabc-database' . $ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB

Create and enter inside the container:

$ docker run -it --name dabc-database-1 dabc-database $ docker exec -it dabc-database-1 /bin/bash root# psql -U postgres -h localhost -d jupyter-notebooks jupyter-notebooks=# \dt List of relations Schema | Name | Type | Owner --------+-------------------+-------+------- public | Cell | table | root public | Code_cell | table | root public | Md_cell | table | root public | Notebook | table | root public | Notebook_features | table | root public | Notebook_metadata | table | root public | repository | table | root

If you got the tables list as above, your database is properly setup.

It is important to mention that this database is extended from the one provided by Grotov et al.. Basically, we added three columns in the table Notebook_features (API_functions_calls, defined_functions_calls, andother_functions_calls) containing the function calls performed by each client in the database.

Data Collection Setup

This container is responsible for collecting the data to answer our research questions. It has the following structure:

dabcs.py, extract DABCs from Scikit Learn source code, and export them to a CSV file.

dabcs-clients.py, extract function calls from clients and export them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the `matroskin`` directory.

Makefile, commands to set up and run both dabcs.py and dabcs-clients.py

matroskin, the directory containing the modified version of matroskin tool. We extended the library to collect the function calls performed on the client notebooks of Grotov's dataset.

storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.

requirements.txt, Python dependencies adopted in this module.

Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download scikit learn source code, etc. For this, you must run the following commands:

$ cd ./data-collection $ docker build --tag "data-collection" . $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection $ docker exec -it data-collection-1 /bin/bash $ ls Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py

If you see project files, it means the container is configured accordingly.

Data Analysis Setup

We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:

dependencies.R, an R script containing the dependencies used in our data analysis.

data-analysis.Rmd, the R notebook we used to perform our data analysis

datasets, a docker volume pointing to the storage directory.

Execute the following commands to run this container:

$ cd ./data-analysis $ docker build --tag "data-analysis" . $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis $ docker exec -it data-analysis-1 /bin/bash $ ls data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile

If you see project files, it means the container is configured accordingly.

A note on storage shared folder

As mentioned, the storage folder is mounted as a volume and shared between data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting working on Data Collection or Data Analysis, make sure you extracted the compressed files. You can do this by running the Makefile inside storage folder.

$ make unzip # extract files $ ls clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv $ make zip # compress files $ ls csv-files.tar.gz Makefile
H
Replication Data for: Virtual Reality Embodiment Experiences Reduce...
dataverse.harvard.edu
Updated Nov 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umutcan Ay; Fotini Christia (2025). Replication Data for: Virtual Reality Embodiment Experiences Reduce Transgender Prejudice [Dataset]. http://doi.org/10.7910/DVN/X6VOXQ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/X6VOXQ
Dataset updated
Nov 12, 2025
Dataset provided by
Harvard Dataverse
Authors
Umutcan Ay; Fotini Christia
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Purpose. This dataset accompanies the paper Virtual Reality Embodiment Experiences Reduce Transgender Prejudice and is released to enable full replication, verification, and extension of the study’s findings. The materials reproduce every table and figure in the manuscript and support re-analysis of the main, placebo, and heterogeneity results. Nature. The study is a pre-registered, randomized controlled trial conducted with Turkish university students, comparing a culturally tailored 180° VR embodiment experience to a standard video intervention in a 2×2 factorial design. Survey data were collected at baseline (T₀), immediately post-treatment (T₁), and endline (T₂, four weeks later). Outcomes cover (i) transgender attitudes and beliefs and (ii) animalistic/mechanistic dehumanization indices, alongside placebo outcomes unrelated to gender identity. The deposited data are anonymized; no direct identifiers are included. Code is provided as documented R scripts that perform data cleaning, imputation, index construction (including PCA-based indices), estimation of treatment effects, placebo checks, and heterogeneity analyses (e.g., by right-wing authoritarianism, social dominance orientation, religiosity, and field of study). Scope. The deposit contains: RawData (anonymized surveys across T₀/T₁/T₂ and study-field/treatment distribution files), IntermediateData (cleaned, imputed, indexed datasets created by the scripts), FinalData (analysis-ready merged files for estimation), Core analysis objects (e.g., mit-google-experiment-data-survey.RDS, df, df-did), Scripts to replicate all tables/figures (see RunAll.R) and to export LaTeX tables and publication-ready plots, OutcomeFamily1/2 folders with model outputs and Bayes test results, Placebo analyses, ExperimentFootage (treatment and control videos shown in VR/video conditions), Surveys (English-language PDFs of the instruments). Running RunAll.R will regenerate the manuscript’s figures and tables and save them to the Plots and Tables folders. The dataset is intended for scholarly, non-commercial use to study VR’s effects on prejudice reduction and dehumanization in a Global South context. Please consult the included README for file structure and workflow details, and observe the license/usage terms specified in this Dataverse record. For questions, contact Umutcan Ay (ayh16@mit.edu).
H
Hazardous Times
dataverse.harvard.edu
Updated Aug 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
George Robert Lefter (2025). Hazardous Times [Dataset]. http://doi.org/10.7910/DVN/MP6L5B
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/MP6L5B
Dataset updated
Aug 27, 2025
Dataset provided by
Harvard Dataverse
Authors
George Robert Lefter
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This repository contains the replication package for the article "Hazardous Times: Animal Spirits and U.S. Recession Probabilities." It includes all necessary R code, raw data, and processed data in the start-stop (counting process) format required to reproduce the empirical results, tables, and figures from the study. Project Description: The study assembles monthly U.S. macroeconomic time series from the Federal Reserve Economic Data (FRED) and related sources—covering labor market conditions, consumer sentiment, term spreads, and credit spreads—and implements a novel "high water mark" methodology to measure the lead times with which these indicators signal NBER-dated recessions. Contents: Code: R scripts for data cleaning, multiple imputation, survival analysis, and figure/table generation. A top-level master script (run_all.R) executes the entire analytical pipeline end-to-end. Data: Raw/: Original data pulls from primary sources. Analysis_Ready/: Cleaned series, constructed cycle-specific extremes (high water marks), lead time variables, and the final start-stop dataset for survival analysis. The final curated Excel workbooks used as direct inputs for the replication code. (Note: These Excel sheets must be saved as separate .xlsx files in the designated directory before running the R code.) Documentation: This README file and detailed comments within the code. Key Details: Software Requirements: The replication code is written in R. A list of required R packages (with versions) is provided in the reference list of the article. Missing Data: Addressed via Multiple Imputation by Chained Equations (MICE). License: The original raw data from FRED is subject to its own terms of use, which require citation. The R code is released under the MIT License. All processed data, constructed variables, and analysis-ready datasets created by the author are dedicated to the public domain under the CC0 1.0 Universal Public Domain Dedication. Instructions: Download the entire repository. Install the required R packages. Save Excel sheets from the workbook “Hazardous_Times_Data.xlsx” as separate .xlsx files in the designated directory before running the R code in step 4. Run the master script run_all.R to fully replicate the study's analysis from the provided Analysis_Ready data. This script will regenerate all tables and figures. Users should consult the main publication for full context, theoretical motivation, and series-specific citations.
dailyActivity_altered
kaggle.com
zip
Updated Nov 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carmen Felder (2022). dailyActivity_altered [Dataset]. https://www.kaggle.com/datasets/carmenfelder/dailyactivity-altered
Explore at:
zip(28109 bytes)Available download formats
Dataset updated
Nov 14, 2022
Authors
Carmen Felder
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is to be used with FitBit Fitness Tracker . I added columns to aid my cleaning and analysis. I wanted to visualize the total activity minutes and percentage of activity for users. I manipulated the data in Excel and then saved as a .csv in order to use the data in R for further analysis.
d
National Survey on Drug Use and Health (NSDUH)
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Damico, Anthony (2023). National Survey on Drug Use and Health (NSDUH) [Dataset]. http://doi.org/10.7910/DVN/ZIGNUL
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ZIGNUL
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the national survey on drug use and health (nsduh) with r the national survey on drug use and health (nsduh) monitors illicit drug, alcohol, and tobacco use with more detail than any other survey out there. if you wanna know the average age at first chewing tobacco dip, the prevalence of needle-sharing, the family structure of households with someone abusing pain relievers, even the health insurance coverage of peyote users, you are in the right place. the substance abuse and mental health services administration (samhsa) contracts with the north carolinians over at research triangle institute to run the survey, but the university of michigan's substance abuse and mental health data archive (samhda) holds the keys to this data castle. nsduh in its current form only goes back about a decade, when samhsa re-designed the methodology and started paying respondents thirty bucks a pop. before that, look for its predecessor - the national household survey on drug abuse (nhsda) - with public use files available back to 1979 (included in these scripts). be sure to read those changes in methodo logy carefully before you start trying to trend smokers' virginia slims brand loyalty back to 1999. although (to my knowledge) only the national health interview survey contains r syntax examples in its documentation, the friendly folks at samhsa have shown promise. since their published data tables were run on a restricted-access data set, i requested that they run the same sudaan analysis code on the public use files to confirm that this new r syntax does what it should. they delivered, i matched, pats on the back all around. if you need a one-off data point, samhda is overflowing with options to analyze the data online. you even might find some restricted statistics that won't appear in the public use files. still, that's no substitute for getting your hands dirty. when you tire of menu-driven online query tools and you're ready to bark with the big data dogs, give these puppies a whirl. the national survey on drug use and health targets the civilian, noninstitutionalized population of the united states aged twelve and older. this new github repository contains three scripts: 1979-2011 - download all microdata.R authenticate the university of michi gan's "i agree with these terms" page download, import, save each available year of data (with documentation) back to 1979 convert ea ch pre-packaged stata do-file (.do) into r, run the damn thing, get NAs where they belong 2010 single-year - analysis examples.R load a single year of data limit the table to the variables needed for an example analysis construct the complex sample survey object run enough example analyses to make a kitchen sink jealous replicate sam hsa puf.R load a single year of data limit the table to the variables needed for an example analysis construct the complex sample survey object print statistics and standard errors matching the target replicati on table click here to view these three scripts for more detail about the national survey on drug use and health, visit: the substance abuse and mental health services administration's nsduh homepage research triangle in stitute's nsduh homepage the university of michigan's nsduh homepage notes: the 'download all microdata' program intentionally breaks unless you complete the clearly-defined, one-step instruction to authenticate that you have read and agree with the download terms. the script will download the entire public use file archive, but only after this step has been completed. if you c ontact me for help without reading those instructions, i reserve the right to tease you mercilessly. also: thanks to the great hadley wickham for figuring out how to authenticate in the first place. confidential to sas, spss, stata, and sudaan users: did you know that you don't have to stop reading just because you've run out of candlewax? maybe it's time to switch to r. :D
d
Supplementary materials for: \"Comparing Internet experiences and...
dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hargittai, Eszter; Shaw, Aaron (2023). Supplementary materials for: \"Comparing Internet experiences and prosociality in Amazon Mechanical Turk and population-based survey samples\" [Dataset]. http://doi.org/10.7910/DVN/UFL6MI
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/UFL6MI
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Hargittai, Eszter; Shaw, Aaron
Description
Overview Supplementary materials for the paper "Comparing Internet experiences and prosociality in Amazon Mechanical Turk and population-based survey samples" by Eszter Hargittai and Aaron Shaw published in Socius in 2020 (https://doi.org/10.1177/2378023119889834). License The materials provided here are issued under the same (Creative Commons Attribution Non-Commercial 4.0) license as the paper. Details and a copy of the license are available at: http://creativecommons.org/licenses/by-nc/4.0/. Manifest The files included are: Hargittai-Shaw-AMT-NORC-2019.rds and Hargittai-Shaw-AMT-NORC-2019.tsv: Two (identical) versions the dataset used for the analysis. The tsv file is provided to facilitate import into software other than R. R analysis code files: 01-import.R - Imports dataset. Creates a mapping of dependent variables and variable names used elsewhere in the figure and analysis. 02-gen_figure.R - Generates Figure 1 in PDF and PNG formats and saves them in the "figures" directory. 03-gendescriptivestats.R - Generates results reported in Table 1. 04-gen_models.R - Fits models reported in Tables 2-4. 05-alternative_specifications.R - Fits models using log-transformed version of the income variable. Makefile: Executes all of the R files in sequence, produces corresponding .log files in the "log" directory that contain the full R session from each file as well as separate error log files (also in the "log" directory) that capture any error messages and warnings generated by R along the way. HargittaiShaw2019Socius-Instrument.pdf: The questions distributed to both the NORC and AMT survey participants used in the analysis reported in this paper. How to reproduce the analysis presented in the paper Depending on your computing environment, reproducing the analysis presented in the paper may be as easy as invoking "make all" or "make" in the directory containing this file on a system that has the appropriate software installed. Once compilation is complete, you can review the log files in a text editor. See below for more on software and dependencies. If calling the makefile fails, the individual R scripts can also be run interactively or in batch mode. Software and dependencies The R and compilation materials provided here were created and tested on a 64-bit laptop pc running Ubuntu 18.04.3 LTS, R version 3.6.1, ggplot2 version 3.2.1, reshape2 version 1.4.3, forcats version 0.4.0, pscl version 1.5.2, and stargazer version 5.2.2 (these last five are R packages called in specific .R files). As with all software, your mileage may vary and the authors provide no warranties. Codebook The dataset consists of 36 variables (columns) and 2,716 participants (rows). The variable names and brief descriptions follow below. Additional details of measurement are provided in the paper and survey instrument. All dichotomous indicators are coded 0/1 where 1 is the affirmative response implied by the variable name: id: Index to identify individual units (participants). svy_raked_wgt: Raked survey weights provided by NORC. amtsample: Data source coded 0 (NORC) or 1 (AMT). age: Participant age in years. female: Participant selected "female" gender. incomecont: Income in USD (continuous) coded from center-points of categories reported in the instruments. incomediv: Income in $1,000s USD (=incomecont/1000). incomesqrt: Square-root of incomecont. lincome: Natural logarithm of incomecont. rural: Participant resides in a rural area. employed: Participant is fully or partially employed. eduhsorless: Highest education level is high school or less. edusc: Highest education level is completed some college. edubaormore: Highest education level is BA or more. white: Race = white. black: Race = black. nativeam: Race = native american. hispanic: Ethnicity = hispanic. asian: Race = asian. raceother: Race = other. skillsmean: Internet use skills index (described in paper). accesssum: Internet use autonomy (described in paper). webweekhrs: Internet use frequency (described in paper). do_sum: Participatory online activities (described in paper). snssumcompare: Social network site activities (described in paper). altru_scale: Generous behaviors (described in paper). trust_scale: Trust scale score (described in paper). pts_give: Points donated in unilateral dictator game (described in paper). std_accesssum: Standardized (z-score) version of accesssum. std_webweekhrs: Standardized (z-score) version of webweekhrs. std_skillsmean: Standardized (z-score) version of skillsmean. std_do_sum: Standardized (z-score) version of do_sum. std_snssumcompare: Standardized (z-score) version of snssumcompare. std_trust_scale: Standardized (z-score) version of trust_scale. std_altru_scale: Standardized (z-score) version of altru_scale. std_pts_give: Standardized (z-score) version of pts_give.
r
Grid Garage ArcGIS Toolbox
researchdata.edu.au
data.nsw.gov.au
Updated Sep 6, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NSW Department of Climate Change, Energy, the Environment and Water (2018). Grid Garage ArcGIS Toolbox [Dataset]. https://researchdata.edu.au/grid-garage-arcgis-toolbox/1342780
Explore at:
Dataset updated
Sep 6, 2018
Dataset provided by
data.nsw.gov.au
Authors
NSW Department of Climate Change, Energy, the Environment and Water
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
The Grid Garage Toolbox is designed to help you undertake the Geographic Information System (GIS) tasks required to process GIS data (geodata) into a standard, spatially aligned format. This format is required by most, grid or raster, spatial modelling tools such as the Multi-criteria Analysis Shell for Spatial Decision Support (MCAS-S). Grid Garage contains 36 tools designed to save you time by batch processing repetitive GIS tasks as well diagnosing problems with data and capturing a record of processing step and any errors encountered.\r \r Grid Garage provides tools that function using a list based approach to batch processing where both inputs and outputs are specified in tables to enable selective batch processing and detailed result reporting. In many cases the tools simply extend the functionality of standard ArcGIS tools, providing some or all of the inputs required by these tools via the input table to enable batch processing on a 'per item' basis. This approach differs slightly from normal batch processing in ArcGIS, instead of manually selecting single items or a folder on which to apply a tool or model you provide a table listing target datasets. In summary the\r Grid Garage allows you to:\r \r * List, describe and manage very large volumes of geodata.\r * Batch process repetitive GIS tasks such as managing (renaming, describing etc.) or processing (clipping, resampling, reprojecting etc.) many geodata inputs such as time-series geodata derived from satellite imagery or climate models.\r * Record any errors when batch processing and diagnose errors by interrogating the input geodata that failed.\r * Develop your own models in ArcGIS ModelBuilder that allow you to automate any GIS workflow utilising one or more of the Grid Garage tools that can process an unlimited number of inputs.\r * Automate the process of generating MCAS-S TIP metadata files for any number of input raster datasets.\r \r The Grid Garage is intended for use by anyone with an understanding of GIS principles and an intermediate to advanced level of GIS skills. Using the Grid Garage tools in ArcGIS ModelBuilder requires skills in the use of the ArcGIS ModelBuilder tool.\r \r Download Instructions: Create a new folder on your computer or network and then download and unzip the zip file from the GitHub Release page for each of the following items in the 'Data and Resources' section below. There is a folder in each zip file that contains all the files. See the Grid Garage User Guide for instructions on how to install and use the Grid Garage Toolbox with the sample data provided. \r \r
Pre-processed PubMed data for a study of coauthorship
zenodo.org
data.niaid.nih.gov
bin
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cory Brunson; Xiaoyan Wang; Cory Brunson; Xiaoyan Wang (2020). Pre-processed PubMed data for a study of coauthorship [Dataset]. http://doi.org/10.5281/zenodo.345934
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.345934
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Cory Brunson; Xiaoyan Wang; Cory Brunson; Xiaoyan Wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was collected from the PubMed portal to MEDLINE and other repositories of biomedical research (https://www.ncbi.nlm.nih.gov/pubmed/). Analysis of the dataset led to the paper "Effects of research complexity and competition on the incidence and growth of coauthorship in biomedicine", published in PLOS One (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173444). The raw data were pre-processed using the script "clean.r" in the project directory on GitHub (https://github.com/corybrunson/coauthor) to obtain the file presented here.

The dataset is formatted as a data table (https://cran.r-project.org/web/packages/data.table/index.html), a class of data frame in R, and saved as a .RData file, which can be loaded into an R session via `load("path/to/dataset/pmDat.RData")`. The fields are as follows:

`pmid` - the unique publication identifier (PMID) used by PubMed

`jid` - the unique journal identifier used by PubMed

`issn` - the (print) ISSN of the journal

`ym` - the month and year of publication

`nau` - the number of authors credited by the publication (up to any limits imposed by PubMed, and counting each author collective as a single author)

`cau` - whether any corporate author was credited

`rev` - whether the publication was tagged as a review

`trial` - whether the publication was tagged as a clinical trial

`npmt` - the number of MeSH terms assigned to the publication that were flagged as "major" topics

`nmh` - the number of top-level MeSH headings assigned to the publication

`supp` - whether the publication was tagged as having received financial support

`ng` - the number of grants acknowledged by the publication

`co` - the country in which the journal was published

Note that the field values for any publication can be validated by searching for the PMID in PubMed.

Serial Non-contrast Non-gated T2w MRI Datasets of Patient-derived Xenograft...

cancerimagingarchive.net

dicom, docx, n/a, pdf +1

Updated Jun 14, 2023

Facebook

Twitter

Click to copy link

Link copied

Cite

The Cancer Imaging Archive (2023). Serial Non-contrast Non-gated T2w MRI Datasets of Patient-derived Xenograft Cancer Models for Development of Tissue Characterization Algorithms [Dataset]. http://doi.org/10.7937/3KQ0-YK19

Explore at:

pdf, dicom, docx, n/a, xlsxAvailable download formats

Unique identifier

https://doi.org/10.7937/3KQ0-YK19

Dataset updated

Jun 14, 2023

Dataset authored and provided by

The Cancer Imaging Archive

License

https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

Time period covered

Jun 14, 2023

Dataset funded by

National Cancer Institutehttp://www.cancer.gov/

Description

This collection contains serial non-contrast non-gated T2w MRI of 18 patient derived xenograft cancer models. 175 mice were imaged at multiple time points (514 total studies) for researchers to develop algorithms using neural networks, and classification techniques to improve tissue characterization (morphological changes) for the improvement in patient care through advances in precision medicine.

Characterization of tissue using non-invasive in vivo imaging techniques is used for detection and measurement of disease burden in oncology. Researchers have developed numerous algorithms, such as neural networks, and classification techniques to improve the characterization (morphological changes) of tissue. Unfortunately, to obtain statistical significance, large datasets are a requirement in this research endeavor due to tumor heterogeneity within the same histologic classification. Pre-clinical patient derived xenograft animal models can be a significant resource by providing collections with a more homogenous tumor genome across the collection with companion genomic and pathologic characterization available (https://pdmr.cancer.gov/), allowing determination of the variability of imaging characteristics.

This dataset of a patient derived xenograft model (below table) can be used for training algorithms for evaluating variations in tissue texture with respect to tumor growth and cancer model.

PDX Model Characterizations and Biweekly imaging sessions

				Characterization			1	2	3	4	5	6	7	8
Model ID	CTEP SDC Description	Disease Body Location	Biopsy site	Implant Date	Passage	Sex	# Mice imaged per biweekly imaging session
144126-210-T Images	Neuroendocrine cancer, NOS	Endocrine and Neuroendocrine	* Liver	2/14/2020	4	M	8	8	5	5	5	5
146476-266-R Images	Urothelial/bladder cancer, NOS	Genitourinary	Bladder	2/3/2020	4	M	17	16	13	10	11	4	4
165739-295-R Images	Adenocarcinoma-pancreas	Digestive/Gastrointestinal	Pancreas	5/4/2018	2	M	10	10	1
172845-121-T Images	Adenocarcinoma-colon	Digestive/Gastrointestinal	* Liver	10/16/2020	4	F	20	20
172845-142-T Images	Adenocarcinoma-colon	Digestive/Gastrointestinal	* Liver	8/24/2018	3	F	15	13	8	3
287954-098-R Images	Ewing sarcoma/Peripheral PNET	Musculoskeletal	* Pelvis	3/18/2021	6	M	10	8	1	1	1
466636-057-R Images	Adenocarcinoma-pancreas	Digestive/Gastrointestinal	Pancreas	12/15/2017	N/A	M	5	4	2	1
521955-158-R4 Images	Adenocarcinoma-pancreas	Digestive/Gastrointestinal	* Tumor in colonic fat	9/30/2021	4	F	10	10	10	8	5	1	1
521955-158-R6 Images	Adenocarcinoma-pancreas	Digestive/Gastrointestinal	* Myometrium	3/27/2018	N/A	F	7	7	4
625472-104-R Images	Adenocarcinoma-colon	Digestive/Gastrointestinal	* Shoulder	8/27/2019	2	F	9	1
695669-166-R Images	Melanoma	Skin	Arm	4/16/2021	3	M	7	8	8	6	4	4	2	2
698357-238-R Images	Osteosarcoma	Musculoskeletal	Scapula	3/5/2021	6	F	7	4
765638-272-R Images	Squamous cell lung carcinoma	Respiratory/Thoracic	* Liver	3/26/2021	4	F	7	8	5	3	1
779769-127-R Images	Adenocarcinoma-rectum	Digestive/Gastrointestinal	Rectum	2/19/2020	5	F	5	5	5	5	3	4
833975-119-R <a

Z
Data from: Code and Data Schimmelradar manuscript 1.1
data-staging.niaid.nih.gov
Updated Apr 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kortenbosch, Hylke (2025). Code and Data Schimmelradar manuscript 1.1 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14851614
Explore at:
Dataset updated
Apr 3, 2025
Dataset provided by
Wageningen University & Research
Authors
Kortenbosch, Hylke
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Read me – Schimmelradar manuscript

The code in this repository was written to analyse the data and generate figures for the manuscript “Land use drives spatial structure of drug resistance in a fungal pathogen”.

This repository consists of two original .csv raw data files, 2 .tif files that are minimally reformatted after being downloaded from LGN.nl and www.pdok.nl/introductie/-/article/basisregistratie-gewaspercelen-brp-, and 9 scripts using the R language. The remaining files include intermediate .tif and .csv files to skip more computationally heavy steps of the analysis and facilitate the reproduction of the analysis.

Data files:§1

Schimmelradar_360_submission.csv: The raw phenotypic resistance spatial data from the air sample

Sample: an arbitrary sample code given to each of the participants

Area: A random number assigned to each of the 100 areas the Netherlands was split up into to facilitate an even spread of samples across the country during participant selection.

Logistics status: Variable used to indicate whether the sample was returned in good order, not otherwise used in the analysis.

Arrived back on: The date by which the sample arrived back at Wageningen University

Quality seals: quality of the seals upon sample return, only samples of a quality designated as good seals were processed. (also see Supplement file – section A).

Start sampling: The date on which the trap was deployed and the stickers exposed to the air, recorded by the participant

End sampling: The date on which the trap was taken down and the stickers were re-covered and no longer exposed to the air, recorded by the participant

3 back in area?: Binary indicating whether at least three samples have been returned in the respective area (see Area)

Batch: The date on which processing of the sample was started. To be more specific, the date at which Flamingo medium was poured over the seals of the sample and incubation was started.

Lab processing: Binary indication completion of lab processing

Tot ITR: A. fumigatus CFU count in the permissive layer of the itraconazole-treated plate

RES ITR: CFU count of colonies that had breached the surface of the itraconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

RF ITR: The itraconazole (~4 mg/L) resistance fraction = RES ITR/Tot ITR

Muccor ITR: Indication of the presence of Mucorales spp. growth on the itraconazole treatment plate

Tot VOR: A. fumigatus CFU count in the permissive layer of the voriconazole-treated plate

RES VOR: CFU count of colonies that had breached the surface of the voriconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.

RF VOR: The voriconazole (~2 mg/L) resistance fraction = RES VOR/Tot VOR

Muccor VOR: Indication of the presence of Mucorales spp. growth on the voriconazole treatment plate

Tot CON: CFU count on the untreated growth control plate Note: note on the sample based on either information given by the participant or observations in the lab. The exclude label was given if the sample had either too little (<25) or too many (>300) CFUs on one or more of the plates (also see Supplement file – section A).

Lat: Exact latitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

Long: Exact longitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.

Round_Lat: Rounded latitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

Round_Long: Rounded longitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.

Analysis_genotypic_schimmelradar_TR_types.csv: The genotype data inferred from gel electrophoresis for all resistant isolates

TR type: Indicates the length of the tandem repeats in bp, as judged from a gel. 34 bp, 46 bp, or multiples of 46.

Plate: 96-well plate on which the isolate was cultured

96-well: well in which the isolate was cultured

Azole: Azole on which the isolate was grown and resistant to. Itraconazole (ITRA) or Voriconazole (VORI).

Sample: The air sample the isolate was taken from, corresponds to “Sample” in “Schimmelradar_360_submission.csv”.

Strata: The number that equates to “Area” in “Schimmelradar_360_submission.csv”.

WT: A binary that indicates whether an isolate had a wildtype cyp51a promotor.

TR34: A binary that indicates whether an isolate had a TR34 cyp51a promotor.

TR46: A binary that indicates whether an isolate had a TR46 cyp51a promotor.

TR46_3: A binary that indicates whether an isolate had a TR46*3 cyp51a promotor.

TR46_4: A binary that indicates whether an isolate had a TR46*4 cyp51a promotor.

Script 1 - generation_100_equisized_areas_NL

NOTE: Running this code is not necessary for the other analyses, it was used primarily for sample selection. The area distribution was used during the analysis in script 2B, yet each sample is already linked to an area in “Schimmelradar_360_submission.csv". This script was written to generate a spatial polygons data frame of 100 equisized areas of the Netherlands. The registrations for the citizen science project Schimmelradar were binned into these areas to facilitate a relatively even distribution of samples throughout the country which can be seen in Figure S1. The spatial polygons data frame can be opened and displayed in open-source software such as QGIS. The package “spcosa” used to generate the areas has RJava as a dependency, so having Java installed is required to run this script. The R script uses a shapefile of the Netherlands from the tmap package to generate the areas within the Netherlands. Generating a similar distribution for other countries will require different shape files!

Script 2 - Spatial_data_integration_fungalradar

This script produces 4 data files that describe land use in the Netherlands: The three focal.RData files with land use and resistant/colony counts, as well as the “Predictor_raster_NL.tif” land use file.

In this script, both the phenotypic and genotypic resistance spatial data from the air samples taken during the Fungal radar citizen science project are integrated with the land use and weather data used to model them. It is not recommended to run this code because the data extraction is fairly computationally demanding and it does not itself contain key statistical analyses. Rather it is used to generate the objects used for modelling and spatial predictions that are also included in this repository.

The phenotypic resistance is summarised in Table 1, which is generated in this script. Subsequently, the spatial data from the LNG22 and BRP datasets are integrated into the data. These dataset can be loaded from the "LGN2022.tif" and "Gewas22rast.tiff" raster files, respectively. Link to webpages where these files can be downloaded can found in the code.

Once the raster files are loaded, the code generates heatmaps and calculates the proportions of all the land use classes in both a 5 and 10-km radius around every sample and across the country to make spatial predictions. Only the 10 km radius data are used in the later analysis, but the 5 km radius was generated to test if that radius would be more appropriate, during an earlier stage of the analyses, and was left in for completeness. For documentation of the LGN22 data set, we refer to https://lgn.nl/documentatie and for BRP to https://nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/44e6d4d3-8fc5-47d6-8712-33dd6d244eef, both of these online resources are in Dutch but can be readily translated. A list of the variables that were included from these datasets during model selection can be found in Table S3. Alongside land-use data, the code extracts weather data from datafiles that can be downloaded from https://cds.climate.copernicus.eu/datasets/sis-agrometeorological-indicators?tab=download for the Netherlands during the sampling window, dates and dimensions are listed within the code. The Weather_schimmelradar folder contains four subfolders for each weather variable that was considered during modelling: temperature, wind speed, precipitation and humidity. Each of these subfolders contains 44 .nc files that each cover the daily mean of the respective weather variable across the Netherlands for each of the 44 days of the sampling window the citizen scientists were given.

All spatial objects weather + land use are eventually merged into one predictor raster "Predictor_raster_NL.tif". The land use fractions and weather data are subsequently integrated with the air sample data into a single spatial data frame along with the resistance data and saved into an R object "Schimmelradar360spat_focal.RData". The script concludes by merging the cyp51a haplotype data with this object as well, to create two different objects: "Schimmelradar360spat_focal_TR_VORI.RData" for the haplotype data of the voriconazole resistant isolates and "Schimmelradar360spat_focal_TR_ITRA.RData" including the haplotype data of itraconazole resistant isolates. These two datasets are modeled separately in scripts 5,9 and 6,8, respectively. This final section of the script also generates summary table S2, which summarises the frequency of the cyp51a haplotypes per azole treatment.

If the relevant objects are loaded
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5061/dryad.w3r2280w0

Dataset updated

Dec 7, 2023

Dataset provided by

HIV Prevention Trials Networkhttp://www.hptn.org/
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
HIV Vaccine Trials Networkhttp://www.hvtn.org/
PEPFAR

Authors

Dylan Westfall; Mullins James

License

https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

Description

Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

Clear search

Close search

Google apps

Main menu

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...

Area Resource File (ARF)

Census block internal point coordinates and weights formatted specifically...

Storage and Transit Time Data and Code

Table of rcprd functions.

Kickastarter Campaigns

Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2...

Replication Data for: Virtual Reality Embodiment Experiences Reduce...

Hazardous Times

dailyActivity_altered

National Survey on Drug Use and Health (NSDUH)

Supplementary materials for: \"Comparing Internet experiences and...

Grid Garage ArcGIS Toolbox

Pre-processed PubMed data for a study of coauthorship

Serial Non-contrast Non-gated T2w MRI Datasets of Patient-derived Xenograft...

PDX Model Characterizations and Biweekly imaging sessions

Data from: Code and Data Schimmelradar manuscript 1.1

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispeciesSee More Versions

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies