Facebook
Twitterhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `
Facebook
TwitterTo make this a seamless process, I cleaned the data and delete many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both lyft and uber but it is still a cleaned version from the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:
df<-read.csv('uber.csv')
df_black<-subset(uber_df, uber_df$name == 'Black')
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Facebook
TwitterThis data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in Buck Creek watershed near Inlet, New York 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations implemented through the Exploration and Graphics for River Trends R package (EGRET) to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021 where WY2001 is the period from October 1, 2000 to September 30, 2001. This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files (“Daily.csv”, “INFO.csv”, “Sample_doc.csv”, and “Sample_nitrate.csv”) that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script (“Buck Creek WRTDS R script.R”) uses the four input datasets and functions from the EGRET R package to generate estimations of flow normalized concentrations. The output file (“WRTDS_results.csv”) contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).
We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), sfaR (version 1.0.0), stargazer (version 5.2.3), and "xtable" (version 1.8.4) that are available at CRAN. We created the R package "micEconDistRay" that provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. Also this R package is available at CRAN (https://cran.r-project.org/package=micEconDistRay).
This replication package contains the following files and folders:
README This file
MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:
museum: Name of the museum.
type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).
munic: Municipality, in which the museum is located.
yr: Year of the observation.
units: Number of visit sites.
resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).
vis: Number of (physical) visitors.
aarc: Number of articles published (archeology).
ach: Number of articles published (cultural history).
aah: Number of articles published (art history).
anh: Number of articles published (natural history).
exh: Number of temporary exhibitions.
edu: Number of primary school classes on educational visits to the museum.
ev: Number of events other than exhibitions.
ftesc: Scientific labor (full-time equivalents).
ftensc: Non-scientific labor (full-time equivalents).
expProperty: Running and maintenance costs [1,000 DKK].
expCons: Conservation expenditure [1,000 DKK].
ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).
prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.
DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.
make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.
IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance functions with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.
IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance functions, imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.
allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.
IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.
/tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.
/figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. it requires to unzip the `modelsInfo.zip` in a directory with the same name (`modelsInfo`) at the root of the replication package folder. Produces the output to stdout. To redirect in a file fo be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Facebook
TwitterThis child page contains a zipped folder which contains all items necessary to run trend models and produce results published in U.S. Geological Scientific Investigations Report 2021–XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p.]. To run the R-QWTREND program in R 6 files are required and each is included in this child page: prepQWdataV4.txt, runQWmodelV4XXUEP.txt, plotQWtrendV4XXUEP.txt, qwtrend2018v4.exe, salflibc.dll, and StartQWTrendV4.R (Vecchia and Nustad, 2020). The folder contains: six items required to run the R–QWTREND trend analysis tool; a readme.txt file; a flowtrendData.RData file; an allsiteinfo.table.csv file, a folder called "scripts", and a folder called "waterqualitydata". The "scripts" folder contains the scripts that can be used to reproduce the results found in the USGS Scientific Investigations Report referenced above. The "waterqualitydata" folder contains .csv files with the naming convention of site_ions or site_nuts for major ions and nutrients constituents and contains machine readable files with the water-quality data used for the trend analysis at each site. R–QWTREND is a software package for analyzing trends in stream-water quality. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for using R–QWTREND: • Windows 10 operating system • R (version 3.4 or later; 64 bit recommended) • RStudio (version 1.1.456 or later). An accompanying report (Vecchia and Nustad, 2020) serves as the formal documentation for R–QWTREND. Vecchia, A.V., and Nustad, R.A., 2020, Time-series model, statistical methods, and software documentation for R–QWTREND—An R package for analyzing trends in stream-water quality: U.S. Geological Survey Open-File Report 2020–1014, 51 p., https://doi.org/10.3133/ofr20201014 R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
Facebook
TwitterAttribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
\r The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.\r \r \r \r This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.\r \r \r \r References:\r \r Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.\r \r
\r This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).\r \r \r \r A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).\r \r \r \r R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, if it is included in the sensitivity analysis, tied to another parameters, the initial value and range, the transformation, the type of prior distribution with its mean and covariance structure.\r \r \r \r The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).\r \r \r \r Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.\r \r Spreadsheet HUN_GW_Observations.csv has additional information on each observation; the name of the observation, a boolean to indicate to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). Further it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01\r \r \r \r These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py, HUN_GW_mean_BF_hist_SI.csv\r \r \r \r Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv. \r \r \r \r The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.\r \r \r \r The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.\r \r \r \r The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and associated .slurm batch file. The script create and apply an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of the this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.\r \r \r \r Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.\r \r \r \r The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one predictions, different parameter distributions, in which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv\r \r
\r Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.\r \r
\r * Derived From HUN GW Model code v01\r \r * Derived From Hydstra Groundwater Measurement Update - NSW Office of Water, Nov2013\r \r * Derived From Groundwater Economic Elements Hunter NSW 20150520 PersRem v02\r \r * Derived From NSW Office of Water - National Groundwater Information System 20140701\r \r * Derived From Travelling Stock Route Conservation Values\r \r * Derived From HUN GW Model v01\r \r * Derived From NSW Wetlands\r \r * Derived From Climate Change Corridors Coastal North East NSW\r \r * Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only\r \r * Derived From Climate Change Corridors for Nandewar and New England Tablelands\r \r * Derived From National Groundwater Dependent Ecosystems (GDE) Atlas\r \r * Derived From Fauna Corridors for North East NSW\r \r * Derived From R-scripts for uncertainty analysis v01\r \r * Derived From Asset database for the Hunter subregion on 27 August 2015\r \r * Derived From Hunter CMA GDEs (DRAFT DPI pre-release)\r \r * Derived From [Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., Florence Habets, David R. Maidment and Zong-Liang Yang (2011), RAPID applied to the SIM-France model, Hydrological Processes, 25(22), 3412-3425. DOI: 10.1002/hyp.8070.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The hydrographic network of SIM-France, as published in Habets, F., A. Boone, J. L. Champeaux, P. Etchevers, L. Franchistéguy, E. Leblois, E. Ledoux, P. Le Moigne, E. Martin, S. Morel, J. Noilhan, P. Quintana Seguí, F. Rousset-Regimbeau, and P. Viennot (2008), The SAFRAN-ISBA-MODCOU hydrometeorological model applied over France, Journal of Geophysical Research: Atmospheres, 113(D6), DOI: 10.1029/2007JD008548.
The observed flows are from Banque HYDRO, Service Central d'Hydrométéorologie et d'Appui à la Prévision des Inondations. Available at http://www.hydro.eaufrance.fr/index.php.
Outputs from a simulation using SIM-France (Habets et al. 2008). The simulation was run by Florence Habets, and produced 3-hourly time steps from 1995-08-01T00:00+02:00 to 2005-07-31T21:02+00:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.1.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to one study domain:
The river network of SIM-France is made of 24264 river reaches. The temporal range corresponding to this domain is from 1995-08-01T00:00+02:00 to 2005-07-31 T21:00+02:00.
Description of files
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_France.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of the SIM-France river reaches (the IDs). For each river reach, this file specifies: the ID of the reach, the ID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the IDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of ID. The values were computed based on the SIM-France FICVID file. This file was prepared using a Fortran program.
m3_riv_France_1995_2005_ksat_201101_c_zvol_ext.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The time range for this file is from 1995-08-01T00:00+02:00 to 2005/07/31T21:00+02:00. The values were computed using the outputs of SIM-France. This file was prepared using a Fortran program.
kfac_modcou_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, Equation (5) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_modcou_ttra_length.csv. This CSV file contains a second guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same IDs and are sorted similarly to rapid_connect_France.csv. The values were computed based on the following information: ID, size of the side of the grid cell, travel time, and Equation (9) in David et al. (2011).
k_modcou_0.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_a.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_b.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
k_modcou_c.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following information: kfac_modcou_1km_hour.csv and using Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_0.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_a.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Table (2) in David et al. (2011). This file was prepared using a Fortran program.
x_modcou_b.csv. This CSV file contains Muskingum x values
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry. The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxilliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global. The database consists of the following main components, in compressed .rds format: Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions. Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified. X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity. L: the Leontief inverse, computed as (I – A)-1, where A is the matrix of input coefficients derived from Z and x. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value). E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3)., and 4) green water use (in m3). mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns). mr_use: the use table capture the quantities of each commodity (rows) used as an input in each process (columns). A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS Except of X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx. Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R How to cite: To cite FABIO work please refer to this paper: Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554 License: This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at. Known issues: The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, yet, required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For a proper use of the database, I recommend to distribute the balancing item over all other uses proportionally and to do analyses with and without balancing to illustrate uncertainties.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code to reproduce analyses in manuscript entitled: Influence of changing snowfall on seasonal complementarity of hydroelectric and solar power. Submitted to Environmental Research: Infrastructure and Sustainability.
The contents include the following scripts and files, listed below. Scripts are listed in the order needed to reproduce the analysis, though intermediate data products have been saved so it is not necessary to reproduce the initial analytical steps.
R/
eia923_860.R: extracts solar and hydropower production data; requires local download of EIA data.
gridMET_swep.R: downloads and summarises gridmet data; does not require prior local download.
fdr.R: function to calculate the p-value associated with a given false discovery rate as described in the associated manuscript.
combine_data.R combines solar, hydropower, and SWE/P data
analysis.Rmd: primary script in which analyses are conducted
data/
annual_swep.csv: output from gridMET_swep.R with annual SWE/P for each watershed in the study
monthly_hydro.csv: output from eia923_860.R
monthly_solar.csv: output from eia923_860.R
combined_variables.csv: combines variables above in one CSV
watersheds_wbd_ss: shapefiles for watersheds that drain to each dam used in the study, derived as described in the manuscript.
Facebook
TwitterWelcome to my Kickstarter case study! In this project I’m trying to understand what the success’s factors for a Kickstarter campaign are, analyzing an available public dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.
ASK
Different questions will guide my analysis: 1. Is the campaign duration influencing the success of the project? 2. Is it the chosen funding budget? 3. Which category of campaign is the most likely to be successful?
PREPARE
I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month and all the data are divided into CSV files. Each table contains: - backers_count : number of people that contributed to the campaign - blurb : a captivating text description of the project - category : the label categorizing the campaign (technology, art, etc) - country - created_at : day and time of campaign creation - deadline : day and time of campaign max end - goal : amount to be collected - launched_at : date and time of campaign launch - name : name of campaign - pledged : amount of money collected - state : success or failure of the campaign
Each month scraping produce a huge amount of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I’ve downloaded zipped files which once unzipped contained respectively: 7 CSVs (November 2023), 8 CSVs (December 2023), 8 CSVs (January 2024). Each month was divided into a specific folder.
Having a first look at the spreadsheets, it’s clear that there is some need for cleaning and modification: for example, dates and times are shown in Unix code, there are multiple columns that are not helpful for the scope of my analysis, currencies need to be uniformed (some are US$, some GB£, etc). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.
PROCESS
I decided to use R to clean and process the data. For each month I started setting a new working environment in its own folder. After loading the necessary libraries:
R
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)
I scripted a general R code that searches for CSVs files in the folder, open them as separate variable and into a single data frame:
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- list()
for (file in csv_files) {
variable_name <- sub("\\.csv$", "", file)
assign(variable_name, read.csv(file))
data_frames[[variable_name]] <- get(variable_name)
}
Next, I converted some columns in numeric values because I was running into types error when trying to merge all the CSVs into a single comprehensive file.
data_frames <- lapply(data_frames, function(df) {
df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
return(df)
})
data_frames <- lapply(data_frames, function(df) {
df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
return(df)
})
data_frames <- lapply(data_frames, function(df) {
df$usd_pledged <- as.numeric(df$usd_pledged)
return(df)
})
In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):
all_nov_2023 = bind_rows(data_frames)
all_dec_2023 = bind_rows(data_frames)
all_jan_2024 = bind_rows(data_frames)`
After merging I converted the UNIX code datestamp into a readable datetime for the columns “created”, “launched”, “deadline” and deleted all the columns that had these data set to 0. I also filtered the values into the “slug” columns to show only the category of the campaign, without unnecessary information for the scope of my analysis. The final table was then saved.
filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>%
filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%
mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>%
mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%
mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
select(-category, -deadline, -launched_at, -created_at) %>%
relocate(created, launched, setted_deadline, .before = goal)
write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We investigate apparent-time tone variation in the Black Lahu language (Loloish/Ngwi, Tibeto-Burman) of Yunnan, China. These are the supplementary materials for the paper "Generational differences in the low tones of Black Lahu," accepted for publication in Linguistics Vanguard.
Appendices:
Script files contained in the analysis:
Data files contained in this analysis:
Facebook
TwitterA machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files and about the MLFLOW model is in the readme included in the zipped model archive folder.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corresponding peer-reviewed publication
This dataset corresponds to all the RAPID input and output files that were used in the study reported in:
David, Cédric H., David R. Maidment, Guo-Yue Niu, Zong-Liang Yang, Florence Habets and Victor Eijkhout (2011), River Network Routing on the NHDPlus Dataset, Journal of Hydrometeorology, 12(5), 913-934. DOI: 10.1175/2011JHM1345.1.
When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.
Time format
The times reported in this description all follow the ISO 8601 format. For example 2000-01-01T16:00-06:00 represents 4:00 PM (16:00) on Jan 1st 2000 (2000-01-01), Central Standard Time (-06:00). Additionally, when time ranges with inner time steps are reported, the first time corresponds to the beginning of the first time step, and the second time corresponds to the end of the last time step. For example, the 3-hourly time range from 2000-01-01T03:00+00:00 to 2000-01-01T09:00+00:00 contains two 3-hourly time steps. The first one starts at 3:00 AM and finishes at 6:00AM on Jan 1st 2000, Universal Time; the second one starts at 6:00 AM and finishes at 9:00AM on Jan 1st 2000, Universal Time.
Data sources
The following sources were used to produce files in this dataset:
The National Hydrography Dataset Plus (NHDPlus) Version 1, obtained from http://www.horizon-systems.com/nhdplus.
The National Water Information System (NWIS), obtained from http://waterdata.usgs.gov/nwis.
Outputs from a simulation using the community Noah land surface model with multiparameterization options (Noah-MP, Niu et al. 2011, http://www.jsg.utexas.edu/noah-mp). The simulation was run by Guo-Yue Niu, and produced 3-hourly time steps from 2004-01-01T00:00+00:00 to 2008-01-01T00:00+00:00. Further details on the inputs and options used for this simulation are provided in David et al. (2011).
Software
The following software were used to produce files in this dataset:
The Routing Application for Parallel computation of Discharge (RAPID, David et al. 2011, http://rapid-hub.org), Version 1.0.0. Further details on the inputs and options used for this series of simulations are provided below and in David et al. (2011).
ESRI ArcGIS (http://www.arcgis.com).
Microsoft Excel (https://products.office.com/en-us/excel).
CUAHSI HydroGET (http://his.cuahsi.org/hydroget.html).
The GNU Compiler Collection (https://gcc.gnu.org) and the Intel compilers (https://software.intel.com/en-us/intel-compilers).
Study domain
The files in this dataset correspond to two study domains:
The combination of the San Antonio and Guadalupe River Basins, TX. RAPID can only use the river reaches of NHDPlus that have a known flow direction and focus is made on these reaches here (a total of 5,175). The temporal range corresponding to this domain is from 2004-01-01T00:00-06:00 to 2007-12-31 T00:00-06:00.
The Upper Mississippi River Basin. RAPID can only use the river reaches of NHDPlus that have a known flow direction and focus is made on these reaches here (a total of 182,240). The temporal range corresponding to this domain spans 100 fictitious days.
Description of files for the San Antonio and Guadalupe River Basins
All files below were prepared by Cédric H. David, using the data sources and software mentioned above.
rapid_connect_San_Guad.csv. This CSV file contains the river network connectivity information and is based on the unique IDs of NHDPlus reaches (the COMIDs). For each river reach, this file specifies: the COMID of the reach, the COMID of the unique downstream reach, the number of upstream reaches with a maximum of four reaches, and the COMIDs of all upstream reaches. A value of zero is used in place of NoData. The river reaches are sorted in increasing value of COMID. The values were computed using a combination of the following NHDPlus fields: COMID, DIVERGENCE, FROMNODE and TONODE. This file was prepared using ArcGIS and Excel.
m3_riv_San_Guad_2004_2007_cst.nc. This netCDF file contains the 3-hourly accumulated inflows of water (in cubic meters) from surface and subsurface runoff into the upstream point of each river reach. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007/12/31T18:00-06:00. The values were computed by superimposing a 900-m gridded map of NHDPlus catchments to the outputs of Noah-MP. This file was prepared using ArcGIS and a Fortran program.
kfac_San_Guad_1km_hour.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using a wave celerity of 1 km/h. This file was prepared using a Fortran program.
kfac_San_Guad_celerity.csv. This CSV file contains a first guess of Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, Equation (13) in David et al. (2011), and using the wave celerity numbers of Table 2 in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_1.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_2.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_3.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
k_San_Guad_2004_4.csv. This CSV file contains Muskingum k values (in seconds) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on the following NHDPlus fields: COMID, LENGTHKM, and using Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_1.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (17) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_2.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (18) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_3.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (19) in David et al. (2011). This file was prepared using a Fortran program.
x_San_Guad_2004_4.csv. This CSV file contains Muskingum x values (dimensionless) for all river reaches. The river reaches have the same COMIDs and are sorted similarly to rapid_connect_San_Guad.csv. The values were computed based on Equation (21) in David et al. (2011). This file was prepared using a Fortran program.
basin_id_San_Guad_hydroseq.csv. This CSV file contains the list of unique IDs of NHDPlus river reaches (COMID) in the San Antonio and Guadalupe River Basins. The river reaches are sorted from upstream to downstream. The values were computed using the following NHDPlus fields: COMID and HYDROSEQ. This file was prepared using Excel.
Qout_San_Guad_1460days_p1_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31-00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (17) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p2_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream point of each reach. The river reaches have the same COMIDs and are sorted similarly to basin_id_San_Guad_hydroseq.csv. The time range for this file is from 2004-01-01T00:00-06:00 to 2007-12-31-00:00-06:00. The values were computed using the Muskingum method with parameters of Equation (18) in David et al. (2011). This file was prepared using RAPID v1.0.0 running with the preonly ILU solver on one core.
Qout_San_Guad_1460days_p3_dtR=900s.nc. This netCDF file contains the 3-hourly averaged outputs (in cubic meters per second) from RAPID corresponding to the downstream
Facebook
TwitterU.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to ...
Facebook
TwitterThe dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.
References:
Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.
This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).
A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).
R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, if it is included in the sensitivity analysis, tied to another parameters, the initial value and range, the transformation, the type of prior distribution with its mean and covariance structure.
The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).
Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.
Spreadsheet HUN_GW_Observations.csv has additional information on each observation; the name of the observation, a boolean to indicate to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). Further it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01
These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py, HUN_GW_mean_BF_hist_SI.csv
Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.
The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.
The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.
The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and associated .slurm batch file. The script create and apply an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of the this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.
Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.
The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one predictions, different parameter distributions, in which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv
Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 09 October 2018, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.
Derived From HUN GW Model code v01
Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From Travelling Stock Route Conservation Values
Derived From HUN GW Model v01
Derived From NSW Wetlands
Derived From Climate Change Corridors Coastal North East NSW
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Climate Change Corridors for Nandewar and New England Tablelands
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From R-scripts for uncertainty analysis v01
Derived From Asset database for the Hunter subregion on 27 August 2015
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter 2004
Derived From Hunter CMA GDEs (DRAFT DPI pre-release)
Derived From Camerons Gorge Grassy White Box Endangered Ecological Community (EEC) 2008
Derived From Atlas of Living Australia NSW ALA Portal 20140613
Derived From [Spatial Threatened Species and Communities (TESC) NSW
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the raw data for the study:
Characterizing Intraspecific Resource Utilization in an Aquatic Consumer Using High-Throughput Phenotyping
Data are provided separately for the first experiment (numerical response experiment with 16 rotifer clones across six food concentrations) and the second experiment (growth rate measurements with 98 rotifer clones across two food concentrations).
Contents of first_experiment.zip
input/ This folder contains raw count data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'first_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis of this dataset on a High Performance Computing cluster
first_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the first experiment
numerical_response_2par.R A function required by 'first_experiment_analysis.Rmd'
Contents of second_experiment.zip
input/ This folder contains raw count and behavioral data (output of the Wellcounter software):
popgrowth_
output/ output files produced by the R-script 'second_experiment_analysis.Rmd'
wellcounter/ contains the Wellcounter software (programs and configuration files) that were used for running the raw analysis (image and motion analysis) of this dataset on a High Performance Computing cluster
second_experiment_prep_run1.Rmd R-Markdown file for preprocessing the data from run1
second_experiment_prep_run2.Rmd R-Markdown file for preprocessing the data from run2
second_experiment_analysis.Rmd R-Markdown file with data processing and statistical analysis of the second experiment
extract_fixed_effects_table.R A function required by 'second_experiment_analysis.Rmd'
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The following seven zip files are compressed folders containing the input datasets/trees, main output files and the scripts of the related analyses performed in this study. I. ancestral_microhabitat_reconstruction.zip: contains four files, including two input files (microhabitats.csv, timetree.tre) and a script (simmap_microhabitat.R) for ancestral states reconstruction of microhabitat by make.simmap implemented in the R package phytools v1.5, as well as the main output file (ancestral_microhabitats.csv). 1. ancestral_microhabitats.csv: reconstructed ancestral microhabitats for each node. 2. microhabitats.csv: microhabitats of the studies species. 3. simmap_microhabitat.R: the R script of make.simmap for ancestral microhabitat reconstruction 4. timetree.tre: dated tree used for ancestral state reconstruction for microhabitat and morphological characters II. ancestral_morphology_reconstruction.zip: contains six files, including an input file (morphology.csv) and a script (simmap_morphology.R) for ancestral states reconstruction of morphology by make.simmap implemented in the R package phytools v1.5, as well as four main output files(forewing_ancestral_state.csv, frontal_sutures_ancestral_state.csv, hind_wing_ancestral_state.csv, ocellus_ancestral_state.csv). 1. forewing_ancestral_state.csv: reconstructed ancestral states of the development of the forewing for each node. 2. frontal_sutures_ancestral_state.csv: reconstructed ancestral states of the development of frontal sutures for each node. 3. hind_wing_ancestral_state.csv: reconstructed ancestral states of the development of the hind wing for each node. 4. morphology.csv: the states of the development of ocellus, forewing, hing wing and frontal sutures for each studies species. 5. ocellus_ancestral_state.csv: reconstructed ancestral states of the development of the ocellus for each node. 6. simmap_morphology.R: the R script of make.simmap for ancestral state reconstruction of morphology III. biogeographic_reconstruction.zip: contains four files, including three input files (dispersal_probablity.txt, distributions.csv, timetree_noOutgroup.tre) used for a stratified biogeographic analysis by BioGeoBEARS in RASP v4.2 and the main output file (DIVELIKE_result.txt). 1. dispersal_probablity.txt: relative dispersal probabilities among biogeographical regions at different geological epochs. 2. distributions.csv: current distributions of the studied species. 3. DIVELIKE_result.txt: BioGeoBEARS result of ancestral areas based on the DIVELIKE model. 4. timetree_noOutgroup.tre: the dated tree with the outgroup lineage (Eurymelinae) excluded. IV. coalescent_analysis.zip: contains a folder and two files, including a folder (individual_gene_alignment) of input files used to construct gene trees, an input file (MLtree_BS70.tre) used for the multi-species coalescent analysis by ASTRAL v 4.10.5 and the main output file (coalescent_species_tree.tre). 1. coalescent_species_tree.tre: the species tree generated by the multi-species coalescent analysis with the quartet support, effective number of genes and the local posterior probability indicated. 2. individual_gene_alignment: a folder containing 427 FASTA files, each one represents the nucleotide alignment for a gene. Hyphens are used to represent gaps. These files were used to construct gene trees using IQ-TREE v1.6.12. 3. MLtree_BS70.tre: 165 gene trees with the average SH-aLRT and ultrafast bootstrap values of ≥ 70%. This file was used to estimate the species tree by ASTRAL v 4.10.5. V. divergence_time_estimation.zip: contains five files, including two input files (treefile_rooted_noBranchLength.tre, treefile_rooted.tre) and two control files (baseml.ctl, mcmctree.ctl) used for divergence time estimation by BASEML and MCMCTREE in PAML v4.9, as well as the main output file (timetree_with95%HPD.tre). 1. baseml.ctl: the control file used for the estimation of substitution rates by BASEML in PAML v4.9. 2. mcmctree.ctl: the control file used for the estimation of divergence times by MCMCTREE in PAML v4.9. 3. timetree_with95%HPD.tre: dated tree with the 95% highest posterior density confidence intervals indicated. 4. treefile_rooted_noBranchLength.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with calibrations for the crown and internal nodes. Branch length and support values were not indicated. 5. treefile_rooted.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with a secondary calibration on the root age. Branch support values were not indicated. VI. maximum_likelihood_analysis_aa.zip: contains three files, including two input files (concatenated_aa_partition.nex, concatenated_aa.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_aa.tre). 1. concatenated_aa_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_aa.phy. This file partitions the 52,024 amino acid positions into 427 character sets. 2. concatenated_aa.phy: a concatenated amino acid dataset with 52,024 amino acid positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis. 3. MLtree_aa.tre: the maximum likelihood tree based on the concatenated amino acid dataset, with SH-aLRT values and ultrafast bootstrap values indicated. VII. maximum_likelihood_analysis_nt.zip: contains three files, including two input files (concatenated_nt_partition.nex, concatenated_nt.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_nt.tre). 1. concatenated_nt_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_nt.phy. This file partitions the 156,072 nucleotide positions into 427 character sets. 2. concatenated_nt.phy: a concatenated nucleotide dataset with 156,072 nucleotide positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis as well as divergence time estimation. 3. MLtree_nt.tre: the maximum likelihood tree based on the concatenated nucleotide dataset, with SH-aLRT values and ultrafast bootstrap values indicated. VIII. Taxon_sampling.csv: contains the sample IDs (1st column) which were used in the alignments and the taxonomic information (2nd to 6th columns).
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.
Facebook
Twitterhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `