31 datasets found
  1. Data for: Climate impacts and adaptation in US dairy systems 1981–2018

    • agdatacommons.nal.usda.gov
    Updated May 30, 2025
    Cite
    Maria Gisbert-Queral; Nathan Mueller (2025). Data for: Climate impacts and adaptation in US dairy systems 1981–2018 [Dataset]. http://doi.org/10.5281/zenodo.4818011
    Dataset provided by
    Zenodo
    Authors
    Maria Gisbert-Queral; Nathan Mueller
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    United States
    Description

    Data is archived here: https://doi.org/10.5281/zenodo.4818011

    This data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018", authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations, but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.

    This data and code archive contains the following files and folders:

    * README: text file with this description
    * flowchart.pdf: PDF file with a flow chart that illustrates how the R scripts transform the raw data files into files that contain generated data sets and intermediate results and, finally, into the tables and figures that are presented in the paper
    * runAll.sh: (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell, such as most computers with MacOS, GNU/Linux, or Unix operating systems)
    * Folder "DataRaw": folder for raw data files, containing:
      - DataRaw/COWS.xlsx: MS-Excel file with the number of cows per county. Source: USDA NASS Quickstats. Observations: all available counties and years from 2002 to 2012
      - DataRaw/milk_state.xlsx: MS-Excel file with average monthly milk yields per cow. Source: USDA NASS Quickstats. Observations: all available states from 1981 to 2018
      - DataRaw/TMAX.csv: CSV file with daily maximum temperatures. Source: PRISM Climate Group (spatially averaged). Observations: all counties from 1981 to 2018
      - DataRaw/VPD.csv: CSV file with daily maximum vapor pressure deficits. Source: PRISM Climate Group (spatially averaged). Observations: all counties from 1981 to 2018
      - DataRaw/countynamesandID.csv: CSV file with county names, state FIPS codes, and county FIPS codes. Source: US Census Bureau. Observations: all counties
      - DataRaw/statecentroids.csv: CSV file with latitudes and longitudes of state centroids. Source: generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function. Observations: all states
    * Folder "DataGenerated": folder for data sets that are generated by the R scripts in this archive. In order to reproduce the entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
    * Folder "Results": folder for intermediate results that are generated by the R scripts in this archive. In order to reproduce the entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
    * Folder "Figures": folder for the figures that are generated by the R scripts and presented in the paper. In order to reproduce the entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate the analysis can more easily compare the figures that they get with the figures presented in the paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.
    * Folder "Tables": folder for the tables that are generated by the R scripts and presented in the paper. In order to reproduce the entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate the analysis can more easily compare the tables that they get with the tables presented in the paper.
    * Folder "logFiles": the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate the analysis can more easily compare the R output that they get with the R output that we got.
    * PrepareCowsData.R: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses
    * PrepareWeatherData.R: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses
    * PrepareMilkData.R: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses
    * CalcFrequenciesTHI_Temp.R: R script that calculates the frequencies of days in the different THI bins and the different temperature bins in each month for each state
    * CalcAvgTHI.R: R script that calculates the average THI in each state
    * PreparePanelTHI.R: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins
    * PreparePanelTemp.R: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins
    * PreparePanelFinal.R: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in the statistical analyses
    * EstimateTrendsTHI.R: R script that estimates the trends of the frequencies of the different THI bins within the sampling period for each state in the data set
    * EstimateModels.R: R script that estimates all model specifications that are used for generating the results presented in the paper or for comparing or testing different model specifications
    * CalcCoefStateYear.R: R script that calculates the effects of each THI bin on milk yield for all combinations of states and years based on the 'final' model specification
    * SearchWeightMonths.R: R script that estimates the 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term
    * TestModelSpec.R: R script that applies Wald tests and likelihood-ratio tests to compare different model specifications and creates Table S10
    * CreateFigure1a.R: R script that creates subfigure a of Figure 1
    * CreateFigure1b.R: R script that creates subfigure b of Figure 1
    * CreateFigure2a.R: R script that creates subfigure a of Figure 2
    * CreateFigure2b.R: R script that creates subfigure b of Figure 2
    * CreateFigure2c.R: R script that creates subfigure c of Figure 2
    * CreateFigure3.R: R script that creates the subfigures of Figure 3
    * CreateFigure4.R: R script that creates the subfigures of Figure 4
    * CreateFigure5_TableS6.R: R script that creates the subfigures of Figure 5 and Table S6
    * CreateFigureS1.R: R script that creates Figure S1
    * CreateFigureS2.R: R script that creates Figure S2
    * CreateTableS2_S3_S7.R: R script that creates Tables S2, S3, and S7
    * CreateTableS4_S5.R: R script that creates Tables S4 and S5
    * CreateTableS8.R: R script that creates Table S8
    * CreateTableS9.R: R script that creates Table S9
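
    For computers without a "bash" shell (e.g., Windows), the scripts can also be run from within R. The sketch below is not part of the archive and only mirrors what runAll.sh does; the exact execution order should be taken from flowchart.pdf, and the figure/table scripts can be appended in the same way.

      ## minimal sketch (not part of the archive): run the preparation and
      ## estimation scripts in the order in which they are listed above,
      ## writing each script's output to logFiles/ as runAll.sh does
      scripts <- c(
        "PrepareCowsData.R", "PrepareWeatherData.R", "PrepareMilkData.R",
        "CalcFrequenciesTHI_Temp.R", "CalcAvgTHI.R",
        "PreparePanelTHI.R", "PreparePanelTemp.R", "PreparePanelFinal.R",
        "EstimateTrendsTHI.R", "EstimateModels.R", "CalcCoefStateYear.R",
        "SearchWeightMonths.R", "TestModelSpec.R"
      )
      dir.create("logFiles", showWarnings = FALSE)
      for (s in scripts) {
        message("Running ", s)
        sink(file.path("logFiles", sub("\\.R$", ".log", s)), split = TRUE)
        source(s, echo = TRUE)
        sink()
      }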

  2. Data from: [Dataset:] Data from Tree Censuses and Inventories in Panama

    • search.dataone.org
    Updated Aug 16, 2024
    Cite
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao (2024). [Dataset:] Data from Tree Censuses and Inventories in Panama [Dataset]. https://search.dataone.org/view/urn%3Auuid%3A07030ed9-e51f-4ffa-a4b5-921392681123
    Dataset provided by
    Smithsonian Research Data Repository
    Authors
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao
    Description

    Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species, then stem diameter measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. Precise location of every site, elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Perez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.

    Description:

    * marenaRecent.full.Rdata5Jan2013.zip: a zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (see component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.
    * marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.full (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed.
    * marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena2cns.full3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.full (component 2): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed.
    * marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.full (component 3): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed.
    * marenaRecent.stem.Rdata5Jan2013.zip: a zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of these tables is given in RoutputFull.pdf (see component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed and individual stems are desired.
    * marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena3cns.stem2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.stem (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed and individual stems are desired.
    * marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.stem (component 6): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed and individual stems are desired.
    * marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These six plots are a subset of the nine found in marena3cns.stem (component 7): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed and individual stems are desired.
    * bci.spptable.rdata: a list of the 1414 species found across all tree plots and inventories i... Visit https://dataone.org/datasets/urn%3Auuid%3A07030ed9-e51f-4ffa-a4b5-921392681123 for complete metadata about this dataset.
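
    As a quick orientation for the R Analytical Tables, the following sketch (not part of the archive) loads the latest-census 'full' table and counts records per plot; it assumes the object stored in 'marenaRecent.full1.rdata' is named after the file, so check the name that load() returns.

      ## minimal sketch: one record per individual tree, so tabulating the
      ## 'plot' column (described above) gives tree counts per census plot
      unzip("marenaRecent.full.Rdata5Jan2013.zip")
      objName <- load("marenaRecent.full1.rdata")   # load() returns the object name(s)
      census <- get(objName[1])
      head(sort(table(census$plot), decreasing = TRUE))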

  3. Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Henningsen, Arne (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
    Dataset provided by
    Henningsen, Arne
    Price, Juan José
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

    We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4) that are available on CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available on CRAN (https://cran.r-project.org/package=micEconDistRay).

    This replication package contains the following files and folders:

    README This file

    MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

    museum: Name of the museum.

    type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

    munic: Municipality in which the museum is located.

    yr: Year of the observation.

    units: Number of visit sites.

    resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

    vis: Number of (physical) visitors.

    aarc: Number of articles published (archeology).

    ach: Number of articles published (cultural history).

    aah: Number of articles published (art history).

    anh: Number of articles published (natural history).

    exh: Number of temporary exhibitions.

    edu: Number of primary school classes on educational visits to the museum.

    ev: Number of events other than exhibitions.

    ftesc: Scientific labor (full-time equivalents).

    ftensc: Non-scientific labor (full-time equivalents).

    expProperty: Running and maintenance costs [1,000 DKK].

    expCons: Conservation expenditure [1,000 DKK].

    ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

    prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.

    DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

    make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.

    IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

    IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.

    allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

    IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

    /tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

    /figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
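
    As a hedged starting point (not part of the replication package itself), the workflow above can be set up roughly as follows; the package micEconDistRay is on CRAN as stated, while the working-directory layout is an assumption based on the file list above.

      ## minimal sketch: install the accompanying package and run the data
      ## preparation step before the estimation scripts
      install.packages("micEconDistRay")  # ray-based input distance function tools
      library(micEconDistRay)
      source("prepare_data.R")            # MuseumsDk.csv -> DataPrepared.csv
      dat <- read.csv("DataPrepared.csv")
      str(dat)
      ## the estimations can then be run with, e.g.:
      ## source("IO_Ray.R")               # 'optimal' ordering of outputs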

  4. Data from: Data and code from: Stem borer herbivory dependent on...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-stem-borer-herbivory-dependent-on-interactions-of-sugarcane-variety-ass-1e076
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843

    Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.

    Notebook files
    - 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
    - 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.

    HTML files
    These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
    - 01_boring_analysis.html
    - 02_trait_covariate_analysis.html

    CSV data files
    These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.
    - BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
      - Columns A-C are the year, date, and location. All location values are the same.
      - Column D identifies which experiment the data point was collected from.
      - Column E, Stubble, indicates the crop year (plant cane or first stubble).
      - Column F indicates the variety.
      - Column G indicates the plot (integer ID).
      - Column H indicates the stalk within each plot (integer ID).
      - Column I, # Internodes, indicates how many internodes were on the stalk.
      - Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
      - Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
      - Column AO contains notes.
    - variety_lookup.csv: summary information for the 16 varieties analyzed in this study
      - Column A is the variety name.
      - Column B is the total number of stalks assessed for SCB damage for that variety across all years.
      - Column C is the number of years that variety is present in the data.
      - Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
      - Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
      - Column F is the literature reference for the SCB resistance value.
    - Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
      - Column A is the variety name.
      - Column B is the SCB resistance designation as an integer.
      - Column C is the categorical SCB resistance designation (see above).
      - Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
      - Columns J-O are the same continuous traits from year 2 (first stubble).
      - Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

    ZIP file of intermediate R objects
    To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.
    - intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output, including figures and tables, without having to refit the computationally intensive statistical models.
      - binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
      - binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
      - marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
      - marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
      - marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content

    Resources in this dataset:
    - Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
    - Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
    - Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
    - Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
    - Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
    - Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
    - Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
    - Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
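
    A minimal sketch of how the pieces fit together (file paths follow the description above; the use of readRDS/brms here is an assumption based on the stated file types, not code taken from the notebooks):

      library(brms)   # the .rds files are described as fitted brms model objects
      ## raw data, expected at project/data/ relative to where the notebooks run
      dat <- read.csv("project/data/BoredInternodes_26April2022_no format.csv",
                      check.names = FALSE)
      ## intermediate objects, expected at project/ after unzipping the archive
      fit_main <- readRDS("project/binom_fit_intxns_updated_only5yrs.rds")
      summary(fit_main)
      load("project/marginal_trends.RData")   # marginal trends for year and prior damage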

  5. Replication Package - How Do Requirements Evolve During Elicitation? An...

    • zenodo.org
    Updated Apr 21, 2022
    Cite
    Alessio Ferrari; Paola Spoletini; Sourav Debnath (2022). Replication Package - How Do Requirements Evolve During Elicitation? An Empirical Study Combining Interviews and App Store Analysis [Dataset]. http://doi.org/10.5281/zenodo.6472498
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Ferrari; Paola Spoletini; Sourav Debnath
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper titled "How Do Requirements Evolve During Elicitation? An Empirical Study Combining Interviews and App Store Analysis", by Alessio Ferrari, Paola Spoletini and Sourav Debnath.

    The package contains the following folders and files.

    /R-analysis

    This is a folder containing all the R implementations of the statistical tests included in the paper, together with the source .csv file used to produce the results. Each R file has the same title as the associated .csv file. The titles of the files reflect the RQs as they appear in the paper. The association between R files and Tables in the paper is as follows:

    - RQ1-1-analyse-story-rates.R: Table 1, user story rates

    - RQ1-1-analyse-role-rates.R: Table 1, role rates

    - RQ1-2-analyse-story-category-phase-1.R: Table 3, user story category rates in phase 1 compared to original rates

    - RQ1-2-analyse-role-category-phase-1.R: Table 5, role category rates in phase 1 compared to original rates

    - RQ2.1-analysis-app-store-rates-phase-2.R: Table 8, user story and role rates in phase 2

    - RQ2.2-analysis-percent-three-CAT-groups-ph1-ph2.R: Table 9, comparison of the categories of user stories in phase 1 and 2

    - RQ2.2-analysis-percent-two-CAT-roles-ph1-ph2.R: Table 10, comparison of the categories of roles in phase 1 and 2.

    The .csv files used for statistical tests are also used to produce boxplots. The association between boxplot figures and files is as follows.

    - RQ1-1-story-rates.csv: Figure 4

    - RQ1-1-role-rates.csv: Figure 5

    - RQ1-2-categories-phase-1.csv: Figure 8

    - RQ1-2-role-category-phase-1.csv: Figure 9

    - RQ2-1-user-story-and-roles-phase-2.csv: Figure 13

    - RQ2.2-percent-three-CAT-groups-ph1-ph2.csv: Figure 14

    - RQ2.2-percent-two-CAT-roles-ph1-ph2.csv: Figure 17

    - IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv: Figure 15

    - IMG-only-RQ2.2-frequent-roles.csv: Figure 18

    NOTE: The last two .csv files do not have associated statistical tests, but are used solely to produce boxplots.
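
    The boxplots themselves were produced with the BoxPlotR web tool noted under /Figures below, but the same .csv files can also be inspected locally. A minimal sketch (the column layout of the file is an assumption, so check str() first):

      rates <- read.csv("R-analysis/RQ1-1-story-rates.csv")
      str(rates)                                  # confirm which columns hold the rates
      boxplot(Filter(is.numeric, rates), las = 2,
              main = "User story rates (cf. Figure 4)")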

    /Data-Analysis

    This folder contains all the data used to answer the research questions.

    RQ1.xlsx: includes all the data associated to RQ1 subquestions, two tabs for each subquestion (one for user stories and one for roles). The names of the tabs are self-explanatory of their content.

    RQ2.1.xlsx: includes all the data for the RQ2.1 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: for each category of user story, and for each analyst, there are two lines. The first one reports the number of user stories in that category for phase 1, and the second one reports the number of user stories in that category for phase 2, considering the specific analyst.

    - Data Source-role: for each category of role, and for each analyst, there are two lines. The first one reports the number of user stories in that role for phase 1, and the second one reports the number of user stories in that role for phase 2, considering the specific analyst.

    - RQ2.1 rates: reports the final rates for RQ2.1.

    NOTE: The other tabs are used to support the computation of the final rates.

    RQ2.2.xlsx: includes all the data for the RQ2.2 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: same as RQ2.1.xlsx

    - Data Source-role: same as RQ2.1.xlsx

    - RQ2.2-category-group: comparison between groups of categories in the different phases, used to produce Figure 14

    - RQ2.2-role-group: comparison between role groups in the different phases, used to produce Figure 17

    - RQ2.2-specific-roles-diff: difference between specific roles, used to produce Figure 18

    NOTE: the other tabs are used to support the computation of the values reported in the tabs above.

    RQ2.2-single-US-category.xlsx: includes the data for the RQ2.2 subquestion associated with single categories of user stories. A separate tab is used given the complexity of the computations.

    - Data Source-US-category: same as RQ2.1.xlsx

    - Totals: total number of user stories for each analyst in phase 1 and phase 2

    - Results-Rate-Comparison: difference between rates of user stories in phase 1 and phase 2, used to produce the file "img/IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv", which is in turn used to produce Figure 15

    - Results-Analysts: number of analysts using each novel category produced in phase 2, used to produce Figure 16.

    NOTE: the other tabs are used to support the computation of the values reported in the tabs above.

    RQ2.3.xlsx: includes the data for the RQ2.3 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: same as RQ2.1.xlsx

    - Data Source-role: same as RQ2.1.xlsx

    - RQ2.3-categories: novel categories produced in phase 2, used to produce Figure 19

    - RQ2-3-most-frequent-categories: most frequent novel categories

    /Raw-Data-Phase-I

    The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx, plus the file of the original user stories with annotations (original-us.xlsx). Each file contains two tabs:

    - Evaluation: includes the annotation of the user stories as existing user story in the original categories (annotated with "E"), novel user story in a certain category (refinement, annotated with "N"), and novel user story in novel category (Name of the category in column "New Feature"). **NOTE 1:** It should be noticed that in the paper the case "refinement" is said to be annotated with "R" (instead of "N", as in the files) to make the paper clearer and easy to read.

    - Roles: roles used in the user stories, and count of the user stories belonging to a certain role.

    /Raw-Data-Phaes-II

    The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx. Each file contains two tabs:

    - Analysis: includes the annotation of the user stories as belonging to an existing original category (X), or to categories introduced after interviews, or to categories introduced after app store inspired elicitation (name of category in "Cat. Created in PH1"), or to entirely novel categories (name of category in "New Category").

    - Roles: roles used in the user stories, and count of the user stories belonging to a certain role.

    /Figures

    This folder includes the figures reported in the paper. The boxplots are generated from the data using the tool http://shiny.chemgrid.org/boxplotr/. The histograms and other plots are produced with Excel, and are also reported in the Excel files listed above.

  6. Consumer Expenditure Survey (CE)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Consumer Expenditure Survey (CE) [Dataset]. http://doi.org/10.7910/DVN/UTNJAH
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the consumer expenditure survey (ce) with r

    the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray! for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now.

    fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.

    this new github repository contains three scripts:

    2010-2011 - download all microdata.R
    - loop through every year and download every file hosted on the bls's ce ftp site
    - import each of the comma-separated value files into r with read.csv
    - depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

    2011 fmly intrvw - analysis examples.R
    - load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
    - construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
    - initiate a replicate-weighted survey design object
    - perform some lovely li'l analysis examples
    - replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
    - replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
    - create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
    - initiate a replicate-weighted, database-backed, multiply-imputed survey design object
    - perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
    - replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
    - replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
    - replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

    replicate integrated mean and se.R
    - match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
    - create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
    - export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

    click here to view these three scripts for...
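
    for a sense of the key step in the analysis script, here is a minimal sketch (not one of the three scripts above) of initiating a replicate-weighted survey design from a stacked 'fmly' table; the variable names (finlwt21, wtrep01-wtrep44, totexppq) and the file path are assumptions about the ce interview files, so check the getting started guide before relying on them.

      library(survey)
      load("./2011/intrvw/fmly.rda")          # assumed object 'fmly' created by the download script
      fmly.design <-
        svrepdesign(
          weights = ~finlwt21,                # full-sample weight (assumed name)
          repweights = "wtrep[0-9]+",         # the 44 replicate weights (assumed names)
          data = fmly,
          type = "BRR",
          combined.weights = TRUE,
          mse = TRUE
        )
      svymean(~totexppq, fmly.design, na.rm = TRUE)   # average total quarterly expenditure (assumed name)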

  7. Training effect of feedforward neural networks with different inputs and 2pF...

    • scidb.cn
    Updated Feb 11, 2023
    Cite
    shang tian shuai (2023). Training effect of feedforward neural networks with different inputs and 2pF parameter prediction results for (near) stable nuclei [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00010
    Dataset provided by
    Science Data Bank
    Authors
    shang tian shuai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains five Unicode Origin Graph (.opju) files, which can be opened with the Origin 2021 software. Details of the five data files are as follows:

    1. Fig.2-data.opju
    This file contains the original information from Figure 2 in the associated article. It contains two workbooks: one with the density distributions and error-band information for the training set, and the other for the validation set. Each workbook contains several data tables (more tables for the training set, fewer for the validation set). Each data table is named after a nucleus in the corresponding data set (training or validation) and holds the density information and error-band information for that nucleus. Each data table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the charge density distribution predicted by FNN-3I; the error-band value of the charge density predicted by FNN-3I; the charge density distribution predicted by FNN-4I; the error-band value of the charge density predicted by FNN-4I; and the charge density distribution obtained from experiment (2pF model). Each row gives the density value or error-band value at the corresponding r. How these data were obtained is described in the associated article.

    2. Fig.3-data.opju
    This file contains the original information from Figure 3 in the associated article. It contains two workbooks: one with the density distribution information for the training set, and the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each data table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the charge density derived from the network with the smallest loss function; the charge density derived from the network with the largest loss function; and the charge density distribution obtained from experiment (2pF model). Each row gives the density value at the corresponding r. How these data were obtained is described in the associated article.

    3. Fig.4-data.opju
    This file contains the original information from Figure 4 in the associated article. It contains two workbooks: one with the density distributions and error-band information for the training set, and the other for the validation set. Each workbook contains several data tables (more for the training set, fewer for the validation set), each named after a nucleus in the corresponding data set. Each data table has six columns and one thousand rows. Columns 1 through 6 are: the radial coordinate r of the nucleus, in fm; the average charge density obtained by the density-averaging method; the error-band value obtained by the density-averaging method; the average charge density obtained by the parameter-averaging method; the error-band value obtained by the parameter-averaging method; and the charge density distribution obtained from experiment (2pF model). Each row gives the density value or error-band value at the corresponding r. How these data were obtained is described in the associated article.

    4. Fig.5 6 7-data.opju
    This file contains the original information from Figures 5, 6, and 7 in the associated article. It consists of two tables: one containing the training results for the training set, and the other containing the prediction results for the prediction set. The training-set table has 15 columns and 86 rows; each row holds the data for one nucleus, and the columns are: proton number; neutron number; the experimental value of parameter c; the experimental value of parameter z; the predicted value of parameter c; the predicted value of parameter z; the experimental value of the charge radius R (2pF model); the predicted value of the charge radius R; the difference between the experimental and predicted values of the charge radius R; the experimental value of the second moment of charge (2pF model); the predicted value of the second moment of charge; the difference of the second moment of charge; the experimental value of the fourth moment of charge (2pF model); the predicted value of the fourth moment of charge; and the difference of the fourth moment of charge. The prediction-set table has 284 rows and 7 columns; each row holds the data for one nucleus, and the columns are: proton number; neutron number; mass number; the predicted value of parameter c; the predicted value of parameter z; the predicted value of the charge radius R; and the experimental value of the charge radius R.

    5. Fig.8-data.opju
    This file contains the original information from Figure 8 in the associated article. It has only one workbook, which contains several tables, each holding the information for one calcium isotope and named after that isotope. Each table has 3 columns and 1000 rows. The three columns are: the radial coordinate r, in fm; the charge density distribution value; and the charge density error-band value. Each row corresponds to one plotted coordinate point and its error band.

  8. Contains the following files: Fig A. Frequency that each variable is...

    • plos.figshare.com
    Updated May 31, 2023
    Cite
    Pi Guo; Fangfang Zeng; Xiaomin Hu; Dingmei Zhang; Shuming Zhu; Yu Deng; Yuantao Hao (2023). Contains the following files: Fig A. Frequency that each variable is selected for the stepwise variable selection method when sample size changes. [Dataset]. http://doi.org/10.1371/journal.pone.0134151.s001
    Dataset provided by
    PLOS ONE
    Authors
    Pi Guo; Fangfang Zeng; Xiaomin Hu; Dingmei Zhang; Shuming Zhu; Yu Deng; Yuantao Hao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig B. Frequency that each variable is selected for the stability selection method when sample size changes. Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig C. Frequency that each variable is selected for the LASSO variable selection method when sample size changes. Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig D. Frequency that each variable is selected for the Bolasso variable selection method when sample size changes. Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig E. Frequency that each variable is selected for the two-stage hybrid variable selection method when sample size changes. Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig F. Frequency that each variable is selected for the bootstrap ranking variable selection method when sample size changes. Plot based on 100 simulations using various sample sizes (n = 100, 200, 300 and 500). Left panel: the number of true predictors r = 8; right panel: the number of true predictors r = 12. Red bars represent the selection frequency of significant variables (true non-zero predictors) and grey bars represent that of noise variables (true zero predictors) in the simulated data.

    Fig G. Sensitivity analysis based on the metric AUC to evaluate the performance of the compared methods when the number of true predictors (r = 8, 12, 16 and 20) and sample size (n = 50, 100, 200, 300 and 500) changes with respect to the small effect predictors of group 1. A total number of variables t = 100 were simulated. Six compared methods: stepwise, stability selection, LASSO, Bolasso, two-stage hybrid and bootstrap ranking procedures.

    Fig H. Sensitivity analysis based on the metric AUC to evaluate the performance of the compared methods when the number of true predictors (r = 8, 12, 16 and 20) and sample size (n = 50, 100, 200, 300 and 500) changes with respect to the small effect predictors of group 2. A total number of variables t = 100 were simulated. Six compared methods: stepwise, stability selection, LASSO, Bolasso, two-stage hybrid and bootstrap ranking procedures.

    Fig I. The estimation of tuning parameter λ and coefficients for the LASSO model. (A): The deviance with error bar of the LASSO logistic regression model using a 10-fold cross-validation across different values of the tuning parameter (log-scale). The optimal model is the one with a deviance of 0.2301 when the tuning parameter reaches 0.0017. (B): The path of the estimated coefficients over a grid of values for λ and the selected variables corresponding to the optimal λ.

    Table A. R codes of the two-stage hybrid procedure. The R function TSLasso was used for establishing the two-stage hybrid procedure.

    Table B. R codes of the bootstrap ranking procedure. The R function Bootranking was used for establishing the bootstrap ranking procedure.

    Table C. A de-identified dataset of this work was made publicly available.

    (DOCX)
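
    For reference, the LASSO tuning step described for Fig I can be sketched in R as follows; this is illustrative only and not the authors' code from Tables A-B, and x and y stand in for the simulated predictors and binary outcome.

      library(glmnet)
      set.seed(1)
      x <- matrix(rnorm(200 * 20), nrow = 200)        # placeholder predictor matrix
      y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))    # placeholder binary outcome
      cv_fit <- cv.glmnet(x, y, family = "binomial",
                          type.measure = "deviance", nfolds = 10)
      plot(cv_fit)                     # CV deviance with error bars over log(lambda)
      coef(cv_fit, s = "lambda.min")   # coefficients at the optimal lambda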

  9. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • datadryad.org
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Dataset provided by
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Prevention Trials Network
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early-cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
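    As a rough orientation only (this is not the authors' Sequence_Analysis.Rmd), the step of decompressing the per-dataset consensus archives, pooling the sUMI and dUMI consensus sequences into two fasta files, and tallying sequences per sample might look like the following R sketch; the file layout and the "sUMI"/"dUMI"/sample naming patterns are assumptions.

    ## Minimal sketch: unpack consensus archives, pool sequences, count per sample.
    ## Paths and name patterns are assumptions about the archive layout.
    library(Biostrings)

    archives <- list.files("Pipeline_Outputs", pattern = "^consensus_.*\\.tar\\.gz$",
                           full.names = TRUE)
    for (a in archives) untar(a, exdir = "consensus_unpacked")

    fastas <- list.files("consensus_unpacked", pattern = "\\.fasta$",
                         recursive = TRUE, full.names = TRUE)
    seqs   <- do.call(c, lapply(fastas, readDNAStringSet))

    # Split pooled sequences by UMI type encoded in the sequence names (assumed).
    sumi <- seqs[grepl("sUMI", names(seqs))]
    dumi <- seqs[grepl("dUMI", names(seqs))]
    writeXStringSet(sumi, "all_sUMI.fasta")
    writeXStringSet(dumi, "all_dUMI.fasta")

    # Per-sample sequence counts, assuming names are formatted "sample_UMI_rank".
    sample_id <- sub("_.*$", "", names(dumi))
    as.data.frame(table(sample = sample_id))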

  10. u

    Replication Data for: ecolRxC: Ecological inference estimation of R×C tables...

    • producciocientifica.uv.es
    Updated 2024
    Cite
    Pavia, Jose M. (2024). Replication Data for: ecolRxC: Ecological inference estimation of R×C tables using latent structure approaches [Dataset]. https://producciocientifica.uv.es/documentos/67321de3aea56d4af0484e77
    Explore at:
    Dataset updated
    2024
    Authors
    Pavia, Jose M.
    Description

    Ecological inference is a statistical technique used to infer individual behaviour from aggregate data. A particularly relevant instance of ecological inference involves the estimation of the inner cells of a set of R×C related contingency tables when only their aggregate margins are known. This problem spans multiple disciplines, including quantitative history, epidemiology, political science, marketing, and sociology. This paper proposes new models for solving this problem using latent structure theory and presents the ecolRxC package, an R implementation of this methodology. The article exemplifies, explains and statistically documents the new extensions and, using real inner-cell election data, shows how the new models in ecolRxC lead to significantly more accurate solutions than ecol and VTR, two Stata routines suggested within this framework. ecolRxC also holds its own against ei.MD.bayes and nslphom, the two algorithms currently identified in the literature as the most accurate for solving this problem, recording accuracies as good as those reported for them. Moreover, from a theoretical perspective, ecolRxC stands out for building its algorithm on a causal theory of political behaviour. This distinguishes it from procedures proposed from other frameworks (such as ei.MD.bayes and nslphom), which model expected behaviours instead of modelling how voters make choices based on their underlying preferences, as ecolRxC does.
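    To make the R×C setting concrete, the toy R sketch below simulates the kind of aggregate data the problem starts from: only the row margins (votes by party in the first election) and column margins (votes in the second election) are observed per aggregation unit, while the unit-level transfer tables are unknown. This is an illustration of the problem, not a call to the ecolRxC package; all labels and the naive independence baseline are hypothetical.

    ## Toy R x C ecological inference setup (not the ecolRxC API).
    set.seed(1)
    n_units <- 50
    rows <- c("L", "C", "R")             # parties in election 1 (R = 3)
    cols <- c("L2", "C2", "R2", "Abst")  # options in election 2 (C = 4)

    votes1 <- matrix(rpois(n_units * length(rows), 200), n_units,
                     dimnames = list(NULL, rows))
    # Hidden "true" transfer proportions, used here only to simulate margins.
    true_p <- prop.table(matrix(runif(length(rows) * length(cols)),
                                length(rows), length(cols)), 1)
    votes2 <- round(votes1 %*% true_p)
    colnames(votes2) <- cols

    # Naive baseline: assume transfers are proportional to overall column shares;
    # latent-structure models such as those in ecolRxC aim to do much better.
    naive_cells <- outer(colSums(votes1) / sum(votes1),
                         colSums(votes2) / sum(votes2)) * sum(votes2)
    round(naive_cells)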

  11. Data from: Data and code for the publication entitled: tree growth and...

    • dataverse-qualification.cirad.fr
    • dataverse.cirad.fr
    Updated Aug 17, 2022
    Cite
    CIRAD Dataverse (2022). Data and code for the publication entitled: tree growth and mortality of 42 timber species in Central Africa [Dataset]. http://doi.org/10.18167/DVN1/EBN15Y
    Explore at:
    html(2073677), xlsx(61277), application/x-rlang-transport(2362007), application/x-r-data(1684963), text/x-r-markdown(57769), csv(928), text/x-r-source(13462)Available download formats
    Dataset updated
    Aug 17, 2022
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Central Africa
    Dataset funded by
    This work was supported by the “Fonds Français pour l’Environnement Mondial” (DynAfFor project, convention CZZ1636.01D and CZZ1636.02D, P3FAC project, convention CZZ 2101.01 R).
    Description

    Introduction
    This archive contains all the data and R scripts necessary to reproduce the results of the manuscript entitled "tree growth and mortality of 42 timber species in Central Africa" submitted to the journal Forest Ecology and Management. It includes cleansed data (text files and Rdata files), computed data (estimates of tree growth and mortality rates, Rdata files), R scripts to reproduce the computation of the estimates as well as the analyses and figures presented in the main paper, and an Excel file containing all the supplementary material tables of the manuscript.

    Cleansed data
    To produce the cleansed data, raw data was collected for each site. The different datasets were standardized so that all of them could be stored in a single database. Next, consecutive diameter measurements were analyzed and some outliers were discarded (see the explanations in the main manuscript). The cleansed data can be loaded using either text-delimited csv files or an Rdata file. It contains the following five tables.

    Table cleansed_data_species.csv contains information about each study species; each line corresponds to one species. Columns: code: species identifying name; timber_name: species name as used by the ATIBT; species_name_sci: current scientific species name (genus + species); species_name: vernacular species name; dme: reference value of the minimum cutting diameter as defined by the Cameroonian Government; incr: reference value of diameter increment (cm/year) as defined by the Cameroonian Government; cjb_id: species id of the CJB database; see_name: species CJB id of synonym names; species_name_sci_full: full current scientific name (genus + species + authority).

    Table cleansed_data_observation_codes.csv contains the description of the codes used in the field to note any particularities of the monitored trees; one line corresponds to one code. Columns: code: observation code; label_fr: French explanation of the code (as used in the field); label_en: English translation of the explanation of the code.

    Table cleansed_data_mortality_codes.csv contains the description of the codes used to characterize the likely cause of recorded tree death. Columns: code: mortality code; label_fr: French explanation of the code; label_en: English translation of the explanation of the code.

    Table cleansed_data_records.csv contains the information collected for each tree; each line corresponds to one record for one tree, and there are several lines per tree as they were measured several times. Columns: site: site name; id_site: site identifying number; id_plot: plot identifying number; treatment: treatment (control, exploited or mixed); exploitation_year: year of the exploitation (4 digits); species: species vernacular name (corresponding to the species_name column of the species table); id_tree: tree identifying number; number: tree number (the number that was painted on the tree); id: record identifying number; date: record date (yyyy-mm-dd); census_year: year of the census; diameter: tree diameter measured at hom (cm); diameter450: tree diameter measured at 450 cm in height (cm); hom: height of measurement of the diameter (cm); code_observation: observation codes (corresponding to the code column of the observation_codes table); multiple codes were sometimes used and are separated by a dash; code_mortality: mortality codes (corresponding to the code column of the mortality_codes table); comment: any additional comment.

    Table cleansed_data_Increments.csv. Columns: id: id of the initial measurement; id_tree: tree id number; id_plot: plot id number; treatment: treatment (control, exploited); species: species vernacular name (corresponding to species_name of the species table); number: tree number (the number that was written on the tree); hom: height of measurement (cm); id_hom: id of the HOM (sometimes the HOM had to be changed, e.g. due to buttresses or wounds); initial_date: date of the first census; initial_diameter: the diameter measured at the first census (cm); diameter_increment: the annual diameter increment computed between the two considered censuses (cm/year); increment_period: the number of years separating the two censuses (years); diameter_observation: observation codes (corresponding to code of the observation_codes table) noted during the first and second census, separated by a "/"; diameter_comment: additional comments written during the two measurements, separated by a "/"; Id_species: species identifying number; Id_site: site identifying number; Site: name of the site; Exploitation_year: year of the exploitation (if any).

    File cleansed_data.Rdata contains the five tables (species, mortality_codes, observation_codes, records and increment) of the cleansed data. It can be used to load them rapidly in R.

    Computed data
    From the cleansed data, we computed - as explained in the main manuscript - tree growth and mortality rates using an R script (3-computation.R). This script produces the "computed data", which contains six tables provided as three additional csv files and one Rdata file.

    Table computed_data_records.csv is the same as the records table but with one additional column: exploitation_date: the assumed date of the exploitation, if any (yyyy-mm-dd).

    Table computed_data_growth_rates.csv contains one line per combination of tree and treatment; it contains the estimates of diameter increment computed over all available records. Columns: site: site name; id_site: site identifying number; treatment: treatment (control or exploited); species: species vernacular name; id_plot: plot id number; id_tree: tree id number; initial_diameter: tree diameter at the beginning of the census period (cm); increment_period: length of the census period (years); initial_date: date of the first census (yyyy-mm-dd); diameter_observation: observation codes, if any; diameter_comment: comment, if any; exploitation_year: year of the exploitation (4 digits); exploitation_date: assumed date of the last exploitation (if treatment = logged or mixed); mid_point: mid-point of the census period (yyyy-mm-dd); years_after_expl: length of time between the exploitation date and the first measurement; n_increment: number of consecutive increments; n_hom: number of changes of hom during the census period; diameter_increment: estimate of the diameter increment (cm/year).

    Table computed_data_mortality_rates.csv contains estimates of mortality rates for each species and site. Columns: id_site: site id number; treatment: treatment (control or exploited); time_min: minimum of the length of the census periods; time_max: maximum of the length of the census periods; time_sd: standard deviation of the length of the census periods -- deleted; exploitation_year: exploitation year (if treatment = exploited); years_after_expl_mid: number of years between the assumed exploitation and the mid-period census; years_after_expl_start: number of years between the assumed exploitation and the first census; site: site name; species: species vernacular name; N0: number of monitored trees; N_surviving: number of surviving trees; meantime: mean monitoring period length; rate: estimate of the mortality rate; lowerCI: lower bound of the confidence interval of the mortality rate; upperCI: upper bound of the confidence interval of the mortality rate.

    File computed_data.Rdata contains the six tables (species, records, growth_rates, mortality_rates, mortality_codes, observation_codes) of the computed data. It can be used to load them in R.

    Analyses
    The analyses presented in the main manuscript were produced with an Rmd script (4-analyses.Rmd). This script generates an HTML report (4-analyses.html), as well as the figures shown in the manuscript and an Excel file with all the supplementary tables (one sheet per supplementary table).
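    For orientation, the following R sketch shows how quantities of the kind stored in the computed tables can be derived from the cleansed records table: annual diameter increments between consecutive censuses of the same tree and hom, and an annualised mortality rate per species. It is not the archive's 3-computation.R; column names follow the tables described above, and the mortality formula is one common annualisation, not necessarily the one used in the manuscript.

    library(dplyr)

    load("cleansed_data.Rdata")   # provides `records`, among other tables

    increments <- records %>%
      mutate(date = as.Date(date)) %>%
      arrange(id_tree, hom, date) %>%
      group_by(id_tree, hom) %>%
      mutate(years = as.numeric(difftime(date, lag(date), units = "days")) / 365.25,
             diameter_increment = (diameter - lag(diameter)) / years) %>%
      ungroup() %>%
      filter(!is.na(diameter_increment))

    mortality <- records %>%
      mutate(date = as.Date(date)) %>%
      group_by(species, id_tree) %>%
      summarise(dead  = any(!is.na(code_mortality) & code_mortality != ""),
                years = as.numeric(difftime(max(date), min(date),
                                            units = "days")) / 365.25,
                .groups = "drop") %>%
      group_by(species) %>%
      summarise(N0 = n(), N_surviving = sum(!dead), meantime = mean(years),
                rate = 1 - (N_surviving / N0)^(1 / meantime))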

  12. d

    Alaska Geochemical Database Version 3.0 (AGDB3) including best value data...

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Alaska Geochemical Database Version 3.0 (AGDB3) including best value data compilations for rock, sediment, soil, mineral, and concentrate sample media [Dataset]. https://catalog.data.gov/dataset/alaska-geochemical-database-version-3-0-agdb3-including-best-value-data-compilations-for-r
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Alaska
    Description

    The Alaska Geochemical Database Version 3.0 (AGDB3) contains new geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving speed and efficiency of use. Like the Alaska Geochemical Database Version 2.0 before it, the AGDB3 was created and designed to compile and integrate geochemical data from Alaska to facilitate geologic mapping, petrologic studies, mineral resource assessments, definition of geochemical baseline values and statistics, element concentrations and associations, environmental impact assessments, and studies in public health associated with geology. This relational database, created from databases and published datasets of the U.S. Geological Survey (USGS), Atomic Energy Commission National Uranium Resource Evaluation (NURE), Alaska Division of Geological & Geophysical Surveys (DGGS), U.S. Bureau of Mines, and U.S. Bureau of Land Management, serves as a data archive in support of Alaskan geologic and geochemical projects and contains data tables in several different formats describing historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 112 laboratory and field analytical methods on 396,343 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. Most samples were collected by personnel of these agencies and analyzed in agency laboratories or, under contracts, in commercial analytical laboratories. These data represent analyses of samples collected as part of various agency programs and projects from 1938 through 2017. In addition, mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are included in this database. The AGDB3 includes historical geochemical data archived in the USGS National Geochemical Database (NGDB) and NURE National Uranium Resource Evaluation-Hydrogeochemical and Stream Sediment Reconnaissance databases, and in the DGGS Geochemistry database. Retrievals from these databases were used to generate most of the AGDB data set. These data were checked for accuracy regarding sample location, sample media type, and analytical methods used, and corrections were entered, resulting in a significantly improved Alaska geochemical dataset, the AGDB3. The data in the AGDB3 thus supersede the data in the AGDB and the AGDB2, but the background information about the data in these two earlier versions is needed by users of the current AGDB3 to understand what has been done to amend, clean up, correct, and format the data. Data that were not previously in these databases because the data predate the earliest agency geochemical databases, or were once excluded for programmatic reasons, are included here in the AGDB3 and will be added to the NGDB and Alaska Geochemistry. The AGDB3 data provided here are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. The AGDB3 data provided in the online version of the database may be updated or changed periodically.

  13. C

    Air Quality

    • data.ccrpc.org
    csv
    Updated Jun 13, 2025
    Cite
    Champaign County Regional Planning Commission (2025). Air Quality [Dataset]. https://data.ccrpc.org/dataset/air-quality
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 13, 2025
    Dataset authored and provided by
    Champaign County Regional Planning Commission
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This indicator shows how many days per year were assessed to have air quality that was worse than “moderate” in Champaign County, according to the U.S. Environmental Protection Agency’s (U.S. EPA) Air Quality Index Reports. The period of analysis is 1980-2024, and the U.S. EPA’s air quality ratings analyzed here are as follows, from best to worst: “good,” “moderate,” “unhealthy for sensitive groups,” “unhealthy,” “very unhealthy,” and "hazardous."[1]

    In 2024, the number of days rated to have air quality worse than moderate was 0. This is a significant decrease from the 13 days in 2023 in the same category, the highest in the 21st century. That figure is likely due to the air pollution created by the unprecedented Canadian wildfire smoke in Summer 2023.

    While there has been no consistent year-to-year trend in the number of days per year rated to have air quality worse than moderate, the number of days in peak years had decreased from 2000 through 2022. Where peak years before 2000 had between one and two dozen days with air quality worse than moderate (e.g., 1983, 18 days; 1988, 23 days; 1994, 17 days; 1999, 24 days), the year with the greatest number of days with air quality worse than moderate from 2000-2022 was 2002, with 10 days. There were several years between 2006 and 2022 that had no days with air quality worse than moderate.

    This data is sourced from the U.S. EPA’s Air Quality Index Reports. The reports are released annually, and our period of analysis is 1980-2024. The Air Quality Index Report website does caution that "[a]ir pollution levels measured at a particular monitoring site are not necessarily representative of the air quality for an entire county or urban area," and recommends that data users do not compare air quality between different locations[2].

    [1] Environmental Protection Agency. (1980-2024). Air Quality Index Reports. (Accessed 13 June 2025).

    [2] Ibid.

    Source: Environmental Protection Agency. (1980-2024). Air Quality Index Reports. https://www.epa.gov/outdoor-air-quality-data/air-quality-index-report. (Accessed 13 June 2025).
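    As a sketch of how this indicator could be tabulated from daily ratings, the R snippet below counts, per year, the days rated worse than "moderate" using the category labels listed above; the input file name and column names are assumptions for illustration, not the EPA report format.

    library(dplyr)

    worse_than_moderate <- c("Unhealthy for Sensitive Groups", "Unhealthy",
                             "Very Unhealthy", "Hazardous")

    daily_aqi <- read.csv("champaign_daily_aqi.csv")   # assumed columns: date, category

    daily_aqi %>%
      mutate(year = as.integer(format(as.Date(date), "%Y"))) %>%
      group_by(year) %>%
      summarise(days_worse_than_moderate = sum(category %in% worse_than_moderate))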

  14. Data for Figures and Tables in Journal Article "Assessment of the Effects of...

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data for Figures and Tables in Journal Article "Assessment of the Effects of Horizontal Grid Resolution on Long-Term Air Quality Trends using Coupled WRF-CMAQ Simulations", doi:10.1016/j.atmosenv.2016.02.036 [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/data-for-figures-and-tables-in-journal-article-assessment-of-the-effects-of-horizontal-g-0
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    The dataset represents the data depicted in the Figures and Tables of a Journal Manuscript with the following abstract: "The objective of this study is to determine the adequacy of using a relatively coarse horizontal resolution (i.e. 36 km) to simulate long-term trends of pollutant concentrations and radiation variables with the coupled WRF-CMAQ model. WRF-CMAQ simulations over the continental United States are performed over the 2001 to 2010 time period at two different horizontal resolutions of 12 and 36 km. Both simulations used the same emission inventory and model configurations. Model results are compared both in space and time to assess the potential weaknesses and strengths of using coarse resolution in long-term air quality applications. The results show that the 36 km and 12 km simulations are comparable in terms of trends analysis for both pollutant concentrations and radiation variables. The advantage of using the coarser 36 km resolution is a significant reduction of computational cost, time and storage requirements, which are key considerations when performing multiple years of simulations for trend analysis. However, if such simulations are to be used for local air quality analysis, finer horizontal resolution may be beneficial since it can provide information on local gradients. In particular, divergences between the two simulations are noticeable in urban, complex terrain and coastal regions.". This dataset is associated with the following publication: Gan , M., C. Hogrefe , R. Mathur , J. Pleim , J. Xing , D. Wong , R. Gilliam , G. Pouliot , and C. Wei. Assessment of the effects of horizontal grid resolution on long-term air quality trends using coupled WRF-CMAQ simulations. ATMOSPHERIC ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 132: 207-216, (2016).

  15. f

    Data from: metLinkR: Facilitating Metaanalysis of Human Metabolomics Data...

    • figshare.com
    zip
    Updated Apr 4, 2025
    Cite
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé (2025). metLinkR: Facilitating Metaanalysis of Human Metabolomics Data through Automated Linking of Metabolite Identifiers [Dataset]. http://doi.org/10.1021/acs.jproteome.4c01051.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 4, 2025
    Dataset provided by
    ACS Publications
    Authors
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metabolites are referenced in spectral, structural and pathway databases with a diverse array of schemas, including various internal database identifiers and large tables of common name synonyms. Cross-linking metabolite identifiers is a required step for meta-analysis of metabolomic results across studies, but it is made difficult by the lack of a consensus identifier system. We have implemented metLinkR, an R package that leverages RefMet and RaMP-DB to automate and simplify cross-linking metabolite identifiers across studies and generating common names. metLinkR accepts as input metabolite common names and identifiers from five different databases (HMDB, KEGG, ChEBI, LIPIDMAPS and PubChem) to exhaustively search for possible overlap in supplied metabolites from input data sets. In an example of 13 metabolomic data sets totaling 10,400 metabolites, metLinkR identified and provided common names for 1377 metabolites in common between at least 2 data sets in less than 18 min and produced standardized names for 74.4% of the input metabolites. In another example comprising five data sets with 3512 metabolites, metLinkR identified 715 metabolites in common between at least two data sets in under 12 min and produced standardized names for 82.3% of the input metabolites. Outputs of metLinkR include tables and metrics that allow users to readily double-check the mappings and to get an overview of the chemical classes represented. Overall, metLinkR provides a streamlined solution for a common task in metabolomic epidemiology and other fields that meta-analyze metabolomic data. The R package, vignette and source code are freely downloadable at https://github.com/ncats/metLinkR.
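    The R sketch below illustrates the kind of per-study input tables described above (common names plus HMDB, KEGG, ChEBI and PubChem identifiers) that a cross-linking tool must reconcile; the column names are assumptions for illustration, and the actual metLinkR interface should be taken from the package vignette at https://github.com/ncats/metLinkR.

    ## Hypothetical per-study identifier tables; metLinkR's job is to find the
    ## overlap between such heterogeneous identifier sets and assign common names.
    study1 <- data.frame(
      metabolite_name = c("glucose", "cholesterol"),
      hmdb            = c("HMDB0000122", "HMDB0000067"),
      kegg            = c("C00031", "C00187"),
      chebi           = c("CHEBI:17234", "CHEBI:16113"),
      pubchem         = c("5793", "5997"),
      stringsAsFactors = FALSE
    )
    # A second study might report only KEGG and PubChem IDs.
    study2 <- data.frame(
      metabolite_name = c("Glucose", "Creatinine"),
      kegg            = c("C00031", "C00791"),
      pubchem         = c("5793", "588"),
      stringsAsFactors = FALSE
    )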

  16. H

    National Health and Nutrition Examination Survey (NHANES)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). National Health and Nutrition Examination Survey (NHANES) [Dataset]. http://doi.org/10.7910/DVN/IMWQPJ
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the national health and nutrition examination survey (nhanes) with r. nhanes is this fascinating survey where doctors and dentists accompany survey interviewers in a little mobile medical center that drives around the country. while the survey folks are interviewing people, the medical professionals administer laboratory tests and conduct a real doctor's examination. the blood work and medical exam allow researchers like you and me to answer tough questions like, "how many people have diabetes but don't know they have diabetes?" conducting the lab tests and the physical isn't cheap, so a new nhanes data set becomes available once every two years and only includes about twelve thousand respondents. since the number of respondents is so small, analysts often pool multiple years of data together. the replication scripts below give a few different examples of how multiple years of data can be pooled with r. the survey gets conducted by the centers for disease control and prevention (cdc), and generalizes to the united states non-institutional, non-active duty military population. most of the data tables produced by the cdc include only a small number of variables, so importation with the foreign package's read.xport function is pretty straightforward. but that makes merging the appropriate data sets trickier, since it might not be clear what to pull for which variables. for every analysis, start with the table with 'demo' in the name -- this file includes basic demographics, weighting, and complex sample survey design variables. since it's quick to download the files directly from the cdc's ftp site, there's no massive ftp download automation script. this new github repository contains five scripts:

    2009-2010 interview only - download and analyze.R: download, import, save the demographics and health insurance files onto your local computer; load both files, limit them to the variables needed for the analysis, merge them together; perform a few example variable recodes; create the complex sample survey object, using the interview weights; run a series of pretty generic analyses on the health insurance questions

    2009-2010 interview plus laboratory - download and analyze.R: download, import, save the demographics and cholesterol files onto your local computer; load both files, limit them to the variables needed for the analysis, merge them together; perform a few example variable recodes; create the complex sample survey object, using the mobile examination component (mec) weights; perform a direct-method age-adjustment and match figure 1 of this cdc cholesterol brief

    replicate 2005-2008 pooled cdc oral examination figure.R: download, import, save, pool, recode, create a survey object, run some basic analyses; replicate figure 3 from this cdc oral health databrief - the whole barplot

    replicate cdc publications.R: download, import, save, pool, merge, and recode the demographics file plus cholesterol laboratory, blood pressure questionnaire, and blood pressure laboratory files; match the cdc's example sas and sudaan syntax file's output for descriptive means, descriptive proportions, and descriptive percentiles

    replicate human exposure to chemicals report.R (user-contributed): download, import, save, pool, merge, and recode the demographics file plus urinary bisphenol a (bpa) laboratory files; log-transform some of the columns to calculate the geometric means and quantiles; match the 2007-2008 statistics shown on pdf page 21 of the cdc's fourth edition of the report

    click here to view these five scripts. for more detail about the national health and nutrition examination survey (nhanes), visit: the cdc's nhanes homepage; the national cancer institute's page of nhanes web tutorials.

    notes: nhanes includes interview-only weights and interview + mobile examination component (mec) weights. if you only use questions from the basic interview in your analysis, use the interview-only weights (the sample size is a bit larger). i haven't really figured out a use for the interview-only weights -- nhanes draws most of its power from the combination of the interview and the mobile examination component variables. if you're only using variables from the interview, see if you can use a data set with a larger sample size like the current population survey (cps), national health interview survey (nhis), or medical expenditure panel survey (meps) instead.

    confidential to sas, spss, stata, sudaan users: why are you still riding around on a donkey after we've invented the internal combustion engine? time to transition to r. :D
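    A condensed R sketch of the workflow the first script implements (not the script itself): download the 2009-2010 demographics and health insurance files, merge them, and build the complex-sample design object with the interview weights. The NHANES variable names (SEQN, WTINT2YR, SDMVPSU, SDMVSTRA, HIQ011) and the download path follow CDC conventions but should be checked against the current documentation.

    library(foreign)   # read.xport
    library(survey)    # svydesign, svymean

    base <- "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/"   # path may change
    for (f in c("DEMO_F.XPT", "HIQ_F.XPT")) {
      download.file(paste0(base, f), f, mode = "wb")
    }

    demo <- read.xport("DEMO_F.XPT")[, c("SEQN", "WTINT2YR", "SDMVPSU", "SDMVSTRA")]
    hiq  <- read.xport("HIQ_F.XPT")[, c("SEQN", "HIQ011")]
    nhanes <- merge(demo, hiq, by = "SEQN")

    # interview-only weights, nested PSUs within strata
    des <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTINT2YR,
                     nest = TRUE, data = nhanes)
    svymean(~factor(HIQ011), des, na.rm = TRUE)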

  17. Table 3 LCF Results and Figure S2 LCF data

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Jan 5, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Table 3 LCF Results and Figure S2 LCF data [Dataset]. https://catalog.data.gov/dataset/table-3-lcf-results-and-figure-s2-lcf-data
    Explore at:
    Dataset updated
    Jan 5, 2021
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    Table 3 shows the distribution of lead speciation results for soil samples along with the fitting error (R factor). Data for Figure S2 contains the sample spectra and linear combination fitting (LCF) data used to determine the results in Table 3. This dataset is associated with the following publication: Kastury, F., R.R. Karna, K.G. Scheckel, and A.L. Juhasz. Correlation between lead speciation and inhalation bioaccessibility using two different simulated lung fluids. ENVIRONMENTAL POLLUTION. Elsevier Science Ltd, New York, NY, USA, 263 part B: 114609, (2020).

  18. d

    Data from: Grass-Cast Database - Data on aboveground net primary...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Grass-Cast Database - Data on aboveground net primary productivity (ANPP), climate data, NDVI, and cattle weight gain for Western U.S. rangelands [Dataset]. https://catalog.data.gov/dataset/grass-cast-database-data-on-aboveground-net-primary-productivity-anpp-climate-data-ndvi-an-ac7cd
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Area covered
    United States
    Description

    Grass-Cast: Experimental Grassland Productivity Forecast for the Great Plains
    Grass-Cast uses almost 40 years of historical data on weather and vegetation growth in order to project grassland productivity in the Western U.S. More details on the projection model and method can be found at https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecs2.3280. Every spring, ranchers in the drought‐prone U.S. Great Plains face the same difficult challenge—trying to estimate how much forage will be available for livestock to graze during the upcoming summer grazing season. To reduce this uncertainty in predicting forage availability, we developed an innovative new grassland productivity forecast system, named Grass‐Cast, to provide science‐informed estimates of growing season aboveground net primary production (ANPP). Grass‐Cast uses over 30 yr of historical data including weather and the satellite‐derived normalized difference vegetation index (NDVI)—combined with ecosystem modeling and seasonal precipitation forecasts—to predict if rangelands in individual counties are likely to produce below‐normal, near‐normal, or above‐normal amounts of grass biomass (lbs/ac). Grass‐Cast also provides a view of rangeland productivity in the broader region, to assist in larger‐scale decision‐making—such as where forage resources for grazing might be more plentiful if a rancher’s own region is at risk of drought. Grass‐Cast is updated approximately every two weeks from April through July. Each Grass‐Cast forecast provides three scenarios of ANPP for the upcoming growing season based on different precipitation outlooks. Near real‐time 8‐d NDVI can be used to supplement Grass‐Cast in predicting cumulative growing season NDVI and ANPP starting in mid‐April for the Southern Great Plains and mid‐May to early June for the Central and Northern Great Plains. Here, we present the scientific basis and methods for Grass‐Cast along with the county‐level production forecasts from 2017 and 2018 for ten states in the U.S. Great Plains. The correlation between early growing season forecasts and the end‐of‐growing season ANPP estimate is >50% by late May or early June. In a retrospective evaluation, we compared Grass‐Cast end‐of‐growing season ANPP results to an independent dataset and found that the two agreed 69% of the time over a 20‐yr period. Although some predictive tools exist for forecasting upcoming growing season conditions, none predict actual productivity for the entire Great Plains. The Grass‐Cast system could be adapted to predict grassland ANPP outside of the Great Plains or to predict perennial biofuel grass production. This new experimental grassland forecast is the result of a collaboration between Colorado State University, U.S. Department of Agriculture (USDA), National Drought Mitigation Center, and the University of Arizona. Funding for this project was provided by the USDA Natural Resources Conservation Service (NRCS), USDA Agricultural Research Service (ARS), and the National Drought Mitigation Center. Watch for updates on the Grass-Cast website or on Twitter (@PeckAgEc). Project Contact: Dannele Peck, Director of the USDA Northern Plains Climate Hub, at dannele.peck@ars.usda.gov or 970-744-9043.

    Resources in this dataset:
    Resource Title: Cattle weight gain. File Name: Cattle_weight_gains.xlsx. Resource Description: Cattle weight gain data for the Grass-Cast Database.
    Resource Title: NDVI. File Name: NDVI.xlsx. Resource Description: Annual NDVI growing season values for Grass-Cast sites. See the readme for more information and NDVI_raw for the raw values.
    Resource Title: NDVI_raw. File Name: NDVI_raw.xlsx. Resource Description: Raw bimonthly NDVI values for Grass-Cast sites.
    Resource Title: ANPP. File Name: ANPP.xlsx. Resource Description: Dataset for annual aboveground net primary productivity (ANPP). The Excel workbook is broken into two tabs: 1) 'readme' describing the data, 2) 'ANPP' with the actual data.
    Resource Title: Grass-Cast_sitelist. File Name: Grass-Cast_sitelist.xlsx. Resource Description: A list of the sites/studies that are currently incorporated into the database, as well as metadata and contact info associated with the data sets. Includes a 'readme' tab and a 'sitelist' tab.
    Resource Title: Grass-Cast_AgDataCommons_overview. File Name: Grass-Cast_AgDataCommons_download.html. Resource Description: HTML document that shows database overview information. This document provides a glimpse of the data tables available within the data resource as well as the respective metadata tables. The R script (R markdown, .Rmd format) that generates the html file, and that can be used to upload the Grass-Cast associated Ag Data Commons data files, can be downloaded from the 'Grass-Cast R script' zip folder. The Grass-Cast files still need to be downloaded locally before use, but we are looking into automating the download.
    Resource Title: Grass-Cast R script. File Name: R_access_script.zip. Resource Description: R script (in R markdown [Rmd] format) for uploading and looking at Grass-Cast data.
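    A minimal R sketch for loading these workbooks after downloading them locally, following the tab layout described above; sheet positions for NDVI.xlsx and the join column are assumptions, not part of the published documentation.

    library(readxl)
    library(dplyr)

    anpp  <- read_excel("ANPP.xlsx", sheet = "ANPP")
    ndvi  <- read_excel("NDVI.xlsx", sheet = 2)            # assumes sheet 1 is the readme
    sites <- read_excel("Grass-Cast_sitelist.xlsx", sheet = "sitelist")

    # Hypothetical join: attach site metadata to the ANPP records by a shared
    # site identifier column (column name assumed).
    glimpse(left_join(anpp, sites, by = "site_id"))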

  19. Dataset: Benthic Invertebrate Conductivity Extirpation Estimates with...

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • catalog.data.gov
    Updated Jul 18, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Benthic Invertebrate Conductivity Extirpation Estimates with Chloride, Bicarbonate, and Sulfate Mixtures [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/dataset-benthic-invertebrate-conductivity-extirpation-estimates-with-chloride-bicarbonate-
    Explore at:
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    These data include benthic invertebrate occurrence data and associated water quality data within the states of Maryland, Pennsylvania, Vermont, and West Virginia. Data are sorted into stations dominated by chloride or sulfate ions or a mix of the two. Also included are the original and curated ion mixture data, R scripts, summary plots and tables. Additional detail on methods and applications are available in these two papers: Cormier, S., L. Zheng, and C. Flaherty. A field-based model of the relationship between extirpation of salt-intolerant benthic invertebrates and background conductivity. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 633: 1629-1636, (2018). Cormier, S.M., Suter, G.W., Fernandez, M.B. and Zheng, L., 2020. Adequacy of sample size for estimating a value from field observational data. Ecotoxicology and environmental safety, 203, p.110992. This dataset is associated with the following publication: Cormier, S., T. Newcomer Johnson, and C. Wharton. Freshwater Explorer 2.0 Data and mapping capabilities and assessment examples. Presented at OWOW Webinar, Webinar, CT, USA, 06/24/2025 - 06/24/2025.

  20. Z

    Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Samoilova, Evgenia (Zhenya)
    Loist, Skadi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
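    The R sketch below illustrates the relationship between the long and wide film tables described above: collapsing the long table, one row per film and festival appearance, to one row per unique film, keeping the first sample festival as the festival variable. Column names here are assumptions for illustration; the codebook gives the actual variable names.

    library(dplyr)

    films_long <- read.csv("1_film-dataset_festival-program_long.csv")

    films_wide <- films_long %>%
      arrange(film_id, festival_edition_year) %>%   # column names assumed
      group_by(film_id) %>%
      slice(1) %>%                                  # first sample festival only
      ungroup() %>%
      rename(fest = festival)

    nrow(films_wide)   # should equal the number of unique films (n = 9,348)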

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for webscraping. They were written using R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
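    An illustrative sketch of this kind of fuzzy title matching (not the authors' r_2_scrape_matches script), using the stringdist package: cosine similarity on q-grams for near-identical titles and OSA distance to tolerate typos and small edits. The titles and the thresholds are assumptions for illustration.

    library(stringdist)

    core_title <- "The Watermelon Woman"
    candidates <- c("The Watermelon Woman", "Watermelon Woman, The",
                    "The Watermelon Women", "An Unrelated Film")

    cosine_sim <- stringsim(core_title, candidates, method = "cosine", q = 2)
    osa_sim    <- stringsim(core_title, candidates, method = "osa")

    # Keep candidates that clear a similarity threshold under either method.
    data.frame(candidates, cosine_sim, osa_sim,
               match = cosine_sim > 0.9 | osa_sim > 0.8)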

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that only for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the earlier failures were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It reports the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row. This

