49 datasets found
  1. Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s003
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
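
    A minimal sketch of the Iris-module warm-up described above (illustrative only, not the case study's own worked code), using base R and the built-in iris data:

    data(iris)                                        # dataset built into R
    summary(iris$Sepal.Length)                        # summary statistics
    tapply(iris$Sepal.Length, iris$Species, mean)     # per-species means
    cor(iris$Sepal.Length, iris$Petal.Length)         # correlation
    hist(iris$Sepal.Length, main = "Sepal length")    # histogram
    plot(Petal.Length ~ Sepal.Length, data = iris, col = iris$Species)  # scatter plot by species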

  2. Datasets used in the benchmarking study of MR methods

    • data.niaid.nih.gov
    Updated Aug 4, 2024
    Cite
    Xianghong, Hu (2024). Datasets used in the benchmarking study of MR methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10929571
    Explore at:
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    Xianghong, Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We conducted a benchmarking analysis of 16 summary-level data-based MR methods for causal inference with five real-world genetic datasets, focusing on three key aspects: type I error control, the accuracy of causal effect estimates, and replicability and power.

    The datasets used in the MR benchmarking study can be downloaded here:

    "dataset-GWASATLAS-negativecontrol.zip": the GWASATLAS dataset for evaluation of type I error control in confounding scenario (a): Population stratification

    "dataset-NealeLab-negativecontrol.zip": the Neale Lab dataset for evaluation of type I error control in confounding scenario (a): Population stratification;

    "dataset-PanUKBB-negativecontrol.zip": the Pan UKBB dataset for evaluation of type I error control in confounding scenario (a): Population stratification;

    "dataset-Pleiotropy-negativecontrol": the dataset used for evaluation of type I error control in confounding scenario (b): Pleiotropy;

    "dataset-familylevelconf-negativecontrol.zip": the dataset used for evaluation of type I error control in confounding scenario (c): Family-level confounders;

    "dataset_ukb-ukb.zip": the dataset used for evaluation of the accuracy of causal effect estimates;

    "dataset-LDL-CAD_clumped.zip": the dataset used for evaluation of replicability and power;

    Each of the datasets contains the following files:

    "Tested Trait pairs": the exposure-outcome trait pairs to be analyzed;

    "MRdat" refers to the summary statistics after performing IV selection (p-value < 5e-05) and PLINK LD clumping with a clumping window size of 1000kb and an r^2 threshold of 0.001.

    "bg_paras" are the estimated background parameters "Omega" and "C" which will be used for MR estimation in MR-APSS.

    Note:

    Supplemental Tables S1-S7 (.xlsx) provide the download links for the original GWAS summary-level data for the traits used as exposures or outcomes.

    The formatted datasets after quality control can be accessed at our GitHub website (https://github.com/YangLabHKUST/MRbenchmarking).

    The details on quality control of GWAS summary statistics, formatting GWASs, and LD clumping for IV selection can be found on the MR-APSS software tutorial on the MR-APSS website (https://github.com/YangLabHKUST/MR-APSS).

    R code for running MR methods is also available at https://github.com/YangLabHKUST/MRbenchmarking.
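
    As a rough illustration of the IV-selection threshold described for the "MRdat" files, a hedged base-R sketch (the file name and the column names "SNP" and "pval.exp" are assumptions; the PLINK LD-clumping step itself is not reproduced here):

    mrdat <- read.table("MRdat", header = TRUE)   # one exposure-outcome pair (assumed layout)
    iv <- subset(mrdat, pval.exp < 5e-05)         # instrument candidates at the stated threshold
    nrow(iv)                                      # number of SNPs passing IV selection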

  3. Codes in R for spatial statistics analysis, ecological response models and...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 24, 2025
    Cite
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya (2025). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. http://doi.org/10.5281/zenodo.7603557
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    D. W. Rössel-Ramírez; D. W. Rössel-Ramírez; J. Palacio-Núñez; J. Palacio-Núñez; S. Espinosa; S. Espinosa; J. F. Martínez-Montoya; J. F. Martínez-Montoya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in the RStudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to run all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We chose this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the codes used to run the GLM and DOMAIN:

    First, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points with regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct choice of the number of points (Pers. Obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).

    Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, 10-fold cross-validation was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), which gave the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots are the partial dependence blocks, as a function of each predictor variable.

    Subsequently, the correlation among the variables is computed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). A bivariate correlation threshold of ±0.70 is recommended for discarding highly correlated variables (e.g., Awan et al., 2021).

    Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing; Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the significance (p-value) of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots show the probability of occurrence against values for continuous variables or categories for discrete variables. The points of the presence and background training groups are also included.
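
    A hedged sketch of the per-variable GLM step just described (the data object 'train' with a 0/1 'presence' column and the predictor 'bio1' are placeholders, not the archived Code7_GLM_model.R):

    m1 <- glm(presence ~ poly(bio1, 2), family = binomial, data = train)  # linear + quadratic response
    summary(m1)$coefficients                       # keep the variable if p <= 0.05

    # response curve: predicted probability of occurrence across the variable's range
    newdat <- data.frame(bio1 = seq(min(train$bio1), max(train$bio1), length.out = 100))
    newdat$p <- predict(m1, newdat, type = "response")
    plot(p ~ bio1, data = newdat, type = "l")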

    In addition, a global GLM was also run, and the generalized model is evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and to avoid a high percentage of type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                    Validation set
    Model           True            False
    Presence        A               B
    Background      C               D

    We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses how well presences and absences are predicted while accounting for prevalence (Olden and Jackson, 2002). The TSS gives equal weight to sensitivity and specificity and corrects for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
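
    As a worked example of these metrics using the cell labels A-D defined in Table 1 (the counts below are illustrative only):

    A <- 40; B <- 10; C <- 35; D <- 15            # true presences, false presences, true backgrounds, false backgrounds
    overall     <- (A + C) / (A + B + C + D)      # proportion of correctly predicted cases
    sensitivity <- A / (A + D)                    # true positive rate
    specificity <- C / (C + B)                    # true negative rate
    TSS         <- sensitivity + specificity - 1  # True Skill Statistic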

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model predicts the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) the ROC/AUC curve for model validation, where an optimal performance threshold is estimated so as to have an expected confidence of 75% to 99% probability (DeLong et al., 1988).
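
    A hedged sketch of this DOMAIN workflow with the dismo package (object names are placeholders; the archived Code8_DOMAIN_SuitHab_model.R is the authoritative version):

    library(dismo)
    # 'predictors' is a RasterStack of the predictor variables,
    # 'pres_train' the 75% training presences, 'pres_test'/'back_test' the 25% test sets
    dm   <- domain(predictors, pres_train)                       # Gower-distance (DOMAIN) model
    suit <- predict(predictors, dm)                              # habitat suitability surface
    ev   <- evaluate(p = pres_test, a = back_test, model = dm, x = predictors)
    ev                                                           # AUC and threshold-dependent statistics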

  4. Time-Series Matrix (TSMx): A visualization tool for plotting multiscale...

    • dataverse.harvard.edu
    Updated Jul 8, 2024
    Cite
    Georgios Boumis; Brad Peter (2024). Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends [Dataset]. http://doi.org/10.7910/DVN/ZZDYM9
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Georgios Boumis; Brad Peter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends

    TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014.

    This publication entry also includes an excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624

    [Figure: TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).]
    TSMx R script:

    # import packages
    library(dplyr)
    library(readr)
    library(ggplot2)
    library(tibble)
    library(tidyr)
    library(forcats)
    library(Kendall)
    options(warn = -1) # disable warnings

    # read data (.csv file with "Year" and "Value" columns)
    data <- read_csv("EVI.csv")

    # prepare row/column names for output matrices
    years <- data %>% pull("Year")
    r.names <- years[-length(years)]
    c.names <- years[-1]
    years <- years[-length(years)]

    # initialize output matrices
    sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

    # function to return remaining years given a start year
    getRemain <- function(start.year) {
      years <- data %>% pull("Year")
      start.ind <- which(data[["Year"]] == start.year) + 1
      remain <- years[start.ind:length(years)]
      return(remain)
    }

    # function to subset data for a start/end year combination
    splitData <- function(end.year, start.year) {
      keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
      batch <- data[keep, ]
      return(batch)
    }

    # function to fit linear regression and return slope direction
    fitReg <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(sign(slope))
    }

    # function to fit linear regression and return slope magnitude
    fitRegv2 <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(slope)
    }

    # function to implement Mann-Kendall (MK) trend test and return significance
    # the test is implemented only for n >= 8
    getMann <- function(batch) {
      if (nrow(batch) >= 8) {
        mk <- MannKendall(batch[['Value']])
        pval <- mk[['sl']]
      } else {
        pval <- NA
      }
      return(pval)
    }

    # function to return slope direction for all combinations given a start year
    getSign <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      signs <- lapply(combs, fitReg)
      return(signs)
    }

    # function to return MK significance for all combinations given a start year
    getPval <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      pvals <- lapply(combs, getMann)
      return(pvals)
    }

    # function to return slope magnitude for all combinations given a start year
    getMagn <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      magns <- lapply(combs, fitRegv2)
      return(magns)
    }

    # retrieve slope direction, MK significance, and slope magnitude
    signs <- lapply(years, getSign)
    pvals <- lapply(years, getPval)
    magns <- lapply(years, getMagn)

    # fill in output matrices
    dimension <- nrow(sign.matrix)
    for (i in 1:dimension) {
      sign.matrix[i, i:dimension] <- unlist(signs[i])
      pval.matrix[i, i:dimension] <- unlist(pvals[i])
      slope.matrix[i, i:dimension] <- unlist(magns[i])
    }
    sign.matrix <-...

  5. Brazil Gross Value Added: by Activity: Current Prices: Architectural...

    • ceicdata.com
    Updated Aug 8, 2021
    + more versions
    Cite
    CEICdata.com (2021). Brazil Gross Value Added: by Activity: Current Prices: Architectural Services, Engineering, Testing / Technical Analysis and R & D [Dataset]. https://www.ceicdata.com/en/brazil/sna-2008-gross-value-added-by-activity-current-prices/gross-value-added-by-activity-current-prices-architectural-services-engineering-testing--technical-analysis-and-r--d
    Explore at:
    Dataset updated
    Aug 8, 2021
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2010 - Dec 1, 2016
    Area covered
    Brazil
    Variables measured
    Gross Domestic Product
    Description

    Brazil Gross Value Added: by Activity: Current Prices: Architectural Services, Engineering, Testing / Technical Analysis and R & D data was reported at 39,471.000 BRL mn in 2016. This records a decrease from the previous number of 44,866.000 BRL mn for 2015. Brazil Gross Value Added: by Activity: Current Prices: Architectural Services, Engineering, Testing / Technical Analysis and R & D data is updated yearly, averaging 40,146.000 BRL mn from Dec 2010 (Median) to 2016, with 7 observations. The data reached an all-time high of 46,499.000 BRL mn in 2014 and a record low of 30,003.000 BRL mn in 2010. Brazil Gross Value Added: by Activity: Current Prices: Architectural Services, Engineering, Testing / Technical Analysis and R & D data remains active status in CEIC and is reported by Brazilian Institute of Geography and Statistics. The data is categorized under Brazil Premium Database’s National Accounts – Table BR.AC001: SNA 2008: Gross Value Added: by Activity: Current Prices.

  6. Data from: EWAS of lung function in Latinos with asthma - Summary Statistics...

    • portalciencia.ull.es
    • data.niaid.nih.gov
    • +1more
    Updated 2021
    Cite
    Herrera-Luis, Esther; Li, Annie; Mak, Angel C. Y.; Perez-Garcia, Javier; Elhawary, Jennifer R.; Oh, Sam S.; Hu, Donglei; Eng, Celeste; Keys, Kevin L.; Huntsman, Scott; Beckman, Kenneth B.; Borrell, Luisa N.; Rodriguez-Santana, Jose; Burchard, Esteban G.; Pino-Yanes, Maria (2021). EWAS of lung function in Latinos with asthma - Summary Statistics [Dataset]. https://portalciencia.ull.es/documentos/668fc446b9e7c03b01bd86a2?lang=ca
    Explore at:
    Dataset updated
    2021
    Authors
    Herrera-Luis, Esther; Li, Annie; Mak, Angel C. Y.; Perez-Garcia, Javier; Elhawary, Jennifer R.; Oh, Sam S.; Hu, Donglei; Eng, Celeste; Keys, Kevin L.; Huntsman, Scott; Beckman, Kenneth B.; Borrell, Luisa N.; Rodriguez-Santana, Jose; Burchard, Esteban G.; Pino-Yanes, Maria
    Description

    Summary statistics generated for the manuscript entitled "Epigenome-wide association study of lung function in Latino children and youth with asthma". Our aim was to identify DNA methylation signals associated with lung function in Latino youth with asthma and to validate previous epigenetic signals from non-Latino populations. To that end, we performed multiple epigenome-wide association studies (EWAS) of lung function measurements, analyzing whole blood from 250 Puerto Rican (PR) and 148 Mexican American (MEX) youth with asthma from the Genes-Environment and Admixture in Latino Americans (GALA II) study. The following measurements were evaluated pre- and post-albuterol administration: forced expiratory volume in one second (FEV1.Meas), forced vital capacity (FVC.Meas), and their ratio (FEV1.FVC.Meas). DNA methylation was profiled with the Infinium EPIC BeadChip or the Infinium HumanMethylation450 BeadChip array (Illumina, San Diego, CA, USA). The association of methylation beta-values with raw PFT values (in liters) was tested by robust linear regressions with correction for age, sex, height, the first three genotype principal components (PCs), in utero maternal smoking exposure, the first six ReFACTor components, and batch, when appropriate, via the limma R package. Results for individuals of the same ethnic subgroup were meta-analyzed using fixed- or random-effects models, based on Cochran's Q p-value. Version 1 is deprecated.

    The EWAS result files (*.txt) contain:

    RSID: CpG name.

    STUDY: Number of sets of individuals included in the meta-analysis.

    BETA_meta: Coefficient of the regression.

    SEBETA_meta: Standard error of the coefficient of the regression.

    PVALUE_meta: P-value for the association.

    PVALUE_Q: Cochran's Q p-value.

    Model: Fixed-effect (FE) or random-effects (RE2) model.

    PVALUE_meta_adj: False discovery rate (Benjamini & Hochberg method).
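
    A minimal sketch of the kind of robust limma model described above (this is not the authors' code; the objects 'meth', a CpG-by-sample matrix of beta-values, and 'pheno', a data frame of covariates, and all column names are assumptions):

    library(limma)
    design <- model.matrix(~ FEV1 + age + sex + height + PC1 + PC2 + PC3 +
                             smoke + RF1 + RF2 + RF3 + RF4 + RF5 + RF6, data = pheno)
    fit <- lmFit(meth, design, method = "robust")   # robust linear regression per CpG
    fit <- eBayes(fit)
    topTable(fit, coef = "FEV1", adjust.method = "BH", number = 10)   # BH-adjusted top results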

  7. Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 17, 2023
    Cite
    Price, Juan José (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
    Explore at:
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Price, Juan José
    Henningsen, Arne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

    We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4), which are available on CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available on CRAN (https://cran.r-project.org/package=micEconDistRay).

    This replication package contains the following files and folders:

    README This file

    MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

    museum: Name of the museum.

    type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

    munic: Municipality, in which the museum is located.

    yr: Year of the observation.

    units: Number of visit sites.

    resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

    vis: Number of (physical) visitors.

    aarc: Number of articles published (archeology).

    ach: Number of articles published (cultural history).

    aah: Number of articles published (art history).

    anh: Number of articles published (natural history).

    exh: Number of temporary exhibitions.

    edu: Number of primary school classes on educational visits to the museum.

    ev: Number of events other than exhibitions.

    ftesc: Scientific labor (full-time equivalents).

    ftensc: Non-scientific labor (full-time equivalents).

    expProperty: Running and maintenance costs [1,000 DKK].

    expCons: Conservation expenditure [1,000 DKK].

    ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

    prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.

    DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

    make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.

    IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

    IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.

    allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

    IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

    /tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

    /figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.

  8. Consumer Expenditure Survey (CE)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Consumer Expenditure Survey (CE) [Dataset]. http://doi.org/10.7910/DVN/UTNJAH
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the consumer expenditure survey (ce) with r

    the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray!

    for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now.

    fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.

    this new github repository contains three scripts:

    2010-2011 - download all microdata.R
    loop through every year and download every file hosted on the bls's ce ftp site
    import each of the comma-separated value files into r with read.csv
    depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

    2011 fmly intrvw - analysis examples.R
    load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
    construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
    initiate a replicate-weighted survey design object
    perform some lovely li'l analysis examples
    replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
    replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
    create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
    initiate a replicate-weighted, database-backed, multiply-imputed survey design object
    perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
    replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
    replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
    replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

    replicate integrated mean and se.R
    match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
    create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
    export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

    click here to view these three scripts for...
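
    as a hedged sketch of the "initiate a replicate-weighted survey design object" step above, written here with the survey package (the stacked 'fmly' table is assumed to exist already, and the column names finlwt21, wtrep01-wtrep44, and totexp are assumptions that should be checked against the actual scripts and bls documentation):

    library(survey)
    fmly <- readRDS("fmly.rds")                    # stacked five-quarter fmly table, built elsewhere
    ce.design <- svrepdesign(
        data = fmly,
        weights = ~finlwt21,                       # full-sample weight
        repweights = "wtrep[0-9]+",                # the 44 replicate-weight columns
        type = "BRR",
        combined.weights = TRUE,
        mse = TRUE
    )
    svymean(~totexp, ce.design, na.rm = TRUE)      # mean total expenditure with its standard error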

  9. Live tables on commercial and industrial floorspace and rateable value...

    • gov.uk
    Updated Nov 10, 2012
    Cite
    Department for Levelling Up, Housing and Communities (2012). Live tables on commercial and industrial floorspace and rateable value statistics [Dataset]. https://www.gov.uk/government/statistical-data-sets/live-tables-on-commercial-and-industrial-floorspace-and-rateable-value-statistics
    Explore at:
    Dataset updated
    Nov 10, 2012
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Levelling Up, Housing and Communities
    Description

    Commercial and industrial floorspace and rateable value statistics are now the responsibility of the Valuation Office Agency (VOA). More details are available at: https://www.gov.uk/government/collections/non-domestic-rating-business-floorspace-statistics.

    Table P401 - Commercial and industrial property: summary statistics England and Wales, 1st April, 1998-2008 (MS Excel Spreadsheet, 26 KB): https://assets.publishing.service.gov.uk/media/5a79c6d5ed915d07d35b804c/1179479.xls

    Table P402 - Commercial and industrial property: summary statistics for all bulk premises, Government Office Regions, 1st April, 1998-2008 (MS Excel Spreadsheet, 25 KB): https://assets.publishing.service.gov.uk/media/5a78989d40f0b63247698a24/1179482.xls
    
  10. Long Covid Risk

    • figshare.com
    txt
    Updated Apr 13, 2024
    Cite
    Ahmed Shaheen (2024). Long Covid Risk [Dataset]. http://doi.org/10.6084/m9.figshare.25599591.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    figshare
    Authors
    Ahmed Shaheen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature preparation

    Preprocessing was applied to the data, such as creating dummy variables and performing transformations (centering, scaling, Yeo-Johnson) using the preProcess() function from the “caret” package in R. The correlation among the variables was examined and no serious multicollinearity problems were found. A stepwise variable selection was performed using a logistic regression model. The final set of variables included: demographic: age, body mass index, sex, ethnicity, smoking; history of disease: heart disease, migraine, insomnia, gastrointestinal disease; COVID-19 history: covid vaccination, rashes, conjunctivitis, shortness of breath, chest pain, cough, runny nose, dysgeusia, muscle and joint pain, fatigue, fever, COVID-19 reinfection, and ICU admission. These variables were used to train and test various machine-learning models.

    Model selection and training

    The data was randomly split into 80% training and 20% testing subsets. The “h2o” package in R version 4.3.1 was employed to implement different algorithms. AutoML was first used, which automatically explored a range of models with different configurations. Gradient Boosting Machines (GBM), Random Forest (RF), and Regularized Generalized Linear Model (GLM) were identified as the best-performing models on our data and their parameters were fine-tuned. An ensemble method that stacked different models together was also used, as it could sometimes improve the accuracy. The models were evaluated using the area under the curve (AUC) and C-statistics as diagnostic measures. The model with the highest AUC was selected for further analysis using the confusion matrix, accuracy, sensitivity, specificity, and F1 and F2 scores. The optimal prediction threshold was determined by plotting the sensitivity, specificity, and accuracy and choosing the point of intersection, as it balanced the trade-off between the three metrics. The model’s predictions were also plotted, and the quantile ranges were used to classify the model’s prediction as follows: > 1st quantile, > 2nd quantile, > 3rd quartile and < 3rd quartile (very low, low, moderate, high), respectively.

    Metric                 Formula
    C-statistics           (TPR + TNR - 1) / 2
    Sensitivity/Recall     TP / (TP + FN)
    Specificity            TN / (TN + FP)
    Accuracy               (TP + TN) / (TP + TN + FP + FN)
    F1 score               2 * (precision * recall) / (precision + recall)

    Model interpretation

    We used the variable importance plot, which is a measure of how much each variable contributes to the prediction power of a machine learning model. In the H2O package, variable importance for GBM and RF is calculated by measuring the decrease in the model's error when a variable is split on. The more a variable's split decreases the error, the more important that variable is considered to be. The error is calculated as SE = MSE * N = VAR * N, and is then scaled between 0 and 1 and plotted. Also, we used the SHAP summary plot, which is a graphical tool to visualize the impact of input features on the prediction of a machine learning model. SHAP stands for SHapley Additive exPlanations, a method to calculate the contribution of each feature to the prediction by averaging over all possible subsets of features [28]. The SHAP summary plot shows the distribution of the SHAP values for each feature across the data instances. We use the h2o.shap_summary_plot() function in R to generate the SHAP summary plot for our GBM model. We pass the model object and the test data as arguments, and optionally specify the columns (features) we want to include in the plot. The plot shows the SHAP values for each feature on the x-axis, and the features on the y-axis. The color indicates whether the feature value is low (blue) or high (red). The plot also shows the distribution of the feature values as a density plot on the right.
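
    A hedged sketch of the AutoML workflow described above (not the authors' code; the file name and the "long_covid" outcome column are placeholders):

    library(h2o)
    h2o.init()
    df <- h2o.importFile("long_covid_features.csv")        # hypothetical preprocessed feature file
    df$long_covid <- h2o.asfactor(df$long_covid)
    splits <- h2o.splitFrame(df, ratios = 0.8, seed = 42)   # 80% training / 20% testing
    train <- splits[[1]]; test <- splits[[2]]
    aml <- h2o.automl(y = "long_covid", training_frame = train, max_models = 20, seed = 42)
    best <- aml@leader
    h2o.auc(h2o.performance(best, newdata = test))          # AUC on the held-out 20%
    h2o.shap_summary_plot(best, newdata = test)             # SHAP summary plot, as described above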

  11. PolarMorphism results

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv
    Updated Jan 13, 2022
    Cite
    Joanna von Berg; Joanna von Berg (2022). PolarMorphism results [Dataset]. http://doi.org/10.5281/zenodo.5844193
    Explore at:
    Available download formats: csv, application/gzip
    Dataset updated
    Jan 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joanna von Berg; Joanna von Berg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    polarmorphism_results.tar.gz is a directory containing PolarMorphism results with significantly shared SNPs for all pairwise combinations of the traits mentioned at the end of this description. Each row contains one SNP with the respective rsid (snpid), distance (r), angle (angle), whitened z-scores for trait 1 and trait 2 (z.whitened.*), p-value and q-value for r (r.pval and r.qval) and p-value and q-value for the angle (theta.pval and theta.qval).
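
    A minimal sketch of how one of these result files could be filtered for significantly shared SNPs (the file name and the q-value cutoff of 0.05 are assumptions; the column names follow the description above):

    res <- read.csv("trait1_trait2.csv")                  # one file from polarmorphism_results.tar.gz
    shared <- res[res$r.qval < 0.05, ]                    # SNPs with significantly shared effects
    shared <- shared[order(shared$theta.qval), ]          # rank by angle q-value (trait-specificity)
    head(shared[, c("snpid", "r", "angle", "r.qval", "theta.qval")])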

    polarmorphism_results_clumped.tar.gz is a directory containing the significant loci obtained after clumping the above results. See the accompanying preprint/paper for details.

    The following traits were used (see the file "GWAS.csv" on this page for references to the original papers and the accompanying preprint / paper for preprocessing details):

    Alzheimer's Disease, Atrial Fibrillation, Amyotrophic Lateral Sclerosis, Any stroke, Autism Spectrum Disorder, Asthma, Breast Cancer, Bipolar Disorder, Body Mass Index, Coronary Calcification, Coronary Artery Disease, Cardio-Embolic stroke, Carotid Intima-Media Thickness, Cigarettes per Day, Diastolic Blood Pressure, Depressive symptoms, Educational Attainment, Ever smoked, Forearm Bone Mass Density, Femoral Neck Bone Mass Density, Former Smoker, High-Density Lipoprotein, Height, Heart Failure, Inflammatory Bowel Disease, Insomnia, Intelligence Quotient, Any Ischemic stroke, Large Artery Stroke, Low-Density Lipoprotein, Onset Smoking, Lumbar Spine Bone Mass Density, Major Depression Disorder, Neuroticism, Nonischemic Cardiomyopathy, Parkinson's Disease, Plaque Presence, Pulse Pressure, Prostate Cancer, Systolic Blood Pressure, Small Vessel Disease, Subjective Well-Being, Type 2 Diabetes, Type 2 Diabetes adjusted for BMI, Total Cholesterol, Triglycerides

  12. SYD ALL climate data statistics summary

    • researchdata.edu.au
    • demo.dev.magda.io
    Updated Mar 13, 2019
    Cite
    Bioregional Assessment Program (2019). SYD ALL climate data statistics summary [Dataset]. https://researchdata.edu.au/syd-all-climate-statistics-summary/2989432
    Explore at:
    Dataset updated
    Mar 13, 2019
    Dataset provided by
    data.gov.au
    Authors
    Bioregional Assessment Program
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    There are 4 csv files here:

    BAWAP_P_annual_BA_SYB_GLO.csv

    Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.

    Source data: annual BILO rainfall on \\wron\Project\BA\BA_N_Sydney\Working\li036_Lingtao_LI\Grids\BILO_Rain_Ann\

    P_PET_monthly_BA_SYB_GLO.csv

    Long-term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month.

    Climatology_Trend_BA_SYB_GLO.csv

    Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables we calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

    Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv

    Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). All data used in this analysis came directly from James Risbey, CMAR, Hobart. As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
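
    A hedged sketch of how Climatology_Trend-style statistics could be derived from the monthly file (the column names year, month and BAWAP_P are assumptions about the csv layout, not the Programme's code):

    monthly <- read.csv("P_PET_monthly_BA_SYB_GLO.csv")      # expects year, month, BAWAP_P columns
    sub <- monthly[monthly$year >= 1981 & monthly$year <= 2012, ]
    clim <- do.call(rbind, lapply(split(sub, sub$month), function(d) {
        avg <- mean(d$BAWAP_P); s <- sd(d$BAWAP_P)
        data.frame(month = d$month[1], average = avg, maximum = max(d$BAWAP_P),
                   minimum = min(d$BAWAP_P), avg_plus_sd = avg + s, avg_minus_sd = avg - s,
                   stddev = s, trend = coef(lm(BAWAP_P ~ year, data = d))[2])
    }))
    clim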

    Dataset History

    The dataset was created from various BILO source data, including monthly BILO rainfall, Tmax, Tmin, VPD, etc., and other source data including monthly Penman PET (calculated by Randall Donohue) and correlation coefficient data from James Risbey.

    Dataset Citation

    Bioregional Assessment Programme (XXXX) SYD ALL climate data statistics summary. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/b0a6ccf1-395d-430e-adf1-5068f8371dea.

    Dataset Ancestors

    * Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012

  13. CONTENT -- Multi-context genetic modeling TWAS summary statistics

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    + more versions
    Cite
    Zaitlen, Noah (2022). CONTENT -- Multi-context genetic modeling TWAS summary statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208182
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Halperin, Eran
    Zaitlen, Noah
    Gusev, Alexander
    Balliu, Brunilda
    Lu, Andrew
    Chun, Jimmie Ye
    Gordon, Mary Grace
    Thompson, Mike
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide the summary statistics of running CONTENT, the context-by-context approach, and UTMOST on over 22 phenotypes. The phenotypes are listed in the manuscript, and their respective studies and sample sizes can be found in a table under the supplementary section of the manuscript. All 3 methods were trained on GTEx v7 as well as CLUES, a single-cell RNA sequencing dataset of PBMCs. The data include the gene name, model, cross-validated R^2, prediction p-value, TWAS p-value, TWAS Z score, and a column titled "hFDR" indicating whether the association was statistically significant while employing hierarchical FDR. The benefits of employing such an approach for all methods can be found in the manuscript.

  14. Data from: A comment on the use of stochastic character maps to estimate...

    • datadryad.org
    zip
    Updated Oct 19, 2012
    Cite
    Liam J. Revell (2012). A comment on the use of stochastic character maps to estimate evolutionary rate variation in a continuously valued trait [Dataset]. http://doi.org/10.5061/dryad.8mj66m5c
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 19, 2012
    Dataset provided by
    Dryad
    Authors
    Liam J. Revell
    Time period covered
    2012
    Description

    Phylogenetic comparative biology has progressed considerably in recent years. One of the most important developments has been the application of likelihood-based methods to fit alternative models for trait evolution in a phylogenetic tree with branch lengths proportional to time. An important example of this type of method is O’Meara et al.’s (2006) “noncensored” test for variation in the evolutionary rate for a continuously valued character trait through time or across the branches of a phylogenetic tree. According to this method, we first hypothesize evolutionary rate regimes on the tree (called “painting” in Butler and King, 2004); and then we fit an evolutionary model, specifically the popular Brownian model, in which the instantaneous variance of the Brownian random diffusion process has different values in different parts of the phylogeny. The authors suggest that to test a hypothesis that the state of a discrete character influenced the rate of a continuous character, one could u...
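
    A hedged, self-contained illustration of the approach being discussed, using simulated data with the phytools package (this is not the archived code accompanying the comment):

    library(phytools)
    tree <- pbtree(n = 50)                                      # simulated pure-birth tree
    disc <- setNames(sample(c("0", "1"), 50, replace = TRUE), tree$tip.label)  # discrete character
    cont <- fastBM(tree)                                        # simulated continuous trait
    smap <- make.simmap(tree, disc, model = "ER", nsim = 1)     # one stochastic character map ("painting")
    fit  <- brownie.lite(smap, cont)                            # O'Meara et al. (2006) noncensored test
    fit                                                         # single-rate vs. multi-rate Brownian fit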

  15. Output Speed vs. Price by Command-R Endpoint

    • artificialanalysis.ai
    Updated Aug 15, 2024
    + more versions
    Cite
    Artificial Analysis (2024). Output Speed vs. Price by Command-R Endpoint [Dataset]. https://artificialanalysis.ai/models/command-r
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comprehensive comparison of Output Speed (Output Tokens per Second) vs. Price (USD per M Tokens) by Model

  16. 96 wells fluorescence reading and R code statistic for analysis

    • zenodo.org
    bin, csv, doc, pdf
    Updated Aug 2, 2024
    + more versions
    Cite
    JVD Molino; JVD Molino (2024). 96 wells fluorescence reading and R code statistic for analysis [Dataset]. http://doi.org/10.5281/zenodo.1119285
    Explore at:
    Available download formats: doc, csv, pdf, bin
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    JVD Molino; JVD Molino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Data points present in this dataset were obtained following the subsequent steps: To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker, set to 150 rpm, under constant illumination (50 μmol photons/m2s). Then, 100 μL samples were transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland). Fluorescence was measured at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R Statistic version 3.3.3 was used to perform one-way ANOVA (with Tukey's test); to test statistical hypotheses, the significance level was set at 0.05. Graphs were generated in RStudio v1.0.136. The codes are deposited herein.
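
    A minimal sketch of the described comparison in base R (column names are assumptions; the deposited ANOVA_Turkey_Sub.R is the authoritative script): one-way ANOVA followed by Tukey's HSD at alpha = 0.05.

    dat <- read.csv("sup_raw.csv")                        # supernatant fluorescence, 96 colonies per construct
    dat$construct <- as.factor(dat$construct)             # 'construct' column name is an assumption
    fit <- aov(fluorescence ~ construct, data = dat)      # 'fluorescence' column name is an assumption
    summary(fit)                                          # one-way ANOVA table
    TukeyHSD(fit, conf.level = 0.95)                      # pairwise comparisons between constructs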

    Info

    ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R statistic 3.3.3

    barplot_R.R -> code to generate bar plot in R statistic 3.3.3

    boxplotv2.R -> code to generate boxplot in R statistic 3.3.3

    pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.

    who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.

    who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.

    Anova_Output_Summary_Guide.pdf -> explains the content of the ANOVA files

    ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii

    ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.

    Consider citing our work.

    Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433

  17. Mitoplate S-1 analysis using R

    • data.mendeley.com
    Updated Mar 5, 2020
    Cite
    Flavia Radogna (2020). Mitoplate S-1 analysis using R [Dataset]. http://doi.org/10.17632/b9mprfdvmv.1
    Explore at:
    Dataset updated
    Mar 5, 2020
    Authors
    Flavia Radogna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This R script performs normalisation of data obtained with the MitoPlate S-1, commercialised by Biolog. In addition, it creates a scatterplot of initial rate values between conditions of interest. The script includes a first normalisation step using the "No substrate" well (A1), applied to rows A to H, and a second normalisation step using the "L-Malic Acid 100 µM" well (G1), required only for rows G and H. Initial rate values are calculated as the slope of a linear regression fitted between 30 minutes and 2 hours.
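
    A minimal sketch of the steps described above, not the deposited script itself; the layout of the plate-reader export (a data frame with columns time_h, well, and value) and the use of subtraction for the normalisation are assumptions for illustration.

        # Assumed input: data frame 'plate' with columns time_h (hours), well ("A1".."H12"), value (raw signal)
        blank_a1 <- plate[plate$well == "A1", c("time_h", "value")]   # "No substrate" control well
        names(blank_a1)[2] <- "blank"

        norm <- merge(plate, blank_a1, by = "time_h")
        norm$value_norm <- norm$value - norm$blank        # first normalisation step (subtraction assumed)
        # A second, analogous step using well G1 ("L-Malic Acid 100 uM") would apply to rows G and H only.

        # Initial rate: slope of a linear regression fitted between 30 minutes and 2 hours
        window <- norm[norm$time_h >= 0.5 & norm$time_h <= 2, ]
        initial_rate <- sapply(split(window, window$well),
                               function(d) coef(lm(value_norm ~ time_h, data = d))[2])

    The per-well initial rates can then be paired by condition and passed to plot() to reproduce the scatterplot between the conditions of interest.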

  18. Factors Affecting United States Geological Survey Irrigation Freshwater...

    • search.dataone.org
    • beta.hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Cite
    J. Levi Manley (2023). Factors Affecting United States Geological Survey Irrigation Freshwater Withdrawal Estimates In Utah: PRISM Analysis Results and R Codes [Dataset]. https://search.dataone.org/view/sha256%3A4a8b3f77b51143a5d1f90ddaca426072477db8937941265e67db7bce8f083e08
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    J. Levi Manley
    Time period covered
    Jan 1, 1895 - Sep 30, 2020
    Area covered
    Description

    This resource explains and contains the methodology, R codes, and results of the PRISM freshwater-supply key-indicator analysis for my thesis. For more information, see my thesis at the USU Digital Commons.

    Freshwater availability in the state can be summarized using streamflow, reservoir level, precipitation, and temperature data. Climate data for this study have a period of record greater than 30 years, preferably extending beyond 1950, and are representative of natural conditions at the county level.

    PRISM precipitation and temperature gridded data from Oregon State University's Northwest Alliance for Computational Science and Engineering are representative of conditions from the statewide to the county level for 1895-2015. These data are available online from the PRISM Climate Group. Monthly PRISM 4 km raster grids were downloaded using the R ‘prism’ package. Boundary shapefiles for the state of Utah and for each county were obtained online from the Utah Geospatial Resource Center webpage. Using the R ‘rgdal’ and ‘sp’ packages, these shapefiles were transformed from their native World Geodetic System 1984 coordinate system to match the PRISM BIL rasters' native North American Datum 1983 coordinate system.

    Using the R ‘raster’ package, medians of the PRISM precipitation grids were calculated for each spatial area of interest and summed over water years and seasons. Medians of the PRISM temperature grids were calculated in the same way and averaged over water years and seasons. For single-month analyses, the median results were used for all PRISM indicators. Seasons were assigned to the calendar year in which they fall, with winter as the first season of each year.

    For each temporal/spatial delineation, the freshwater availability key indicators were separated non-parametrically into quintiles: Very Wet/Very High/Hot (top 20% of values), Wet/High/Hot (60-80%), Moderate/Mid-level (40-60%), Dry/Low/Cool (20-40%), and Very Dry/Very Low/Cool (bottom 20%). Each quintile bin was assigned a rank value of 1-5, with ‘5’ for the top quintile, in preparation for the Kendall Tau-b correlation analysis. These ranks, along with the USGS irrigation withdrawal and acreage data, were loaded into R: state-level quintile results were matched by USGS report year, and county-level quintile results were matched with the corresponding county-level USGS irrigation withdrawal and acreage data per report year. Correlation matrices were produced for all areas of interest using the base R function cor() with the “kendall” method (which is, by default, the Kendall Tau-b calculation), and the USGS irrigation withdrawal and acreage correlation matrices were visualised with the R ‘corrplot’ package.
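
    A condensed, hedged sketch of this pipeline in R is shown below, assuming the current API of the ‘prism’ package (prism_set_dl_dir(), get_prism_monthlys()); the grid file path, the shapefile layer name, and the withdrawals vector are hypothetical placeholders rather than the thesis code.

        library(prism); library(raster); library(rgdal); library(sp)

        prism_set_dl_dir("prism_data")                                            # download location
        get_prism_monthlys(type = "ppt", years = 1990:2015, mon = 1:12, keepZip = FALSE)

        ppt      <- raster("prism_data/PRISM_ppt_201506.bil")                     # hypothetical grid file name
        counties <- readOGR("gis", "utah_counties")                               # hypothetical WGS84 shapefile layer
        counties <- spTransform(counties, crs(ppt))                               # reproject to PRISM's NAD83 grid

        med_ppt <- extract(ppt, counties, fun = median, na.rm = TRUE)             # county-level medians

        # Quintile ranks (1 = Very Dry ... 5 = Very Wet) and Kendall tau-b against USGS data
        rank5 <- cut(med_ppt, quantile(med_ppt, probs = seq(0, 1, 0.2)),
                     labels = FALSE, include.lowest = TRUE)
        cor(rank5, withdrawals, method = "kendall")                               # 'withdrawals' is a hypothetical vector
        corrplot::corrplot(cor(cbind(rank5, withdrawals), method = "kendall"))    # correlation matrix plot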

    See Word file for an Example PRISM Analysis, made by Alan Butler at the United States Bureau of Reclamation, which was used as a guide for this analysis.

  19. Intelligence vs. Price by Command-R Endpoint

    • artificialanalysis.ai
    Updated Aug 15, 2024
    + more versions
    Cite
    Artificial Analysis (2024). Intelligence vs. Price by Command-R Endpoint [Dataset]. https://artificialanalysis.ai/models/command-r
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comprehensive comparison of Artificial Analysis Intelligence Index vs. Price (USD per M Tokens) by Model

  20. Open Market Operations – 2009 to Current

    • researchdata.edu.au
    Updated May 12, 2013
    + more versions
    Cite
    Reserve Bank of Australia (2013). Open Market Operations – 2009 to Current [Dataset]. https://researchdata.edu.au/open-market-operations-2009-current/2999218
    Explore at:
    Dataset updated
    May 12, 2013
    Dataset provided by
    data.gov.au
    Authors
    Reserve Bank of Australia
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    ‘System cash position’ is an estimate of the change in the aggregate level of Exchange Settlement (ES) balances at the RBA, prior to the RBA’s open market operations on that day. A negative value indicates a projected fall in the level of ES balances, while a positive value indicates a projected rise. The estimate is based on information about settlements arising from transactions by the RBA’s clients, including the Australian Government, as well as the RBA’s own transactions, and is announced at 9:30 am each trading day.

    ‘Outright transactions’ is the cash value of purchases and sales, conducted as part of the Bank’s open market operations, of securities issued by the Australian Government and State and Territory central borrowing authorities with remaining terms to maturity up to around 18 months. A positive value indicates the RBA has purchased securities, while a negative value indicates the RBA has sold securities.

    ‘Foreign exchange swaps’ is the aggregate value of the first leg of foreign exchange swaps transacted for same-day value specifically for domestic liquidity management purposes. A positive value indicates the RBA has sold Australian dollars for foreign currency, while a negative value indicates the RBA has purchased Australian dollars. The value of the second leg of a foreign exchange swap is captured in the ‘System cash position’ on the unwind date.

    ‘Repurchase agreements (RPs)’ is the amount of the first leg of securities bought/sold by the RBA under repurchase agreement (RP). ‘General Collateral’ refers to eligible securities issued by the Australian Government, State and Territory governments, supranational institutions, foreign governments and government agencies, as well as eligible securities with a sovereign government guarantee. ‘Private securities’ covers all other eligible collateral, including ADI-issued securities (eligible bank-issued discount securities and certificates of deposit with 12 months or less to maturity, and bonds issued by ADIs), asset-backed securities (eligible residential mortgage-backed securities and asset-backed commercial paper) and eligible commercial paper. A positive value indicates the RBA has purchased securities under RPs, while a negative value indicates the RBA has sold securities under RPs. It does not include RPs transacted through the RBA’s overnight RP facility. The value of the second leg of all RPs is captured in the ‘System cash position’ on the respective value dates.

    ‘Exchange Settlement account balances (end day)’ is the aggregate of all ES balances held at the RBA at the close of business. Unexpected movements in ES balances and overnight RPs transacted through the RBA’s overnight RP facility mean that ‘Exchange Settlement account balances (end day)’ will not necessarily be the sum of the previous day’s ‘Exchange Settlement account balances (end day)’, the ‘System cash position’ and the total of ‘Open market operations’ transacted.

    ‘Overnight repurchase agreements with RBA’ is the aggregate of the first leg of securities bought by the RBA through the overnight RP facility. These data are updated with a one-month lag.

    Outright Transaction Details

    The 'Outright Transactions Details' sheet provides further information on the outright purchases and sales of Bonds and Discount Securities issued by the Australian Commonwealth, State & Territory Governments, conducted as part of the Bank's open market operations. “Issuer” is the acronym of the issuer of the bond/security. A positive “Face value dealt” indicates a purchase, while a negative value indicates a sale. 'Weighted average rate' is the average of the rates dealt for each bond/security, weighted by the amount transacted. 'Cut-off rate' is the lowest yield accepted.
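
    As a small illustration of the 'Weighted average rate' definition above, it is simply a face-value-weighted mean of the dealt rates; in R, with made-up numbers for illustration only:

        face_value <- c(50, 120, 30)          # face value dealt per bond, $ million (illustrative)
        rate       <- c(4.25, 4.31, 4.28)     # rate dealt on each bond, per cent (illustrative)
        weighted.mean(rate, w = face_value)   # the 'Weighted average rate' for the operation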

    Repo Details

    The Repo Details sheets provide a summary of the type of securities delivered to/by the RBA under RP at each term dealt through the open market operations. 'Govt and Quasi-Govt Repo Details' covers repo against General Collateral (eligible securities issued by the Australian Government, State and Territory governments, supranational institutions, foreign governments and government agencies, as well as eligible securities with a sovereign government guarantee). 'Private securities' covers all other eligible collateral, including ADI-issued securities (eligible bank-issued discount securities and certificates of deposit with 12 months or less to maturity, and bonds issued by ADIs), asset-backed securities (eligible residential mortgage-backed securities and asset-backed commercial paper) and eligible commercial paper.

    'Term' is the number of days dealt in open market operations.

    'Value Dealt' is the amount of the first leg of securities bought/sold by the RBA under RP.

    'Weighted average rate' is the average of the rates on RPs dealt by the RBA through open market operations, weighted by the amount transacted.

    'Cut-off rate' is the lowest rate dealt by the RBA through open market operations for each term dealt.

    Repo Unwinds

    The Repo Unwinds sheet provides a summary of the value of repurchase agreements due to unwind in the future, for both General Collateral and Private Securities. The unwind amount is equal to the sum of the total value dealt to that date plus accrued interest.
