26 datasets found
  1. f

    Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s003
    Explore at:
    docxAvailable download formats
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.

  2. f

    Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    figshare
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. 
It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

  3. d

    funspace: an R package to build, analyze and plot functional trait spaces

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Feb 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Carlos Perez Carmona; Nicola Pavanetto; Giacomo Puglielli (2024). funspace: an R package to build, analyze and plot functional trait spaces [Dataset]. http://doi.org/10.5061/dryad.4tmpg4fg6
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Dryad
    Authors
    Carlos Perez Carmona; Nicola Pavanetto; Giacomo Puglielli
    Time period covered
    2023
    Description

    funspace - Creating and representing functional trait spaces

    Estimation of functional spaces based on traits of organisms. The package includes functions to impute missing trait values (with or without considering phylogenetic information), and to create, represent and analyse two dimensional functional spaces based on principal components analysis, other ordination methods, or raw traits. It also allows for mapping a third variable onto the functional space.

    Description of the Data and file structure

    We provide the package as a .tar file (filename: funspace_0.1.1.tar). Once the package has been downloaded, it can be directly uploaded in R from Packages >> Install >> Install from >> Package Archive File (.zip, .tar.gz). All the functions and example datasets included in funspace and that are necessary to reproduce the worked example in the paper will be automatically uploaded. Functions and example datasets can be then accessed using the standard syntax funspace:

    Detailed ...

  4. o

    Lost in the Code?

    • explore.openaire.eu
    Updated Mar 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Luis Eduardo Muñoz (2023). Lost in the Code? [Dataset]. http://doi.org/10.5281/zenodo.7589898
    Explore at:
    Dataset updated
    Mar 13, 2023
    Authors
    Luis Eduardo Muñoz
    Description

    The community behind R is built by inspired scientists that share their tools and knowledge freely to encourage equal access for all aspiring researchers and championing academic integrity. The tools available through R aid in every step of data analysis; including creating experiments, cataloging and organizing data, analyzing the results, and visualizing our findings all in one software environment. The power of programming also increases the flexibility and automation of these tasks saving an abundance of time and ensuring each step can be accurately reproduced. Often, courses that use the R software to demonstrate statistical concepts face the dual challenge of introducing two distinct and equally intricate topics at once; programming and statistics. In most cases, the focus must be shifted away from programming due to constraints on time and breadth to the potential confusion and dismay (repeated appearance of error messages) of novice learners in statistics. This workshop aims to provide a solid foundation of programming concepts such that attendees can confidently approach more advanced statistical courses or independently improve their statistical skills. Many of the ideas that will be covered can apply to many different programming languages, despite R being the main tool. Online recordings. Part 1: https://youtu.be/3zUkPvYTePo Part 2: https://youtu.be/Knjbu6JwNI0 When reading through the word documents with exercises, please use the keyboard shortcut "Ctrl + *" ("Command + *" for Mac) to show the hidden text that provides hints and advice for solving the exercises.

  5. f

    Data_Sheet_6_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_6_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s006
    Explore at:
    docxAvailable download formats
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.

  6. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

    A US bike-sharing provider BoomBikes has recently suffered considerable dip in their revenue due to the Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.

    In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

    They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

    • Which variables are significant in predicting the demand for shared bikes.
    • How well those variables describe the bike demands

    Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

    Business Goal:

    You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

    Data Preparation:

    1. You can observe in the dataset that some of the variables like 'weathersit' and 'season' have values as 1, 2, 3, 4 which have specific labels associated with them (as can be seen in the data dictionary). These numeric values associated with the labels may indicate that there is some order to them - which is actually not the case (Check the data dictionary and think why). So, it is advisable to convert such feature values into categorical string values before proceeding with model building. Please refer the data dictionary to get a better understanding of all the independent variables.
    2. You might notice the column 'yr' with two values 0 and 1 indicating the years 2018 and 2019 respectively. At the first instinct, you might think it is a good idea to drop this column as it only has two values so it might not be a value-add to the model. But in reality, since these bike-sharing systems are slowly gaining popularity, the demand for these bikes is increasing every year proving that the column 'yr' might be a good variable for prediction. So think twice before dropping it.

    Model Building:

    In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number casual users who have made a rental. The variable 'registered' on the other hand shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.

    Model Evaluation:

    When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set. python from sklearn.metrics import r2_score r2_score(y_test, y_pred) - where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set. - Please perform this step as the R-squared score on the test set holds as a benchmark for your model.

  7. h

    BR-TaxQA-R

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    unicamp-dl, BR-TaxQA-R [Dataset]. https://huggingface.co/datasets/unicamp-dl/BR-TaxQA-R
    Explore at:
    Dataset authored and provided by
    unicamp-dl
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Retrieval Augmented Generation (RAG) dataset for Brazilian Federal Revenue Service (Receita Federal do Brasil ― RFB)

    This dataset aims to explore the capabilities and performance of RAG-like systems, focused in the Brazilian Legal Domain, more specifically the Tax Law. This dataset is initially built upon a Question & Answers document the RFB releases every year since 2016, where common questions regarding Personal Income Tax are answered with explicit references to official legal… See the full description on the dataset page: https://huggingface.co/datasets/unicamp-dl/BR-TaxQA-R.

  8. n

    Data from: TipDatingBeast: an R package to assist the implementation of...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Sep 22, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adrien Rieux; Camilo E. Khatchikian (2016). TipDatingBeast: an R package to assist the implementation of phylogenetic tip-dating tests using BEAST [Dataset]. http://doi.org/10.5061/dryad.43q71
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 22, 2016
    Dataset provided by
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    University of Pennsylvania
    Authors
    Adrien Rieux; Camilo E. Khatchikian
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Molecular tip-dating of phylogenetic trees is a growing discipline that uses DNA sequences sampled at different points in time to co-estimate the timing of evolutionary events with rates of molecular evolution. In this context, BEAST, a program for Bayesian analysis of molecular sequences, is the most widely used phylogenetic tool. Here, we introduce TipDatingBeast, an R package built to assist the implementation of various phylogenetic tip-dating tests using BEAST. TipDatingBeast currently contains two main functions. The first one allows preparing date-randomization analyses, which assess the temporal signal of a dataset. The second function allows performing leave-one-out analyses, which test for the consistency between independent calibration sequences and allow pinpointing those leading to potential bias. We apply those functions to an empirical dataset and supply practical guidance for results interpretation.

  9. a

    GBM models, constructed from historical data

    • arcticdata.io
    • search.dataone.org
    • +1more
    Updated May 24, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adam M. Young (2018). GBM models, constructed from historical data [Dataset]. http://doi.org/10.18739/A22Z3K
    Explore at:
    Dataset updated
    May 24, 2018
    Dataset provided by
    Arctic Data Center
    Authors
    Adam M. Young
    Time period covered
    Jan 1, 1950 - Dec 31, 2009
    Area covered
    Description

    Datasets used in: Young, A.M., Higuera, P.E., Duffy, P.A., and F.S. Hu. Climatic thresholds shape northern high-latitude fire regimes and imply vulnerability to future climate change. In Review at Ecography as of 10/2015. ---------------------------------------------------------------------- ----------------------- Description ---------------------------------- ---------------------------------------------------------------------- These data are the raw results/output from the boosted regression tree modeling conducted in Young et al. (In Review). Results for each of the three models (AK, BOREAL, and TUNDRA) are located separate folders. Each folder contains 100 RData files which contain the output/results from running the 'gbm()' function in R v3.2.0 100 times. Each of the 100 gbms was built using a different subsample of available data. The 'gbm()' function is available in the 'gbm' package in R (https://cran.r-project.org/). Details regarding meta-parameter selection and model building can be found in Young et al. (In Review). ---------------------------------------------------------------------- ------------------------ File Naming --------------------------------- ---------------------------------------------------------------------- 'MODEL_gbm_xx.RData' 'MODEL_' - Three different sets of GBM models: 'AK', 'BOREAL', and 'TUNDRA' 'gbm' - Generalized boosting model '_xx' - Model number (1-100) '.RData' - File Extension

  10. r

    2016 SoE Built environment Public transport by capital city 1990 to 2014

    • researchdata.edu.au
    Updated Jul 21, 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    State of the Environment (2016). 2016 SoE Built environment Public transport by capital city 1990 to 2014 [Dataset]. https://researchdata.edu.au/2016-soe-built-1990-2014/2980609
    Explore at:
    Dataset updated
    Jul 21, 2016
    Dataset provided by
    data.gov.au
    Authors
    State of the Environment
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Billions of passenger kms. This data was sourced from the Bureau of Infrastructure, Transport and Regional Economics displaying public transport in billions of kilometres by year.\r \r For more information see http://bitre.gov.au/publications/2014/is_059.aspx.\r \r Figure BLT33 in Built environment. See; https://soe.environment.gov.au/theme/built-environment/topic/2016/livability-transport#built-environment-figure-BLT33\r

  11. r

    2016 SoE Built environment Water efficiency selected industries 2008-09 to...

    • researchdata.edu.au
    Updated Jul 6, 2016
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    State of the Environment (2016). 2016 SoE Built environment Water efficiency selected industries 2008-09 to 2014-2015 [Dataset]. https://researchdata.edu.au/2016-soe-built-2014-2015/2987371
    Explore at:
    Dataset updated
    Jul 6, 2016
    Dataset provided by
    data.gov.au
    Authors
    State of the Environment
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Water efficiency ($m IGVA per gigalitre), selected industries - Includes Manufacturing, and Commercial and services (including Construction and Transport), 2008-09 to 2014-15\r \r Data provided by ABS from: http://www.abs.gov.au/AUSSTATS/abs@.nsf/allprimarymainfeatures/49F854E3831E4294CA2580580015E2A6?opendocument\r \r Figure BLT46 in Built environment theme.\r https://soe.environment.gov.au/theme/built-environment/topic/2016/urban-environmental-efficiency-water-efficiency#built-environment-figure-BLT46\r

  12. [Superseded] Intellectual Property Government Open Data 2019

    • researchdata.edu.au
    • data.gov.au
    Updated Jun 6, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
    Explore at:
    Dataset updated
    Jun 6, 2019
    Dataset provided by
    Data.govhttps://data.gov/
    Authors
    IP Australia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is IPGOD?\r

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.\r \r \r

    How do I use IPGOD?\r

    IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scalar.\r \r \r

    IP Data Platform\r

    IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform\r \r

    References\r

    \r The following pages can help you gain the understanding of the intellectual property administration and processes in Australia to help your analysis on the dataset.\r \r * Patents\r * Trade Marks\r * Designs\r * Plant Breeder’s Rights\r \r \r

    Updates\r

    \r

    Tables and columns\r

    \r Due to the changes in our systems, some tables have been affected.\r \r * We have added IPGOD 225 and IPGOD 325 to the dataset!\r * The IPGOD 206 table is not available this year.\r * Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.\r \r

    Data quality improvements\r

    \r Data quality has been improved across all tables.\r \r * Null values are simply empty rather than '31/12/9999'.\r * All date columns are now in ISO format 'yyyy-mm-dd'.\r * All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.\r * All tables are encoded in UTF-8.\r * All tables use the backslash \ as the escape character.\r * The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.

  13. n

    Data and Rscripts from: An integrated experimental and mathematical approach...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Barbara Joncour; William Nelson; Damie Pak; Ottar Bjornstad (2022). Data and Rscripts from: An integrated experimental and mathematical approach to inferring the role of food exploitation and interference interactions in shaping life history [Dataset]. http://doi.org/10.5061/dryad.1g1jwstzd
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 10, 2022
    Dataset provided by
    Queen's University
    Pennsylvania State University
    Authors
    Barbara Joncour; William Nelson; Damie Pak; Ottar Bjornstad
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Intraspecific interactions can occur in many ways, but the mechanisms can be broadly categorized as food exploitation and interference interactions. Identifying how intraspecific interactions impact life history is crucial for accurately predicting how population density and structure influence dynamics. However, disentangling the effects of interference interactions from those of exploitation using experiments is challenging for most biological systems. Here we propose an approach that combines experiments with modeling to infer the pathways of intraspecific interactions in a system. First, a consumer-resource model is built without intraspecific interactions. Then, the model is parameterized by fitting it to life-history data from a first experiment in which food abundance was varied. Next, hypothesized scenarios of intraspecific interactions are incorporated into the model, which is then used to predict life histories at increasing competitor density. Lastly, model predictions are compared against data from a second experiment that raised groups of competitors at different densities. This comparison allows us to infer the roles of interference and exploitation in shaping life history. We demonstrated the approach using the smaller tea tortrix Adoxophyes honmai across a range of temperatures. We investigated five scenarios of interactions that combined exploitation with three pathways for interference: effects on energetics (representing changes in ingestion or activity), effects on mortality (modeling deadly interactions), or effects on both mortality and ingestion (modeling cannibalism). Overall, intraspecific interactions in the tea tortrix are best explained by a high level of deadly interactions along with some level of interference that acts on energy, such as escaping and blocking access to food. Deadly interactions increase with temperature, while interference that acts on energy is strongest close to the optimal temperature for reproduction. Interestingly, exploitation is more important than interference at low competitor density. The combination of mathematical modeling and experimentation allowed us to mechanistically characterize the intraspecific interactions in the tea tortrix in a way that is readily incorporated into population-level mathematical models. The primary value of this approach, however, is that it can be applied to a much wider range of taxa than is possible with purely experimental approaches.

    Methods. We designed an approach to infer the most likely pathways of intraspecific interactions that shape life histories in a studied system. The approach proceeds in four steps that weave together theory and experiments. We demonstrated the approach with the smaller tea tortrix moth (Adoxophyes honmai).

    Step 1. Build base model. We first built the base model, which is the baseline for the theoretical framework used later to predict how different pathways of intraspecific interactions influence life histories. The base model is a consumer-resource cohort model that assumes no intraspecific interactions: no food exploitation and no interference interactions. As such, the base model describes solely how vital rates are impacted by changes in food abundance.

    Step 2. Parameterize base model (provided R script: Step2.r). Most model parameters can be directly estimated from independent data, but a few remained unknown. Unknown parameters were estimated by fitting the base model to the observed life-history traits in the food experiment (FoodExperiment.csv). The food experiment raised individuals in the absence of intraspecific interactions and exposed them to a wide range of food abundances.

    Step 3. Incorporate intraspecific interactions into the base model to predict their effects on life histories (provided R script: Step3.r). In this step, the parameterized base model was modified to incorporate several hypothesized scenarios of intraspecific interactions. For each scenario, we predicted how intraspecific interactions impact life-history traits and stage-structure distributions for groups of competitors.

    Step 4. Test model predictions using experiment. To evaluate the support for each hypothesis, we compared model predictions with data from the competition experiment (CompetitionExperiment.csv). The competition experiment measured the impact of intraspecific interactions (i.e., competitor density) on life-history traits and on the stage structure of groups of competitors. Comparing the experimental life-history data with model predictions allowed us to infer the respective roles of interference interactions and food exploitation in shaping life histories, as well as the functional dependencies for interference interactions in the studied system.
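    The four-step logic above can be sketched with a deliberately simplified toy model. Everything below is an illustrative assumption: the functional forms, the parameters, and the data are invented for the sketch and are not the authors' consumer-resource cohort model (which lives in the provided R scripts).

    ```python
    # Toy sketch of the four-step approach, assuming a simple "base model":
    # growth = a * food / (k + food). Forms, parameters, and data are illustrative.

    def fit_base(food_levels, observed_growth):
        """Step 2: grid-search least-squares fit of the base model to
        food-experiment data (no intraspecific interactions)."""
        best = None
        for a in [x / 10 for x in range(1, 31)]:
            for k in [x / 10 for x in range(1, 31)]:
                sse = sum((a * f / (k + f) - y) ** 2
                          for f, y in zip(food_levels, observed_growth))
                if best is None or sse < best[0]:
                    best = (sse, a, k)
        return best[1], best[2]

    def scenarios(a, k, food, density):
        """Step 3: hypothesized interaction pathways layered onto the fitted base model."""
        return {
            # exploitation: competitors split the shared food
            "exploitation": a * (food / density) / (k + food / density),
            # interference on energetics: activity costs rise with density
            "interference_energy": (a * food / (k + food)) / (1 + 0.5 * (density - 1)),
        }

    def best_scenario(a, k, food, densities, observed_growth):
        """Step 4: keep the scenario whose predictions best match the
        competition-experiment data."""
        sse = dict.fromkeys(scenarios(a, k, food, 1), 0.0)
        for d, y in zip(densities, observed_growth):
            for name, pred in scenarios(a, k, food, d).items():
                sse[name] += (pred - y) ** 2
        return min(sse, key=sse.get)
    ```

    With noiseless toy data generated at a = 2, k = 1, the fit recovers those values, and a density series generated under pure food splitting selects the "exploitation" scenario.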

  14. n

    Data from: Tree functional traits across Caribbean island dry forests are...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 25, 2023
    Cite
    Catherine Hulshof; Pablo Lopez; Alanis Rosa-Santiago; Janet Franklin; Jonathan Walter (2023). Tree functional traits across Caribbean island dry forests are remarkably similar [Dataset]. http://doi.org/10.5061/dryad.z08kprrj5
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 25, 2023
    Dataset provided by
    Virginia Commonwealth University
    University of Puerto Rico at Río Piedras
    San Diego State University
    Athenys Research
    Authors
    Catherine Hulshof; Pablo Lopez; Alanis Rosa-Santiago; Janet Franklin; Jonathan Walter
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Caribbean
    Description

    Delineation of potential dry forest and estimated actual dry forest on Caribbean islands. Potential dry forest is delineated based on CHELSA climate data (www.chelsa-climate.org) and the FAO definition of dry forest. Estimated actual dry forest is corrected for land cover using data from Hansen et al. (2022) https://doi.org/10.1088/1748-9326/ac46ec. Areas of potential dry forest, estimated actual dry forest, and built-up land covers are summarized by island and joined to CHELSA bioclimatic variables for selected islands where data on functional traits are available. Trait values by site are also included. The package consists of data outputs and R scripts to reproduce the data outputs from identified publicly available data sources. Methods. Areas potentially supporting tropical dry forest in the Caribbean were delineated based on precipitation normals data from CHELSA and a climatically based definition of tropical dry forest established by the Food and Agriculture Organization of the United Nations (FAO): total annual precipitation of 500 to 1500 mm, with 5 to 8 months receiving < 100 mm of precipitation. Land cover data (2019 conditions) from Hansen et al. were used to constrain estimated actual dry forest based on the intersection of climate-based potential dry forest with forested land cover. To facilitate analysis with functional trait data obtained from forests on a subset of islands, data on the area of potential and estimated actual dry forest and forest and built land covers were summarized by island and associated with a standard suite of 19 bioclimate variables. R scripts showing and reproducing our detailed methods are provided with the repository.
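    The FAO climatic screen described above can be sketched as a simple predicate over a cell's monthly precipitation normals. The function name and the example values below are illustrative and are not taken from the authors' R scripts.

    ```python
    def is_potential_dry_forest(monthly_precip_mm):
        """Apply the FAO climatic definition of tropical dry forest to a cell's
        12 monthly precipitation normals (mm): 500-1500 mm total annual
        precipitation and 5-8 months with < 100 mm. Illustrative sketch only."""
        annual = sum(monthly_precip_mm)
        dry_months = sum(1 for p in monthly_precip_mm if p < 100)
        return 500 <= annual <= 1500 and 5 <= dry_months <= 8

    # A cell with 1075 mm/yr and 5 dry months qualifies; a uniformly wet cell does not.
    print(is_potential_dry_forest([150, 120, 110, 105, 60, 40, 30, 50, 80, 100, 110, 120]))  # True
    print(is_potential_dry_forest([200] * 12))  # False
    ```

    In the actual workflow this predicate would be evaluated per raster cell against the CHELSA normals before intersecting with the Hansen et al. forest cover layer.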

  15. f

    Comparison of the Predictive Performance and Interpretability of Random...

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 5, 2023
    Cite
    Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley (2023). Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [Dataset]. http://doi.org/10.1021/acs.jcim.6b00753.s006
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.

  16. Data from: Dataset for Vector space model and the usage patterns of...

    • figshare.com
    bin
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave (2023). Dataset for Vector space model and the usage patterns of Indonesian denominal verbs [Dataset]. http://doi.org/10.6084/m9.figshare.8187155.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preface
    This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).

    How to cite the dataset
    If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for malindo_dbase; see below), please cite: Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155. Alternatively, click on the dark pink Cite button to browse different citation styles (the default is DataCite). The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research: Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Kiyoaki Shirai (ed.), Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43. A tutorial on how to use the data, together with the R Markdown Notebook for the analyses, is available on GitHub and figshare: Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. doi: https://doi.org/10.6084/m9.figshare.9970205

    Dataset description
    1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
    2. Files beginning with ngramexmpl_... are data for the n-grams (i.e., word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
    3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain text format and R dataset format [.rds]). Information on the corpus file and the sentence number in which each verb is found is included.
    4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
    5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model was built.
    6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist was built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions:
    - For me-: ^(?i)(me)([a-z-]{3,})$
    - For di-: ^(?i)(di)([a-z-]{3,})$
    - For ter-: ^(?i)(ter)([a-z-]{3,})$
    7. malindo_dbase is the MALINDO Morphological Dictionary (see above).

    References
    Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
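    The prefix wordlist (item 6) can be reproduced in outline with the three regular expressions quoted above. The sketch below is in Python rather than the authors' pipeline; the inline (?i) flag is expressed as re.IGNORECASE, and the token list is an invented illustration.

    ```python
    import re
    from collections import Counter

    # The three regular expressions from the description; (?i) becomes re.IGNORECASE.
    patterns = {
        "me-":  re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
        "di-":  re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
        "ter-": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
    }

    def prefix_wordlist(tokens):
        """Lowercase tokens and count those matching each prefix pattern."""
        counts = {prefix: Counter() for prefix in patterns}
        for token in tokens:
            token = token.lower()
            for prefix, pat in patterns.items():
                if pat.match(token):
                    counts[prefix][token] += 1
        return counts

    # Illustrative token list, not from the Leipzig corpus files.
    tokens = ["Membeli", "dibeli", "terbeli", "membeli", "buku", "di"]
    counts = prefix_wordlist(tokens)
    print(counts["me-"]["membeli"])   # 2
    print(counts["di-"]["dibeli"])    # 1
    ```

    Note that the bare preposition "di" is excluded because the patterns require at least three characters after the prefix.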

  17. f

    Decision tree inversion model results.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Mar 24, 2025
    Cite
    He Jing; Wang Bin; He Jiachen (2025). Decision tree inversion model results. [Dataset]. http://doi.org/10.1371/journal.pone.0319657.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    PLOS ONE
    Authors
    He Jing; Wang Bin; He Jiachen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As a key substance for crop photosynthesis, chlorophyll content is closely related to crop growth and health. Inversion of chlorophyll content from unmanned aerial vehicle (UAV) visible-light images can provide a theoretical basis for crop growth monitoring and health diagnosis. We used rice at the tasseling stage as the research object and obtained UAV visible orthophotos of two experimental fields, one planted manually (experimental area A) and one mechanically (experimental area B). We constructed 14 vegetation indices and 15 texture features and used the correlation coefficient method to analyze them comprehensively. Four vegetation indices and four texture features were then selected as feature variables for three models, K-nearest neighbors (KNN), decision tree (DT), and AdaBoost, used to invert chlorophyll content in experimental areas A and B. In the KNN model, the inversion model built with BGRI as the independent variable in area A has the highest accuracy, with R2 of 0.666 and RMSE of 0.79; the model built with RGRI in area B has the highest accuracy, with R2 of 0.729 and RMSE of 0.626. In the DT model, the model built with B-variance in area A has the highest accuracy, with R2 of 0.840 and RMSE of 0.464; the model built with G-mean in area B has the highest accuracy, with R2 of 0.845 and RMSE of 0.530. In the AdaBoost model, the model built with R-skewness in area A has the highest accuracy, with R2 of 0.826 and RMSE of 0.642; the model built with g in area B has the highest accuracy, with R2 of 0.879 and RMSE of 0.599. Overall, the best inversion models for experimental areas A and B were B-variance decision tree and g-AdaBoost, respectively. These models can quickly and accurately invert the chlorophyll content of rice and provide a theoretical basis for monitoring crop growth and health under different cultivation methods.

  18. R code, data, and analysis documentation for Colour biases in learned...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Wyatt Toure; Simon M. Reader (2023). R code, data, and analysis documentation for Colour biases in learned foraging preferences in Trinidadian guppies [Dataset]. http://doi.org/10.6084/m9.figshare.14404868.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Wyatt Toure; Simon M. Reader
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary
    This is the repository containing the R code and data to produce the analyses and figures in the manuscript ‘Colour biases in learned foraging preferences in Trinidadian guppies’. R version 3.6.2 was used for this project. Here, we explain how to reproduce the results, provide the location of the metadata for the data sheets, and describe the root directory and folder contents. This material is adapted from the project's README file, README.md, which is located in the root directory.

    How to reproduce the results
    This project uses the renv package from RStudio to manage package dependencies and ensure reproducibility through time. To reproduce results with the package versions used when this project was created, you will need to install renv using install.packages("renv") in R. If you want to reproduce the results, it is best to download the entire repository onto your system. This can be done by clicking the Download button on the FigShare repository (DOI: 10.6084/m9.figshare.14404868), which downloads a zip file of the entire repository; unzip it to get access to the project files. Once the repository is downloaded, navigate to the root directory and open guppy-colour-learning-project.Rproj. It is important to open the project using the .Rproj file to ensure the working directory is set correctly. Then install the package dependencies using renv::restore(), which installs the correct versions of all the packages needed to reproduce our results. Packages are installed in a stand-alone library for this project and will not affect your installed R packages anywhere else. If you want to reproduce specific results, open either analysis-experiment-1.Rmd for results from experiment 1 or analysis-experiment-2.Rmd for results from experiment 2; both are located in the root directory. You can select the Run All option under the Code menu in the RStudio navbar to execute all the code chunks. You can also run chunks independently, though we advise doing so sequentially, since variables necessary for the analysis are created as the script progresses.

    Metadata
    Data are available in the data/ directory.
    - colour-learning-experiment-1-data.csv are the data for experiment 1
    - colour-learning-experiment-2-full-data.csv are the data for experiment 2
    We provide the variable descriptions for the data sets in the file metadata.md, located in the data/ directory. The packages required to conduct the analyses and construct the website, along with their versions and citations, are listed in the file required-r-packages.md.

    Directory structure
    - data/ contains the raw data used to conduct the analyses
    - docs/ contains the reader-friendly html write-up of the analyses; the GitHub Pages site is built from this folder
    - R/ contains custom R functions used in the analysis
    - references/ contains reference information and formatting for citations used in the project
    - renv/ contains an activation script and configuration files for the renv package manager
    - figs/ contains the individual files for the figures and residual diagnostic plots produced by the analysis scripts. This directory is created and populated by running analysis-experiment-1.Rmd, analysis-experiment-2.Rmd, and combined-figures.Rmd

    Root directory contents
    The root directory contains the Rmd scripts used to conduct the analyses, create figures, and render the website pages. Below we describe these files as well as the additional files in the root directory.
    - analysis-experiment-1.Rmd is the R code and documentation for the experiment 1 data preparation and analysis. This script generates the Analysis 1 page of the website.
    - analysis-experiment-2.Rmd is the R code and documentation for the experiment 2 data preparation and analysis. This script generates the Analysis 2 page of the website.
    - protocols.Rmd contains the protocols used to conduct the experiments and generate the data. This script generates the Protocols page of the website.
    - index.Rmd creates the Homepage of the project site.
    - combined-figures.Rmd is the R code used to create figures that combine data from experiments 1 and 2. Not used in the project site.
    - treatment-object-side-assignment.Rmd is the R code used to assign treatments and object sides during trials for experiment 2. Not used in the project site.
    - renv.lock is a JSON-formatted plain text file that contains package information for the project; renv installs the packages listed in this file upon executing renv::restore()
    - required-r-packages.md is a plain text file containing the versions and sources of the packages required for the project.
    - styles.css contains the CSS formatting for the rendered html pages
    - LICENSE.md contains the license indicating the conditions under which the code can be reused
    - guppy-colour-learning-project.Rproj is the R project file, which sets the working directory of the R instance to the root directory of this repository. If trying to run the code in this repository to reproduce results, it is important to open R by clicking on this .Rproj file.

  19. f

    Top 15 predictors for machine learning algorithms with a built-in importance...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Rebecca Craig-Schapiro; Max Kuhn; Chengjie Xiong; Eve H. Pickering; Jingxia Liu; Thomas P. Misko; Richard J. Perrin; Kelly R. Bales; Holly Soares; Anne M. Fagan; David M. Holtzman (2023). Top 15 predictors for machine learning algorithms with a built-in importance measure. [Dataset]. http://doi.org/10.1371/journal.pone.0018850.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rebecca Craig-Schapiro; Max Kuhn; Chengjie Xiong; Eve H. Pickering; Jingxia Liu; Thomas P. Misko; Richard J. Perrin; Kelly R. Bales; Holly Soares; Anne M. Fagan; David M. Holtzman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ranking of the top 15 predictors for the four models with a built-in importance statistic demonstrates considerable overlap in the top predictors for each model. Furthermore, nearly all of the markers found to best discriminate CDR 0 from CDR>0 participants in the more targeted ROC analyses (Table 5) were also identified as the top predictors in the machine learning models, reconfirming their biomarker potential.

  20. f

    Vegetation index.

    • figshare.com
    xls
    Updated Mar 24, 2025
    Cite
    He Jing; Wang Bin; He Jiachen (2025). Vegetation index. [Dataset]. http://doi.org/10.1371/journal.pone.0319657.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    PLOS ONE
    Authors
    He Jing; Wang Bin; He Jiachen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As a key substance for crop photosynthesis, chlorophyll content is closely related to crop growth and health. Inversion of chlorophyll content from unmanned aerial vehicle (UAV) visible-light images can provide a theoretical basis for crop growth monitoring and health diagnosis. We used rice at the tasseling stage as the research object and obtained UAV visible orthophotos of two experimental fields, one planted manually (experimental area A) and one mechanically (experimental area B). We constructed 14 vegetation indices and 15 texture features and used the correlation coefficient method to analyze them comprehensively. Four vegetation indices and four texture features were then selected as feature variables for three models, K-nearest neighbors (KNN), decision tree (DT), and AdaBoost, used to invert chlorophyll content in experimental areas A and B. In the KNN model, the inversion model built with BGRI as the independent variable in area A has the highest accuracy, with R2 of 0.666 and RMSE of 0.79; the model built with RGRI in area B has the highest accuracy, with R2 of 0.729 and RMSE of 0.626. In the DT model, the model built with B-variance in area A has the highest accuracy, with R2 of 0.840 and RMSE of 0.464; the model built with G-mean in area B has the highest accuracy, with R2 of 0.845 and RMSE of 0.530. In the AdaBoost model, the model built with R-skewness in area A has the highest accuracy, with R2 of 0.826 and RMSE of 0.642; the model built with g in area B has the highest accuracy, with R2 of 0.879 and RMSE of 0.599. Overall, the best inversion models for experimental areas A and B were B-variance decision tree and g-AdaBoost, respectively. These models can quickly and accurately invert the chlorophyll content of rice and provide a theoretical basis for monitoring crop growth and health under different cultivation methods.
