100+ datasets found
  1. Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
    Explore at:
    docx (available download formats)
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
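
    A minimal sketch (base R only; not part of the case study materials) of the Iris-module analyses described above: summary statistics, correlation, a histogram, and a scatter plot.

        # Built-in iris dataset: sepal and petal measurements for three species
        data(iris)

        # Summary statistics and correlation
        summary(iris$Sepal.Length)
        cor(iris$Sepal.Length, iris$Petal.Length)

        # Histogram and scatter plot of the kind practiced in the module
        hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
        plot(iris$Petal.Length, iris$Sepal.Length, col = iris$Species, pch = 19,
             xlab = "Petal length (cm)", ylab = "Sepal length (cm)")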

  2. ✈️Air Passengers Dataset✈️

    • kaggle.com
    Updated Feb 10, 2024
    Cite
    Bryan Milleanno (2024). ✈️Air Passengers Dataset✈️ [Dataset]. https://www.kaggle.com/datasets/brmil07/air-passengers-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bryan Milleanno
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides monthly totals of US airline passengers from 1949 to 1960, taken from R's built-in AirPassengers dataset. Analysts often apply statistical techniques such as decomposition, smoothing, and forecasting models to analyze its patterns, trends, and seasonal fluctuations. Due to its historical nature and consistent temporal granularity, the Air Passengers dataset serves as a valuable resource for researchers, practitioners, and students in the fields of statistics, econometrics, and transportation planning.
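
    A minimal sketch (base R only; not from the dataset page) of the decomposition and forecasting techniques the description mentions:

        # AirPassengers ships with base R as a monthly ts object (1949-1960)
        data(AirPassengers)

        # Classical decomposition into trend, seasonal, and remainder components
        # (multiplicative, since the seasonal swings grow with the trend)
        plot(decompose(AirPassengers, type = "multiplicative"))

        # A simple forecast via Holt-Winters exponential smoothing
        fit <- HoltWinters(AirPassengers, seasonal = "multiplicative")
        plot(predict(fit, n.ahead = 24))  # two years ahead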

  3. Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system. For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It describes the different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures, including a description of the conditions or assumptions necessary for performing the various statistical methods or tests, and how to interpret the results. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples: from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, it brings together statistics and computer programming for research.
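
    A minimal sketch (hypothetical data; base R) of the parametric/non-parametric workflow the book describes: check an assumption, then choose the matching test.

        # Hypothetical two-group sample for illustration
        set.seed(42)
        group_a <- rnorm(30, mean = 5.0, sd = 1.2)
        group_b <- rnorm(30, mean = 5.8, sd = 1.2)

        # Check the normality assumption before choosing a test
        shapiro.test(group_a)

        # Parametric: two-sample t-test (assumes approximate normality)
        t.test(group_a, group_b)

        # Non-parametric alternative if the assumption fails
        wilcox.test(group_a, group_b)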

  4. CO2_data

    • kaggle.com
    zip
    Updated Oct 29, 2022
    Cite
    Saheli Basu Roy (2022). CO2_data [Dataset]. https://www.kaggle.com/datasets/sahelibasuroy/co2data
    Explore at:
    zip (681 bytes; available download formats)
    Dataset updated
    Oct 29, 2022
    Authors
    Saheli Basu Roy
    Description

    The CO2 data frame is a dataset built into R showing the results of an experiment on the cold tolerance of grass. Grass samples from two regions were grown in either a chilled or nonchilled environment, and their CO2 uptake rate was tested. The dataset has been downloaded as a .csv file.

    The two regions chosen for this experiment are Quebec and Mississippi. Each region has three different plants used for this experiment, and each plant was treated in either a chilled or a nonchilled environment. Average concentration is found to be constant for all categories; thus, plots are made with respect to variation in average CO2 uptake.
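
    A minimal sketch (base R; using the built-in CO2 data frame rather than the downloaded .csv) of the exploration described above:

        # Built-in CO2 dataset: cold tolerance of a grass species
        data(CO2)
        str(CO2)  # Plant, Type (Quebec/Mississippi), Treatment, conc, uptake

        # Mean uptake by region and treatment
        aggregate(uptake ~ Type + Treatment, data = CO2, FUN = mean)

        # Uptake distributions across the four region/treatment groups
        boxplot(uptake ~ Type * Treatment, data = CO2,
                ylab = "CO2 uptake (umol/m^2 sec)")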

  5. Dataset of Indoor and Built Environment Publication in 2016, Laboratory...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset of Indoor and Built Environment Publication in 2016, Laboratory evaluation of polychlorinated biphenyls encapsulation methods [Dataset]. https://catalog.data.gov/dataset/dataset-of-indoor-and-built-environment-publication-in-2016-laboratory-evaluation-of-polyc
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data presented in this data file is a product of a journal publication. The dataset contains PCB sorption concentrations on encapsulants, PCB concentrations in the air and in wipe samples, model simulations of the PCB concentration gradient in the source and encapsulant layers on exposed surfaces of encapsulants and in room air at different times, and the ranking of encapsulants’ performance. This dataset is associated with the following publication: Liu, X., Z. Guo, K. Krebs, N. Roache, R. Stinson, J. Nardin, R. Pope, C. Mocka, and R. Logan. Laboratory evaluation of PCBs encapsulation method. Indoor and Built Environment, Sage Publications, Thousand Oaks, CA, USA, 25(6): 895-915, (2016).

  6. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    bin, application/gzip, zip, text/x-python (available download formats)
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    GNU GPL 2.0: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34GB unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       cd ghd
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15-30 minutes.
    
    - create a folder `
  7. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files; depending on your computational resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse MediaWiki XML dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, through loading the datasets and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr. This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, unpack the tar archive (on a unix system, tar xf code.tar), and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf; otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets. The intermediate datasets are found in the intermediate_data.7z archive and can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis. Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can then simply run regen.all.sh to fit the models, build the plots and create the RDS files.

    Generating datasets. The intermediate files are generated from all.edits.RDS; this process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar, then unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Then run 01_build_datasets.R. The intermediate RDS files used in the analysis are created from all.edits.RDS; to replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
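
    A minimal sketch in R of the loading steps above (assuming code.tar and intermediate_data.7z have been extracted into the working directory):

        # Dependencies listed in the replication guide
        install.packages(c("data.table", "scales", "ggplot2", "lubridate", "texreg"))

        # Load one intermediate RDS dataset and inspect its structure
        newcomer.ds <- readRDS("newcomers.RDS")
        str(newcomer.ds)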

  8. Plankton measurements found in dataset OSD taken from the MIKHAIL LOMONOSOV...

    • catalog.data.gov
    • data.cnra.ca.gov
    • +2 more
    Updated Nov 1, 2025
    + more versions
    Cite
    (Point of Contact) (2025). Plankton measurements found in dataset OSD taken from the MIKHAIL LOMONOSOV (R/V; call sign UQIH; built 1957; IMO5234955) and EKVATOR in the North Atlantic, Coastal N Atlantic and other locations from 1958 - 1959 (NCEI Accession 0052915) [Dataset]. https://catalog.data.gov/dataset/plankton-measurements-found-in-dataset-osd-taken-from-the-mikhail-lomonosov-r-v-call-sign-uqih-
    Explore at:
    Dataset updated
    Nov 1, 2025
    Dataset provided by
    (Point of Contact)
    Description

    Zooplankton biomass data collected from the North Atlantic Ocean in 1958-1959, received from NMFS.

  9. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
    zip (9619636 bytes; available download formats)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.


    How to use the dataset

    In order to use this dataset, you will need a text editor or office application such as Microsoft Word or LibreOffice installed on your computer, and a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open The Reddit Dataset folder and double-click the the-reddit-dataset-dataset-posts.csv file to open it in your preferred editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
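
    A minimal sketch in R (using data.table; column names taken from the Columns section below) of the trend analysis suggested above:

        library(data.table)

        # Load the posts file extracted from the download
        posts <- fread("the-reddit-dataset-dataset-posts.csv")

        # Convert the UTC epoch timestamp and compute the average score per year
        posts[, created := as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC")]
        posts[, .(n = .N, mean_score = mean(score)), by = .(yr = year(created))][order(yr)]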

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of posts and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |


  10. AWC to 60cm DSM data of the Roper catchment NT generated by the Roper River...

    • data.csiro.au
    • researchdata.edu.au
    Updated Apr 16, 2024
    Cite
    Ian Watson; Mark Thomas; Seonaid Philip; Uta Stockmann; Ross Searle; Linda Gregory; Jason Hill; Elisabeth Bui; John Gallant; Peter R Wilson; Peter Wilson (2024). AWC to 60cm DSM data of the Roper catchment NT generated by the Roper River Water Resource Assessment [Dataset]. http://doi.org/10.25919/y0v9-7b58
    Explore at:
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Ian Watson; Mark Thomas; Seonaid Philip; Uta Stockmann; Ross Searle; Linda Gregory; Jason Hill; Elisabeth Bui; John Gallant; Peter R Wilson; Peter Wilson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 1, 2020 - Jun 30, 2023
    Area covered
    Dataset funded by
    Northern Territory Department of Environment, Parks and Water Security
    CSIRO (http://www.csiro.au/)
    Description

    AWC to 60cm is one of 18 attributes of soils chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This AWC raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities and promote detailed investigation for a range of sustainable regional development options and was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment NT as well as the ecological, social and cultural (indigenous water values, rights and aspirations) impacts of development.

    Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.

    1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster etc).
    2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
    3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
    4. Performed database analysis to extract the data to the specific selection criteria required for the attribute to be modelled.
    5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
    6. Created the AWC to 60cm DSM attribute raster dataset. DSM data is a geo-referenced dataset, generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
    7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
    8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data; testing the quality of the DSM models was carried out using data withheld from model computations and expressed as OOB and R-squared results, giving an estimate of the reliability of the model predictions (these results are supplied). Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute's "reliability"; this used the 500 individual trees of the attribute's RF models to generate 500 datasets of the attribute to estimate model reliability, with the coefficient of variation used for continuous attributes (this data is supplied). Method 3: collection of independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips; across each of the study areas a two-week validation field trip was conducted using a new validation site set, produced by a random sampling design based on conditioned Latin hypercube sampling using the reliability data of the attribute, and the modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
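
    A minimal sketch (hypothetical site table and column names, not the project's actual covariate stack) of the ranger Random Forest modelling step described in the lineage:

        library(ranger)

        # Hypothetical site data: observed AWC plus environmental covariates
        sites <- read.csv("awc_sites.csv")  # columns: awc_60cm, cov1, cov2, ...

        # Random Forest with 500 trees, matching the 500 models used for the
        # reliability (coefficient of variation) estimates
        fit <- ranger(awc_60cm ~ ., data = sites, num.trees = 500,
                      importance = "impurity")
        fit$r.squared         # out-of-bag R-squared, as reported for model quality
        fit$prediction.error  # out-of-bag error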

  11. Measurements made by the NOAA R/V Ron H. Brown

    • data.nasa.gov
    • cmr.earthdata.nasa.gov
    • +1 more
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Measurements made by the NOAA R/V Ron H. Brown [Dataset]. https://data.nasa.gov/dataset/measurements-made-by-the-noaa-r-v-ron-h-brown-b879f
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Measurements made by the NOAA research vessel Ron H. Brown between 2000 and 2002.

  12. Data from: AgrImOnIA: Open Access dataset correlating livestock and air...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 6, 2024
    Cite
    Fassò, Alessandro; Rodeschini, Jacopo; Fusta Moro, Alessandro; Shaboviq, Qendrim; Vinciguerra, Marco; Maranzano, Paolo; Cameletti, Michela; Finazzi, Francesco; Golini, Natalia; Ignaccolo, Rosaria; Otto, Philipp (2024). AgrImOnIA: Open Access dataset correlating livestock and air quality in the Lombardy region, Italy [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6620529
    Explore at:
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    University of Turin
    University of Milano-Bicocca
    University of Bergamo
    Leibniz University Hannover
    Authors
    Fassò, Alessandro; Rodeschini, Jacopo; Fusta Moro, Alessandro; Shaboviq, Qendrim; Vinciguerra, Marco; Maranzano, Paolo; Cameletti, Michela; Finazzi, Francesco; Golini, Natalia; Ignaccolo, Rosaria; Otto, Philipp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Lombardy, Italy
    Description

    The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.

    The building process of the dataset is detailed in the companion paper:

    A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.


    This dataset is a collection of estimated daily values for a range of measurements across different dimensions: air quality, meteorology, emissions, livestock animals and land use. Data are related to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer on Lombardy borders.

    The data uses several aggregation and interpolation methods to estimate the measurement for all days.

    The files in the record, renamed according to their version (e.g. .._v_3_0_0), are:

    Agrimonia_Dataset.csv (.mat and .Rdata), which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. To simplify access to variables in the Agrimonia dataset, each variable name starts with the dimension of the variable, i.e., the names of variables related to the AQ dimension start with 'AQ_'. This file is also archived in MATLAB and R formats.

    Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.

    Metadata_AQ_imputation_uncertainty.csv which contains the daily uncertainty estimate of the imputed observation for the AQ to mitigate missing data in the hourly time series.

    Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.

    Metadata_monitoring_network_registry.csv which contains all details about the AQ monitoring station used to build the dataset. Information about air quality monitoring stations include: station type, municipality code, environment type, altitude, pollutants sampled and other. Each row represents a single sensor.

    Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.

    AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.

    The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data
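
    A minimal sketch (assuming the CSV release and the AQ_/WE_/... prefix convention described above) of loading the dataset in R:

        # Load the main dataset (the .Rdata release can be loaded with load() instead)
        agrimonia <- read.csv("Agrimonia_Dataset.csv")

        # Variable names are prefixed by dimension: AQ_, WE_, EM_, LI_, LA_
        aq_vars <- grep("^AQ_", names(agrimonia), value = TRUE)
        summary(agrimonia[, aq_vars])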

    UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0

    A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where the variables WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed to fix some bugs.

    In addition, two new columns are added: LI_pigs_v2 and LI_bovine_v2, which represent the density of pigs and bovines (expressed as animals per square kilometre) within a square of roughly 10 x 10 km centred at the station location.

    A new dataset is released: the Agrimonia Grid Covariates (AGC), which includes daily information for the period from 2016 to 2020 for almost all variables within the Agrimonia Dataset on an equidistant grid covering the Lombardy region and its surrounding area. The AGC does not include AQ variables, as they come from monitoring stations that are irregularly spread over the area considered.

    UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2

    A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where the variable WE_tot_precipitation has been recomputed to fix some bugs.

    A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv where the spatial resolution of the variable WE_precipitation_t is corrected.

    UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1

    Minor bug fixed.

    UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0

    A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:

    Added values for the LA_land_use variable for Switzerland stations (in Agrimonia_Dataset_v_2_0_0.csv)

    Deleted incorrect values for the LA_soil_use variable for stations outside the Lombardy region during 2018 (in Agrimonia_Dataset_v_2_0_0.csv)

    Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)

  13. Using Descriptive Statistics to Analyse Data in R

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Enrico68 (2024). Using Descriptive Statistics to Analyse Data in R [Dataset]. https://www.kaggle.com/datasets/enrico68/using-descriptive-statistics-to-analyse-data-in-r
    Explore at:
    zip (105561 bytes; available download formats)
    Dataset updated
    May 9, 2024
    Authors
    Enrico68
    Description

    • Load and view a real-world dataset in RStudio

    • Calculate “Measure of Frequency” metrics

    • Calculate “Measure of Central Tendency” metrics

    • Calculate “Measure of Dispersion” metrics

    • Use R’s in-built functions for additional data quality metrics

    • Create a custom R function to calculate descriptive statistics on any given dataset
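
    A minimal sketch (hypothetical numeric vector; base R) of the descriptive-statistics tasks listed above:

        # Hypothetical data standing in for a real-world column
        x <- c(12, 15, 15, 18, 22, 22, 22, 30, 41, 41)

        # Measures of frequency, central tendency, and dispersion
        table(x)
        mean(x); median(x)
        range(x); var(x); sd(x); IQR(x)

        # Built-in summary, plus a custom descriptive function
        summary(x)
        describe_vec <- function(v) {
          c(n = length(v), mean = mean(v), sd = sd(v),
            median = median(v), iqr = IQR(v), missing = sum(is.na(v)))
        }
        describe_vec(x)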

  14. R-PRM

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    Shuaijie She (2025). R-PRM [Dataset]. https://huggingface.co/datasets/kevinpro/R-PRM
    Explore at:
    Dataset updated
    Mar 31, 2025
    Authors
    Shuaijie She
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📘 R-PRM Dataset (SFT + DPO)

    This dataset is developed for training Reasoning-Driven Process Reward Models (R-PRM), proposed in our ACL 2025 paper. It consists of two stages:

    • SFT (Supervised Fine-Tuning): collected from strong LLMs prompted with limited annotated examples, enabling reasoning-style evaluation.
    • DPO (Direct Preference Optimization): constructed by sampling multiple reasoning trajectories and forming preference pairs without additional labels.

    These datasets are used… See the full description on the dataset page: https://huggingface.co/datasets/kevinpro/R-PRM.

  15. Temperature, salinity and other measurements found in datasets XBT and CTD...

    • s.cnmilf.com
    • catalog.data.gov
    Updated Oct 2, 2025
    + more versions
    Cite
    (Point of Contact) (2025). Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 1996-02-02) in the North Pacific, Coastal N Pacific and other locations in 1999 (NCEI Accession 0000857) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/temperature-salinity-and-other-measurements-found-in-datasets-xbt-and-ctd-taken-from-the-mirai-
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    (Point of Contact)
    Description

    Temperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996) in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857). Data were submitted by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).

  16. Temperature measurements found in dataset XBT taken from the RMAS NEWTON...

    • catalog.data.gov
    • data.cnra.ca.gov
    • +3 more
    Updated Oct 2, 2025
    + more versions
    Cite
    (Point of Contact) (2025). Temperature measurements found in dataset XBT taken from the RMAS NEWTON (CALL SIGN GURN - launched 1975), DISCOVERY (R/V; call sign GLNE; built 12.1962; IMO5090660) and other platforms in the North Atlantic, Arctic and other locations from 1984 to 2000 (NCEI Accession 0000722) [Dataset]. https://catalog.data.gov/dataset/temperature-measurements-found-in-dataset-xbt-taken-from-the-rmas-newton-call-sign-gurn-launche1
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    (Point of Contact)
    Description

    Release of ships-of-opportunity XBT data compiled by the Maritime Environment Information Centre at the UKHO. The data represents 5 years (1995-2000) of archived SOOP XBT observations. The dataset contains 7214 observations from various vessels (see vessel summary); the data ranges from 1984 to 2000. Prior to 1995, the data was released annually to the oceanographic community through ICES and US NODC, World Data Centre A.

  17. Synset-based dataset built by using semantic-based feature reduction...

    • data.niaid.nih.gov
    • portalcientifico.uvigo.gal
    • +1 more
    Updated Nov 23, 2022
    Cite
    Maria Novo-Lourés; Javier Quintás; Reyes Pavón; Rosalía Laza; José R. Méndez; David Ruano-Ordás (2022). Synset-based dataset built by using semantic-based feature reduction techniques. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7351575
    Explore at:
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Universidade de Vigo
    Authors
    Maria Novo-Lourés; Javier Quintás; Reyes Pavón; Rosalía Laza; José R. Méndez; David Ruano-Ordás
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was built by applying an enhanced semantic-based feature selection method over two publicly available datasets: SpamAssassin (SA.zip) and Youtube Comments (YT.zip).

  18. Geoscape Administrative Boundaries

    • researchdata.edu.au
    • data.gov.au
    Updated Jul 11, 2014
    + more versions
    Cite
    Department of Industry, Science and Resources (DISR) (2014). Geoscape Administrative Boundaries [Dataset]. https://researchdata.edu.au/geoscape-administrative-boundaries/2976730
    Explore at:
    Dataset updated
    Jul 11, 2014
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Department of Industry, Science and Resources (DISR)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/data/dataset/previous-versions-of-the-geoscape-administrative-boundaries

    Geoscape Administrative Boundaries is Australia’s most comprehensive national collection of boundaries, including government, statistical and electoral boundaries. It is built and maintained by Geoscape Australia using authoritative government data. Further information about contributors to Administrative Boundaries is available here.

    This dataset comprises seven Geoscape products:

    • Localities
    • Local Government Areas (LGAs)
    • Wards
    • Australian Bureau of Statistics (ABS) Boundaries
    • Electoral Boundaries
    • State Boundaries
    • Town Points

    Updated versions of Administrative Boundaries are published on a quarterly basis. Users have the option to download datasets with feature coordinates referencing either GDA94 or GDA2020 datums.

    Notable changes in the November 2025 release:

    • There have been spatial changes (area) greater than 1 km2 to the local government areas ‘Derwent Valley Council’ and ‘Brighton Council’ in Tasmania.
    • ‘Kairabah’ locality in ‘Logan City’ local government area has been retired in Queensland.
    • There have been spatial changes (area) greater than 1 km2 to the localities ‘Neilrex’, ‘Daruka’ and ‘Merrygoen’ in New South Wales, ‘Yarrabilba’, ‘Petrie’ and ‘Kallangur’ in Queensland, and ‘Flinders Ranges’ and ‘Angorigina’ in South Australia.

    Further information on Administrative Boundaries, including FAQs on the data, is available here or through Geoscape Australia’s network of partners. They provide a range of commercial products based on Administrative Boundaries, including software solutions, consultancy and support.

    Users must also note the following attribution requirements. Preferred attribution for the Licensed Material: Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0). Preferred attribution for Adapted Material: Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).

    What to Expect When You Download Administrative Boundaries

    Administrative Boundaries is a large dataset (around 1.5GB unpacked), made up of seven themes each containing multiple layers. Users are advised to read the technical documentation, including the product change notices and the individual product descriptions, before downloading and using the product. Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/dataset/ds-dga-b4ad5702-ea2b-4f04-833c-d0229bfd689e/details?q=previous

    License Information

    The Australian Government has negotiated the release of Administrative Boundaries to the whole economy under an open CC BY 4.0 licence. Users must only use the data in ways that are consistent with the Australian Privacy Principles issued under the Privacy Act 1988 (Cth).
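
    A minimal sketch in R (using the sf package; the shapefile name here is hypothetical and should be adjusted to the layer actually extracted from the download):

        library(sf)

        # Hypothetical file name for one of the seven themes
        lga <- st_read("LocalGovernmentAreas.shp")

        # Check which datum the coordinates reference
        # (GDA94 = EPSG:4283, GDA2020 = EPSG:7844)
        st_crs(lga)

        # Plot the boundary geometries
        plot(st_geometry(lga))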

  19. Mechanical Ventilation and Indoor Air Quality in Recently Constructed U.S....

    • data.openei.org
    • datasets.ai
    • +2 more
    archive +2
    Updated Mar 22, 2023
    + more versions
    Cite
    Haoran Zhao; Chrissi A. Antonopoulos; Samuel I. Rosenberg; Iain S. Walker; William W. Delp; Wanyu R. Chan; Marion Russell; Brett C. Singer (2023). Mechanical Ventilation and Indoor Air Quality in Recently Constructed U.S. Homes in Marine and Cold-Dry Climates Data from Building America Project [Dataset]. http://doi.org/10.25984/1971379
    Explore at:
    archive, text_document, website (available download formats)
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Lawrence Berkeley National Laboratory
    Open Energy Data Initiative (OEDI)
    Authors
    Haoran Zhao; Chrissi A. Antonopoulos; Samuel I. Rosenberg; Iain S. Walker; William W. Delp; Wanyu R. Chan; Marion Russell; Brett C. Singer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Data were collected to characterize whole-house mechanical ventilation (WHMV) and indoor air quality (IAQ) in 55 homes in the Marine climate of Oregon and Cold-Dry climate of Colorado in the U.S. Sixteen homes were monitored for two weeks, with and without WHMV operating. Ventilation airflows; airtightness; time-resolved CO2, PM2.5 and radon; and time-integrated NO2, NOX and formaldehyde were measured. Participants provided information about IAQ-impacting activities, perceptions and ventilation use. All homes had operational cooktop ventilation and bathroom exhaust. Thirty homes had equipment that could meet the ASHRAE 62.2-2010 standard with continuous or controlled runtime and 34 had some WHMV operating as found. Thirty-five of 46 participants with WHMV reported they did not know how to operate it, and only half of the systems were properly labeled. Two-week homes had lower formaldehyde, radon, CO2, and NO (NOX-NO2) when operated with WHMV; and also had faster PM2.5 decays following indoor emission events. Overall IAQ satisfaction was similar in Oregon and Colorado, but more Colorado participants (19 vs. 3%) felt their IAQ could be improved and more reported dryness as a problem (58 vs. 14%). The collected data indicate that there are benefits of operating WHMV, even when continuous use may not be needed because outdoor pollutant concentrations are low and indoor sources do not present substantial challenges.

  20. Measurements made onboard the R/V MIRAI between 2000 and 2003

    • data.nasa.gov
    • cmr.earthdata.nasa.gov
    • +1 more
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Measurements made onboard the R/V MIRAI between 2000 and 2003 [Dataset]. https://data.nasa.gov/dataset/measurements-made-onboard-the-r-v-mirai-between-2000-and-2003-407da
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Measurements made by the MIRAI research vessel in the JAMSTEC fleet between 2000 and 2003.
