70 datasets found
  1. Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series...

    • osti.gov
    • dataone.org
    • +1 more
    Updated Dec 31, 2020
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series Data for Billy Barr, East River, Colorado USA [Dataset]. http://doi.org/10.15485/1823516
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Area covered
    Colorado, East River, United States
    Description

    A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1, preliminary exploration of the raw data sets, including time formatting and combining datasets of different lengths and different time intervals; Phase 2, QA of the datasets, including detecting and flagging duplicates, outliers, and extreme values; and Phase 3, development of time series of a desired frequency, imputation of missing values, visualization, and a final statistical summary. The framework was applied to the time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) and is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

    The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data used for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC'd and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is an R script that implements the QA/QC and flagging process. The CSV data files provide the input and output files used by the R script.
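
    Below is a minimal R sketch of the three phases described above, using hypothetical column names (timestamp, air_temp); the authoritative implementation is the QAQC_Billy_Barr_2021-03-22.R script shipped with this package.

        # Phase 1-3 sketch (hypothetical columns; see QAQC_Billy_Barr_2021-03-22.R
        # in this data package for the actual implementation).
        library(dplyr)
        library(zoo)

        raw <- read.csv("Billy_Barr_raw_qaqc.csv")

        # Phase 1: parse timestamps and sort
        raw$timestamp <- as.POSIXct(raw$timestamp, tz = "UTC")
        raw <- arrange(raw, timestamp)

        # Phase 2: flag duplicates and outliers (here, > 4 SD from the mean)
        qa <- raw %>%
          mutate(flag_dup = duplicated(timestamp),
                 flag_out = abs(air_temp - mean(air_temp, na.rm = TRUE)) >
                            4 * sd(air_temp, na.rm = TRUE))

        # Phase 3: aggregate to hourly means, then interpolate short gaps
        hourly <- qa %>%
          filter(!flag_dup, !flag_out) %>%
          group_by(hour = cut(timestamp, "1 hour")) %>%
          summarise(air_temp = mean(air_temp, na.rm = TRUE)) %>%
          mutate(air_temp = zoo::na.approx(air_temp, na.rm = FALSE))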

  2. Scripts and data to run R-QWTREND models and produce results

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Scripts and data to run R-QWTREND models and produce results [Dataset]. https://catalog.data.gov/dataset/scripts-and-data-to-run-r-qwtrend-models-and-produce-results
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This child page contains a zipped folder which contains all items necessary to run trend models and produce results published in U.S. Geological Survey Scientific Investigations Report 2021–XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p.]. To run the R-QWTREND program in R, 6 files are required and each is included in this child page: prepQWdataV4.txt, runQWmodelV4XXUEP.txt, plotQWtrendV4XXUEP.txt, qwtrend2018v4.exe, salflibc.dll, and StartQWTrendV4.R (Vecchia and Nustad, 2020). The folder contains: six items required to run the R–QWTREND trend analysis tool; a readme.txt file; a flowtrendData.RData file; an allsiteinfo.table.csv file; a folder called "scripts"; and a folder called "waterqualitydata". The "scripts" folder contains the scripts that can be used to reproduce the results found in the USGS Scientific Investigations Report referenced above. The "waterqualitydata" folder contains machine-readable .csv files, named with the convention site_ions or site_nuts for major-ion and nutrient constituents, holding the water-quality data used for the trend analysis at each site.

    R–QWTREND is a software package for analyzing trends in stream-water quality. The package is a collection of functions written in R (R Development Core Team, 2019), an open-source language and general environment for statistical computing and graphics. The following system requirements are necessary for using R–QWTREND:

    • Windows 10 operating system
    • R (version 3.4 or later; 64-bit recommended)
    • RStudio (version 1.1.456 or later)

    An accompanying report (Vecchia and Nustad, 2020) serves as the formal documentation for R–QWTREND.

    Vecchia, A.V., and Nustad, R.A., 2020, Time-series model, statistical methods, and software documentation for R–QWTREND—An R package for analyzing trends in stream-water quality: U.S. Geological Survey Open-File Report 2020–1014, 51 p., https://doi.org/10.3133/ofr20201014.

    R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
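
    As orientation, an R session for this release might begin as follows; treat this flow as an assumption inferred from the file list above rather than the tool's confirmed entry points, which are documented in Vecchia and Nustad (2020).

        # Assumed session setup for the R-QWTREND release described above.
        setwd("C:/RQWTREND")            # folder unzipped from this data release
        source("StartQWTrendV4.R")      # assumed to load the V4 scripts and link
                                        # qwtrend2018v4.exe / salflibc.dll
        load("flowtrendData.RData")     # streamflow and water-quality objects
        siteinfo <- read.csv("allsiteinfo.table.csv")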

  3. Additional file 1: of trumpet: transcriptome-guided quality assessment of...

    • springernature.figshare.com
    zip
    Updated Mar 4, 2024
    Cite
    Teng Zhang; Shao-Wu Zhang; Lin Zhang; Jia Meng (2024). Additional file 1: of trumpet: transcriptome-guided quality assessment of m6A-seq data [Dataset]. http://doi.org/10.6084/m9.figshare.6813755.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Teng Zhang; Shao-Wu Zhang; Lin Zhang; Jia Meng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source code of the trumpet R package. (ZIP 2229 kb)

  4. Data from: Analysis and Visualization of Quantitative Proteomics Data Using...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Sep 10, 2024
    Cite
    Yi Hsiao; Haijian Zhang; Ginny Xiaohe Li; Yamei Deng; Fengchao Yu; Hossein Valipour Kahrood; Joel R. Steele; Ralf B. Schittenhelm; Alexey I. Nesvizhskii (2024). Analysis and Visualization of Quantitative Proteomics Data Using FragPipe-Analyst [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00294.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    ACS Publications
    Authors
    Yi Hsiao; Haijian Zhang; Ginny Xiaohe Li; Yamei Deng; Fengchao Yu; Hossein Valipour Kahrood; Joel R. Steele; Ralf B. Schittenhelm; Alexey I. Nesvizhskii
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The FragPipe computational proteomics platform is gaining widespread popularity among the proteomics research community because of its fast processing speed and user-friendly graphical interface. Although FragPipe produces well-formatted output tables that are ready for analysis, there is still a need for an easy-to-use and user-friendly downstream statistical analysis and visualization tool. FragPipe-Analyst addresses this need by providing an R Shiny web server to assist FragPipe users in conducting downstream analyses of the resulting quantitative proteomics data. It supports major quantification workflows, including label-free quantification, tandem mass tags, and data-independent acquisition. FragPipe-Analyst offers a range of useful functionalities, such as various missing value imputation options, data quality control, unsupervised clustering, differential expression (DE) analysis using limma, and gene ontology and pathway enrichment analysis using Enrichr. To support advanced analysis and customized visualizations, we also developed FragPipeAnalystR, an R package encompassing all FragPipe-Analyst functionalities that is extended to support site-specific analysis of post-translational modifications (PTMs). FragPipe-Analyst and FragPipeAnalystR are both open-source and freely available.
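
    For readers unfamiliar with the limma step mentioned above, this is the generic limma pattern on a log2-intensity matrix; it is illustrative only, not FragPipe-Analyst's own wrapper code.

        # Generic limma differential-expression sketch on simulated log2 intensities.
        library(limma)
        set.seed(1)
        mat <- matrix(rnorm(600), nrow = 100,
                      dimnames = list(paste0("prot", 1:100), paste0("s", 1:6)))
        group  <- factor(rep(c("ctrl", "treat"), each = 3))
        design <- model.matrix(~ group)
        fit <- eBayes(lmFit(mat, design))    # per-protein linear models, moderated t
        topTable(fit, coef = 2, number = 5)  # top differential proteins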

  5. Using Descriptive Statistics to Analyse Data in R

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Enrico68 (2024). Using Descriptive Statistics to Analyse Data in R [Dataset]. https://www.kaggle.com/datasets/enrico68/using-descriptive-statistics-to-analyse-data-in-r
    Explore at:
    Available download formats: zip (105561 bytes)
    Dataset updated
    May 9, 2024
    Authors
    Enrico68
    Description

    • Load and view a real-world dataset in RStudio

    • Calculate “Measure of Frequency” metrics

    • Calculate “Measure of Central Tendency” metrics

    • Calculate “Measure of Dispersion” metrics

    • Use R’s in-built functions for additional data quality metrics

    • Create a custom R function to calculate descriptive statistics on any given dataset (a minimal sketch follows)
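
    A minimal sketch of such a custom function, demonstrated on a built-in R dataset:

        # Frequency, central-tendency, and dispersion metrics for a numeric vector.
        describe <- function(x) {
          x <- x[!is.na(x)]
          mode_val <- as.numeric(names(which.max(table(x))))  # most frequent value
          c(n = length(x), mean = mean(x), median = median(x), mode = mode_val,
            sd = sd(x), var = var(x), iqr = IQR(x), range = diff(range(x)))
        }
        describe(airquality$Temp)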

  6. Data from: Supplementary tables: MetaFetcheR: An R package for complete...

    • researchdata.se
    Updated Jun 24, 2024
    + more versions
    Cite
    Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti (2024). Supplementary tables:MetaFetcheR: An R package for complete mapping of small compound data [Dataset]. http://doi.org/10.57804/7sf1-fw75
    Explore at:
    Available download formats: (78625), (728116)
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Uppsala University
    Authors
    Sara A. Yones; Rajmund Csombordi; Jan Komorowski; Klev Diamanti
    Description

    The dataset includes a PDF file containing the results and an Excel file with the following tables:

    Table S1. Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al.
    Table S2. Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al.
    Table S3. Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool using Diamanti et al.
    Table S4. Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool for Priolo et al.
    Table S5. Data quality test results for running 100 iterations on the HMDB database.
    Table S6. Data quality test results for running 100 iterations on the KEGG database.
    Table S7. Data quality test results for running 100 iterations on the ChEBI database.
    Table S8. Data quality test results for running 100 iterations on the PubChem database.
    Table S9. Data quality test results for running 100 iterations on the LIPID MAPS database.
    Table S10. The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al.
    Table S11. An example of an input matrix for MetaFetcheR.
    Table S12. Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al.
    Table S13. Data set from Diamanti et al.
    Table S14. Data set from Priolo et al.
    Table S15. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al.
    Table S16. Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al.
    Table S17. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
    Table S18. Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
    (See the "index" tab in the Excel file for more information.)

    Small-compound databases contain a large amount of information about metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish a means of data access at the early stages of a project can lead to mislabelled compounds, reduced statistical power, and long delays in the delivery of results.

    We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies, and covers a variety of data-fetching use cases. We showed that MetaFetcheR outperformed existing approaches and databases by benchmarking the algorithm in three independent case studies based on two published datasets.
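
    The identifier-linking problem can be illustrated with a toy example; the data frames and merge chain below are hypothetical and do not reflect MetaFetcheR's actual API.

        # Chain partial mappings so an HMDB record inherits a ChEBI ID via KEGG.
        hmdb <- data.frame(hmdb_id = c("HMDB0000122", "HMDB0000161"),
                           kegg_id = c("C00031", NA))
        kegg <- data.frame(kegg_id  = c("C00031", "C00041"),
                           chebi_id = c("17234", "16977"))
        merge(hmdb, kegg, by = "kegg_id", all.x = TRUE)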

    The dataset was originally published in DiVA and moved to SND in 2024.

  7. Data from: Concepts and Software Package for Efficient Quality Control in...

    • acs.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Mathias Kuhring; Alina Eisenberger; Vanessa Schmidt; Nicolle Kränkel; David M. Leistner; Jennifer Kirwan; Dieter Beule (2023). Concepts and Software Package for Efficient Quality Control in Targeted Metabolomics Studies: MeTaQuaC [Dataset]. http://doi.org/10.1021/acs.analchem.0c00136.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Mathias Kuhring; Alina Eisenberger; Vanessa Schmidt; Nicolle Kränkel; David M. Leistner; Jennifer Kirwan; Dieter Beule
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Targeted quantitative mass spectrometry metabolite profiling is the workhorse of metabolomics research. Robust and reproducible data are essential for confidence in analytical results and are particularly important with large-scale studies. Commercial kits are now available which use carefully calibrated and validated internal and external standards to provide such reliability. However, they are still subject to processing and technical errors in their use and should be subject to a laboratory’s routine quality assurance and quality control measures to maintain confidence in the results. We discuss important systematic and random measurement errors when using these kits and suggest measures to detect and quantify them. We demonstrate how wider analysis of the entire data set alongside standard analyses of quality control samples can be used to identify outliers and quantify systematic trends to improve downstream analysis. Finally, we present the MeTaQuaC software which implements the above concepts and methods for Biocrates kits and other target data sets and creates a comprehensive quality control report containing rich visualization and informative scores and summary statistics. Preliminary unsupervised multivariate analysis methods are also included to provide rapid insight into study variables and groups. MeTaQuaC is provided as an open source R package under a permissive MIT license and includes detailed user documentation.
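
    A generic QC check in the spirit of the scores described above (not MeTaQuaC's actual API): flag metabolites whose coefficient of variation across pooled QC injections exceeds 30%.

        set.seed(2)
        qc <- matrix(abs(rnorm(50, mean = 100, sd = 15)), nrow = 10,
                     dimnames = list(paste0("met", 1:10), paste0("QC", 1:5)))
        cv <- apply(qc, 1, function(x) 100 * sd(x) / mean(x))  # percent CV per metabolite
        names(cv)[cv > 30]                                     # metabolites failing QC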

  8. Summary of the data preprocessing software currently available for scRNA-seq...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Luyi Tian; Shian Su; Xueyi Dong; Daniela Amann-Zalcenstein; Christine Biben; Azadeh Seidi; Douglas J. Hilton; Shalin H. Naik; Matthew E. Ritchie (2023). Summary of the data preprocessing software currently available for scRNA-seq analysis and the particular tasks covered by each package. [Dataset]. http://doi.org/10.1371/journal.pcbi.1006361.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Luyi Tian; Shian Su; Xueyi Dong; Daniela Amann-Zalcenstein; Christine Biben; Azadeh Seidi; Douglas J. Hilton; Shalin H. Naik; Matthew E. Ritchie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the data preprocessing software currently available for scRNA-seq analysis and the particular tasks covered by each package.

  9. Data from: Water-quality trends and trend component estimates for the...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water-quality trends and trend component estimates for the Nation's rivers and streams using Weighted Regressions on Time, Discharge, and Season (WRTDS) models and generalized flow normalization, 1972-2012 [Dataset]. https://catalog.data.gov/dataset/water-quality-trends-and-trend-component-estimates-for-the-nations-rivers-and-streams-1972
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Nonstationary streamflow due to environmental and human-induced causes can affect water quality over time, yet these effects are poorly accounted for in water-quality trend models. This data release provides instream water-quality trends and estimates of two components of change, for sites across the Nation previously presented in Oelsner et al. (2017). We used previously calibrated Weighted Regressions on Time, Discharge, and Season (WRTDS) models published in De Cicco et al. (2017) to estimate instream water-quality trends and associated uncertainties with the generalized flow normalization procedure available in EGRET version 3.0 (Hirsch et al., 2018a) and EGRETci version 2.0 (Hirsch et al., 2018b). The procedure allows for nonstationarity in the flow regime, whereas previous versions of EGRET assumed streamflow stationarity. Water-quality trends of annual mean concentrations and loads (also referred to as fluxes) are provided as an annual series and the change between the start and end year for four trend periods (1972-2012, 1982-2012, 1992-2012, and 2002-2012). Information about the sites, including the collecting agency and associated streamflow gage, and information about site selection and the data screening process can be found in Oelsner et al. (2017). This data release includes results for 19 water-quality parameters including nutrients (ammonia, nitrate, filtered and unfiltered orthophosphate, total nitrogen, total phosphorus), major ions (calcium, chloride, magnesium, potassium, sodium, sulfate), salinity indicators (specific conductance, total dissolved solids), carbon (alkalinity, dissolved organic carbon, total organic carbon), and sediment (total suspended solids, suspended-sediment concentration) at over 1,200 sites. Note, the number of parameters with data varies by site with most sites having data for 1-4 parameters. Each water-quality trend was parsed into two components of change: (1) the streamflow trend component (QTC) and (2) the watershed management trend component (MTC). The QTC is an indicator of the amount of change in the water-quality trend attributed to changes in the streamflow regime, and the MTC is an indicator of the amount of change in the water-quality trend that may be attributed to human actions and changes in point and non-point sources in a watershed. Note, the MTC is referred to as the concentration-discharge trend component (CQTC) in the EGRET version 3.0 software. For our work, we chose to refer to this trend component as the MTC because it provides a more conceptual description (Murphy and Sprague, 2019). The trend results presented here expand upon the results in De Cicco et al. (2017) and Oelsner et al. (2017), which were analyzed using flow-normalization under the stationary streamflow assumption. The results presented in this data release are intended to complement these previously published results and support investigations into natural and human effects on water-quality trends across the United States. Data preparation information and WRTDS model specifications are described in Oelsner et al. (2017) and Murphy and Sprague (2019). This work was completed as part of the National Water-Quality Assessment (NAWQA) project of the National Water-Quality Program. 
    De Cicco, L.A., Sprague, L.A., Murphy, J.C., Riskin, M.L., Falcone, J.A., Stets, E.G., Oelsner, G.P., and Johnson, H.M., 2017, Water-quality and streamflow datasets used in the Weighted Regressions on Time, Discharge, and Season (WRTDS) models to determine trends in the Nation’s rivers and streams, 1972-2012 (ver. 1.1, July 7, 2017): U.S. Geological Survey data release, https://doi.org/10.5066/F7KW5D4H.

    Hirsch, R., De Cicco, L., Watkins, D., Carr, L., and Murphy, J., 2018a, EGRET: Exploration and Graphics for RivEr Trends, version 3.0, https://CRAN.R-project.org/package=EGRET.

    Hirsch, R., De Cicco, L., and Murphy, J., 2018b, EGRETci: Exploration and Graphics for RivEr Trends (EGRET) Confidence Intervals, version 2.0, https://CRAN.R-project.org/package=EGRETci.

    Murphy, J.C., and Sprague, L.A., 2019, Water-quality trends in US rivers: Exploring effects from streamflow trends and changes in watershed management: Science of the Total Environment, v. 656, p. 645-658, https://doi.org/10.1016/j.scitotenv.2018.11.255.

    Oelsner, G.P., Sprague, L.A., Murphy, J.C., Zuellig, R.E., Johnson, H.M., Ryberg, K.R., Falcone, J.A., Stets, E.G., Vecchia, A.V., Riskin, M.L., De Cicco, L.A., Mills, T.J., and Farmer, W.H., 2017, Water-quality trends in the Nation’s rivers and streams, 1972–2012—Data preparation, statistical methods, and trend results (ver. 2.0, October 2017): U.S. Geological Survey Scientific Investigations Report 2017–5006, 136 p., https://doi.org/10.3133/sir20175006.
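
    A hedged sketch of the kind of workflow behind these results, assuming EGRET >= 3.0's runPairs() with the windowSide argument controlling generalized flow normalization; the site and parameter codes below are placeholders.

        library(EGRET)
        Daily  <- readNWISDaily("01491000", "00060", "1972-01-01", "2012-12-31")
        Sample <- readNWISSample("01491000", "00631", "1972-01-01", "2012-12-31")
        INFO   <- readNWISInfo("01491000", "00631", interactive = FALSE)
        eList  <- mergeReport(INFO, Daily, Sample)
        eList  <- modelEstimation(eList)           # fit the WRTDS model
        pair   <- runPairs(eList, year1 = 1972, year2 = 2012,
                           windowSide = 7)         # generalized flow normalization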

  10. Air Quality

    • data.ccrpc.org
    csv
    Updated Jun 13, 2025
    Cite
    Champaign County Regional Planning Commission (2025). Air Quality [Dataset]. https://data.ccrpc.org/dataset/air-quality
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 13, 2025
    Dataset authored and provided by
    Champaign County Regional Planning Commission
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This indicator shows how many days per year were assessed to have air quality that was worse than “moderate” in Champaign County, according to the U.S. Environmental Protection Agency’s (U.S. EPA) Air Quality Index Reports. The period of analysis is 1980-2024, and the U.S. EPA’s air quality ratings analyzed here are as follows, from best to worst: “good,” “moderate,” “unhealthy for sensitive groups,” “unhealthy,” “very unhealthy,” and "hazardous."[1]

    In 2024, the number of days rated to have air quality worse than moderate was 0. This is a significant decrease from 2023, which had 13 such days, the highest count in the 21st century; that figure was likely due to air pollution from the unprecedented Canadian wildfire smoke in Summer 2023.

    While there has been no consistent year-to-year trend in the number of days per year rated to have air quality worse than moderate, the number of days in peak years had decreased from 2000 through 2022. Where peak years before 2000 had between one and two dozen days with air quality worse than moderate (e.g., 1983, 18 days; 1988, 23 days; 1994, 17 days; 1999, 24 days), the year with the greatest number of days with air quality worse than moderate from 2000-2022 was 2002, with 10 days. There were several years between 2006 and 2022 that had no days with air quality worse than moderate.

    This data is sourced from the U.S. EPA’s Air Quality Index Reports. The reports are released annually, and our period of analysis is 1980-2024. The Air Quality Index Report website does caution that "[a]ir pollution levels measured at a particular monitoring site are not necessarily representative of the air quality for an entire county or urban area," and recommends that data users not compare air quality between different locations[2].

    [1] Environmental Protection Agency. (1980-2024). Air Quality Index Reports. (Accessed 13 June 2025).

    [2] Ibid.

    Source: Environmental Protection Agency. (1980-2024). Air Quality Index Reports. https://www.epa.gov/outdoor-air-quality-data/air-quality-index-report. (Accessed 13 June 2025).
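
    Reproducing this indicator from an exported AQI report might look like the following; the file and column names are hypothetical.

        aqi <- read.csv("aqi_report_champaign.csv")
        worse <- c("Unhealthy for Sensitive Groups", "Unhealthy",
                   "Very Unhealthy", "Hazardous")
        table(aqi$year[aqi$category %in% worse])  # days worse than moderate, per year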

  11. Replication Data for: Measuring Wikipedia Article Quality in One Dimension...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Mar 11, 2024
    Cite
    Nathan TeBlunthuis (2024). Replication Data for: Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression [Dataset]. http://doi.org/10.7910/DVN/U5V0G1
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Nathan TeBlunthuis
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression, published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality that is amenable to statistical analysis and well-calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository, please let me know via an email to nathante@uw.edu.

    Replicating the Analysis from the OpenSym Paper

    This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset; they can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages, and the prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages, which should make it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz. Extracting this with tar xzvf code.tar.gz is the first step.

    Getting Set Up

    You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make installed, as they should be on any Unix system. To install brms you need a working C++ compiler; if you run into trouble, see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring, which was done on Ubuntu 18.04.5, and building, which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz, and pyRembr.tar.gz archives. Then, install the following:

    Python Packages. Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install.

    R Packages. Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the Rstan installation instructions referenced above.

    Drawing a Sample of Labeled Articles

    I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive; those only interested in replicating the models and analyses should skip this section.

    Extracting Metadata from Wikipedia Dumps. Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files; the version of wikiq that was used is provided here. Running wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak; for transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

    Obtaining Quality Labels for Articles. We obtain up-to-date labels for each article using the articlequality python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the XML dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

    Taking a Sample of Quality Labels. I used Apache Spark to merge the metadata from wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
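
    As a simplified stand-in for the paper's method (not its actual code), an ordinal regression that calibrates a one-dimensional score to ordered quality labels can be fit with MASS::polr:

        library(MASS)
        set.seed(3)
        d <- data.frame(score = rnorm(500))                     # simulated 1-D score
        d$quality <- cut(d$score + rnorm(500, sd = 0.5), breaks = 6,
                         labels = c("Stub", "Start", "C", "B", "GA", "FA"),
                         ordered_result = TRUE)                 # ordered quality labels
        fit <- polr(quality ~ score, data = d)                  # ordered logit
        fit$zeta                                                # thresholds between classes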

  12. Data from: specleanr: An R package for automated flagging of environmental...

    • datadryad.org
    zip
    Updated Nov 4, 2025
    Cite
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt (2025). specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows [Dataset]. http://doi.org/10.5061/dryad.6m905qgd7
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 4, 2025
    Dataset provided by
    Dryad
    Authors
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt
    Time period covered
    Sep 24, 2025
    Description

    specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

    Dataset DOI: 10.5061/dryad.6m905qgd7

    Description of the data and file structure

    1. The files include species occurrences from the Global Biodiversity Information Facility. Refer to the data links file to access the original data.
    2. Environmental data were retrieved from CHELSA and Hydrography90m. These include the bioclimatic layers bio1 to bio19 from CHELSA and cti, order_strahler, slopecurv_dw_cel, accumulation, spi, sti, and subcatchment from Hydrography90m. The data links file has the URLs to connect to the original datasets.
    3. Model outputs are the data packaged after model implementation, including modeloutput and modeloutput2.
    4. The sdm function is implemented in the sdm_function file.
    5. The sdmodeling file processes all files.
    6. Species predictions are archived in the species model prediction output.

    Files and variables

    ...
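
    A generic environmental-outlier flag in the spirit of specleanr (not its actual API), using the 1.5 * IQR rule on a single covariate:

        flag_iqr <- function(x) {
          q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
          x < q[1] - 1.5 * IQR(x, na.rm = TRUE) |
          x > q[2] + 1.5 * IQR(x, na.rm = TRUE)
        }
        set.seed(4)
        occ <- data.frame(bio1 = c(rnorm(98, 10, 2), 35, -20))  # two planted outliers
        sum(flag_iqr(occ$bio1))                                 # count of flagged records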

  13. Data from: aniMotum, an R package for animal movement data: rapid quality...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2 more
    zip
    Updated Dec 23, 2022
    Cite
    Ian Jonsen; James Grecian; Lachlan Phillips; Gemma Carroll; Clive McMahon; Robert Harcourt; Mark Hindell; Toby Patterson (2022). aniMotum, an R package for animal movement data: rapid quality control, behavioural estimation and simulation [Dataset]. http://doi.org/10.5061/dryad.qz612jmjw
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 23, 2022
    Dataset provided by
    Durham University
    Sydney Institute of Marine Science
    Macquarie University
    CSIRO Oceans and Atmosphere
    University of Tasmania
    Environmental Defence Fund
    Authors
    Ian Jonsen; James Grecian; Lachlan Phillips; Gemma Carroll; Clive McMahon; Robert Harcourt; Mark Hindell; Toby Patterson
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description
    1. Animal tracking data are indispensable for understanding the ecology, behaviour and physiology of mobile or cryptic species. Meaningful signals in these data can be obscured by noise due to imperfect measurement technologies, requiring rigorous quality control as part of any comprehensive analysis.
    2. State-space models are powerful tools that separate signal from noise. These tools are ideal for quality control of error-prone location data and for inferring where animals are and what they are doing when they record or transmit other information. However, these statistical models can be challenging and time-consuming to fit to diverse animal tracking data sets.
    3. The R package aniMotum eases the tasks of conducting quality control on and inference of changes in movement from animal tracking data. This is achieved via: 1) a simple but extensible workflow that accommodates both novice and experienced users; 2) automated processes that alleviate complexity from data processing and model specification/fitting steps; 3) simple movement models coupled with a powerful numerical optimization approach for rapid and reliable model fitting.
    4. We highlight aniMotum's capabilities through three applications to real animal tracking data. Full R code for these and additional applications is included as Supporting Information so users can gain a deeper understanding of how to use aniMotum for their own analyses. A minimal usage sketch follows.
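
    A minimal sketch, assuming the bundled sese2 example track and the vignette-style arguments below are current; check the package documentation before use.

        library(aniMotum)
        fit <- fit_ssm(sese2,            # example southern elephant seal tracks
                       model = "crw",    # correlated random walk
                       time.step = 24)   # regularize to 24 h intervals
        plot(fit, what = "predicted")    # quality-controlled, regularized locations
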
  14. Mexico Hourly Air Pollution (2010-2021)

    • kaggle.com
    zip
    Updated Sep 6, 2022
    Cite
    eliana kai juarez (2022). Mexico Hourly Air Pollution (2010-2021) [Dataset]. https://www.kaggle.com/datasets/elianaj/mexico-air-quality-dataset
    Explore at:
    Available download formats: zip (124753030 bytes)
    Dataset updated
    Sep 6, 2022
    Authors
    eliana kai juarez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Mexico
    Description

    Air pollution is one of the leading causes of premature deaths across the world, and it is often under-monitored in developing countries. Mexico presents an interesting case study, with greatly improved air quality thanks to regulation since the 1990s in Mexico City, while other places have continued to have dangerous air pollution levels, responsible for tens of thousands of deaths per year.

    Public pollution datasets are crucial to research and policy-making regarding public health. Mexico's air quality information program is named SINAICA, but it can be time-consuming to use, as data can only be retrieved manually one month at a time. Using the rSinaica R package by Diego Valle-Jones, this dataset compiles all recorded hourly values for 28 pollution and weather variables from the years 2010-2021 for all stations in Mexico.

    Air pollution is one of the largest causes of premature deaths around the world, accounting for approximately 1 in 5 deaths, and it is far less monitored in developing nations. Mexico presents an interesting case study, with much-improved pollutant levels thanks to regulations in place since the 1990s in Mexico City, while other places have maintained dangerous pollutant levels. These are responsible for tens of thousands of preventable deaths each year in Mexico.

    Public air pollution datasets are crucial for research and for the formulation of laws that protect public health. In Mexico, the program called the 'Sistema Nacional de Información de la Calidad del Aire', or SINAICA, provides ways of obtaining data, but it requires a lot of time because information can only be retrieved by hand, one month, parameter, and station at a time. Using the rSinaica R package created by Diego Valle-Jones, this dataset includes all hourly measurements for 28 pollution and weather variables from the years 2010-2021 for all stations in Mexico.
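
    A sketch of collapsing the compiled hourly records to daily station means; the file and column names for this Kaggle CSV are assumptions.

        library(dplyr)
        pm <- read.csv("mexico_hourly_pollution.csv")   # hypothetical file name
        daily <- pm %>%
          filter(parameter == "PM10") %>%
          mutate(date = as.Date(datetime)) %>%
          group_by(station_id, date) %>%
          summarise(pm10 = mean(value, na.rm = TRUE), .groups = "drop")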

  15. Scripts and data to run and produce results from R-QWTREND models

    • datasets.ai
    • data.usgs.gov
    • +1 more
    55
    Updated Aug 24, 2023
    + more versions
    Cite
    Department of the Interior (2023). Scripts and data to run and produce results from R-QWTREND models [Dataset]. https://datasets.ai/datasets/scripts-and-data-to-run-and-produce-results-from-r-qwtrend-models-b9fa7
    Explore at:
    Available download formats: 55
    Dataset updated
    Aug 24, 2023
    Dataset authored and provided by
    Department of the Interior
    Description

    This child page contains a zipped folder which contains all items necessary to run trend models and produce results published in U.S. Geological Survey Scientific Investigations Report 2022–XXXX [Nustad, R.A., and Tatge, W.S., 2023, Comprehensive Water-Quality Trend Analysis for Selected Sites and Constituents in the International Souris River Basin, Saskatchewan and Manitoba, Canada and North Dakota, United States, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2023-XXXX, XX p.]. To run the R-QWTREND program in R, 6 files are required and each is included in this child page: prepQWdataV4.txt, runQWmodelV4.txt, plotQWtrendV4.txt, qwtrend2018v4.exe, salflibc.dll, and StartQWTrendV4.R (Vecchia and Nustad, 2020). The folder contains: three items required to run the R–QWTREND trend analysis tool; a README.txt file; a folder called "dataout"; and a folder called "scripts". The "scripts" folder contains the scripts that can be used to reproduce the results found in the USGS Scientific Investigations Report referenced above. The "dataout" folder contains folders for each site, which hold .RData files named with the convention site_flow for streamflow data and site_qw_XXX depending upon the group of constituents (MI, NUT, or TM).

    R–QWTREND is a software package for analyzing trends in stream-water quality. The package is a collection of functions written in R (R Development Core Team, 2019), an open-source language and general environment for statistical computing and graphics. The following system requirements are necessary for using R–QWTREND:

    • Windows 10 operating system
    • R (version 3.4 or later; 64-bit recommended)
    • RStudio (version 1.1.456 or later)

    An accompanying report (Vecchia and Nustad, 2020) serves as the formal documentation for R–QWTREND.

    Vecchia, A.V., and Nustad, R.A., 2020, Time-series model, statistical methods, and software documentation for R–QWTREND—An R package for analyzing trends in stream-water quality: U.S. Geological Survey Open-File Report 2020–1014, 51 p., https://doi.org/10.3133/ofr20201014.

    R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.

  16. Additional file 21 of BREC: an R package/Shiny app for automatically...

    • springernature.figshare.com
    xlsx
    Updated Jun 10, 2023
    Cite
    Yasmine Mansour; Annie Chateau; Anna-Sophie Fiston-Lavier (2023). Additional file 21 of BREC: an R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes [Dataset]. http://doi.org/10.6084/m9.figshare.15128355.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yasmine Mansour; Annie Chateau; Anna-Sophie Fiston-Lavier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 21. This is a spreadsheet (.xlsx file). It includes accessible links to the genetic and physical maps for the 44 genomes mentioned in Additional file 20. Adapted from the Additional file named Table S1, published by [41].

  17. USGS dataretrieval Python Package Usage Examples

    • dataone.org
    • hydroshare.org
    • +1 more
    Updated Aug 31, 2024
    Cite
    Jeffery S. Horsburgh; Amber Spackman Jones; Scott Steven Black; Timothy O. Hodson (2024). USGS dataretrieval Python Package Usage Examples [Dataset]. https://dataone.org/datasets/sha256%3A4f7a003397b7f344e7b0ad4d235a36dd4f2a937ba42e9ec4fd298891971da5be
    Explore at:
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Hydroshare
    Authors
    Jeffery S. Horsburgh; Amber Spackman Jones; Scott Steven Black; Timothy O. Hodson
    Description

    This resource contains a set of Jupyter Notebooks that provide Python code examples for using the Python dataretrieval package to retrieve data from the United States Geological Survey's (USGS) National Water Information System (NWIS). The dataretrieval package is a Python alternative to USGS-R's dataRetrieval package for the R statistical computing environment, which is used for obtaining USGS or Environmental Protection Agency (EPA) water quality data, streamflow data, and metadata directly from web services. The dataretrieval Python package is an alternative to the R package, not a port: it reproduces the functionality of the R package, but its organization and functionality differ to some degree. The dataretrieval package was originally created by Timothy Hodson at USGS. Additional contributions to the Python package and these Jupyter Notebook examples were made at Utah State University under funding from the National Science Foundation. A link to the GitHub source code repository for the dataretrieval package is provided in the related resources section below.
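
    For comparison, this is the kind of R dataRetrieval call that the Python package mirrors (standard dataRetrieval API; the site number is just an example):

        library(dataRetrieval)
        q <- readNWISdv(siteNumbers = "09380000",   # Colorado River at Lees Ferry
                        parameterCd = "00060",      # daily mean discharge
                        startDate = "2020-01-01", endDate = "2020-12-31")
        head(renameNWISColumns(q))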

  18. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods: eLAB Development and Source Code (R statistical software)

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDWs such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
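
    A toy version of that key-value remapping step (the tables below are hypothetical; the real ~300-row lookup ships with eLAB):

        lookup <- data.frame(
          raw_name = c("Potassium", "Potassium-External", "Potassium(POC)"),
          dd_code  = "k_serum")                 # hypothetical DD code
        labs <- data.frame(raw_name = c("Potassium(POC)", "Potassium"),
                           value = c(4.1, 3.9))
        merge(labs, lookup, by = "raw_name")    # labs now carry the DD code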

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
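
    The shape of those univariable models, on illustrative data, using the survival package:

        library(survival)
        df <- data.frame(os_months = c(12, 30, 7, 22, 15),   # illustrative values
                         death     = c(1, 0, 1, 0, 1),       # 1 = event, 0 = censored
                         potassium = c(4.1, 3.9, 5.2, 4.4, 4.8))
        coxph(Surv(os_months, death) ~ potassium, data = df)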

  19. mirrorCheck results for 4 public datasets

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Jul 4, 2025
    Cite
    Katherine Scull (2025). mirrorCheck results for 4 public datasets [Dataset]. http://doi.org/10.26180/27289017.V1
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Monash University
    Authors
    Katherine Scull
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each zipped folder contains results files from reanalysis of public data in our publication, "mirrorCheck: an R package facilitating informed use of DESeq2’s lfcShrink() function for differential gene expression analysis of clinical samples" (see also the Collection description).

    These files were produced by rendering the Quarto documents provided in the supplementary data with the publication (one per dataset). The Quarto documents for the three main analyses (the COVID, BRCA, and cell-line datasets) performed differential gene expression (DGE) analysis using both DESeq2 with lfcShrink(), via our R package mirrorCheck, and edgeR. Each zipped folder here contains two folders, one for each DGE analysis. Since DESeq2 was run on data without prior data cleaning, with prefiltering, or after Surrogate Variable Analysis, the 'mirrorCheck output' folders themselves contain three sub-folders titled 'DESeq_noclean', 'DESeq_prefilt', and 'DESeq_sva'. The COVID dataset also has a folder with results from Gene Set Enrichment Analysis. Finally, the fourth folder contains results from a tutorial/vignette-style supplementary file using the Bioconductor "parathyroidSE" dataset. This analysis utilised only DESeq2, with both data cleaning methods and two different design formulae, resulting in five sub-folders in the zipped folder.
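
    The DESeq2-plus-lfcShrink() pattern that mirrorCheck builds on, shown on simulated counts (standard DESeq2 API; the apeglm package must be installed for this shrinkage type):

        library(DESeq2)
        dds <- makeExampleDESeqDataSet(n = 1000, m = 6)  # simulated counts
        dds <- DESeq(dds)
        res <- lfcShrink(dds, coef = "condition_B_vs_A", type = "apeglm")
        head(res)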

  20. Data from: Water stargrass biomass, stream metabolism estimates, and...

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 22, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water stargrass biomass, stream metabolism estimates, and nutrient data quality control data for the lower Yakima River: June 2018 through September 2020 (ver. 2.0, August 2025) [Dataset]. https://catalog.data.gov/dataset/water-stargrass-biomass-stream-metabolism-estimates-and-nutrient-data-quality-control-data
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Yakima River
    Description

    This dataset provides information collected at three stream monitoring sites on the lower Yakima River in Washington State from June 2018 through September 2020 by the U.S. Geological Survey. Sample locations included the Yakima River at Prosser, WA (USGS Station 12509489), the Yakima River at Kiona, WA (USGS Station 12510500), and the Yakima River at Van Giesen bridge near Richland, WA (USGS Station 12511800). Water stargrass (Heteranthera dubia) biomass was collected six times from August 2018 to August 2020 from all three sites. Plant samples were dried, and areal mass was measured in grams per square meter (g/m2). These data are provided in the file "Yakima.biomass.csv". The presence/absence of water stargrass surrounding each monitoring station was determined by recording whether water stargrass was present at 1-meter intervals along 5 to 10 transects at each monitoring station; these counts are provided in the file "Yakima.biomass.point.counts.csv".

    In addition, stream metabolism was modeled at two sites, the Yakima River at Kiona, WA from June 2018 to October 2020 and the Yakima River at Van Giesen bridge near Richland, WA from August 2018 to March 2020, using the streamMetabolizer package in R. For each site, four files are provided: (1) the input data, which include continuous dissolved oxygen, solar time, water temperature, light, and stream depth; (2) the output file containing the daily metabolism estimates; (3) a site-specific html 'guide' for running the streamMetabolizer package in R; and (4) a README text file explaining file contents. Model input data were downloaded directly from the USGS National Water Information System (NWIS) (dissolved oxygen and temperature), downloaded directly from data loggers deployed in the field (light and depth), or calculated in the streamMetabolizer program (DO saturation and solar time).

    Finally, water quality control data from discrete nutrient sampling are provided. These data include results from 6 field blanks (blanks.csv) and 6 sets of field replicates (replicates.csv) collected during this study. Refer to "version_notes.txt" for specific updates to version 2.0 of this data release. First release: 2024-07-29 (ver. 1.0). Revised: 2025-08-13 (ver. 2.0).
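
    A hedged streamMetabolizer sketch following its vignette (an MLE model on the package's bundled demo data); helper names should be checked against current documentation.

        library(streamMetabolizer)
        dat <- data_metab(num_days = "3", res = "15")    # bundled demo time series
        mm  <- metab(specs(mm_name("mle")), data = dat)  # fit daily GPP, ER, K600
        predict_metab(mm)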
