53 datasets found
  1. Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Goedhart, Joachim (2024). Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4056965
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Goedhart, Joachim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
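    The tutorial content itself lives at the blog link above; as a rough sketch of the kind of conversion it teaches, here is a minimal wide-to-tidy reshape with tidyr (the column names are invented for illustration, not taken from the tutorial's files):

```r
library(tidyr)

# Hypothetical wide-format spreadsheet: one measurement column per condition
wide <- data.frame(
  sample  = c("s1", "s2", "s3"),
  control = c(1.2, 1.4, 1.1),
  treated = c(2.3, 2.1, 2.6)
)

# Tidy format: one row per observation, one column per variable
tidy <- pivot_longer(wide,
                     cols      = c(control, treated),
                     names_to  = "condition",
                     values_to = "value")
```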

  2. Comparing spatial regression to random forests for large environmental data sets

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Eric W. Fox; Jay M. Ver Hoef; Anthony R. Olsen (2023). Comparing spatial regression to random forests for large environmental data sets [Dataset]. http://doi.org/10.1371/journal.pone.0229509
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Eric W. Fox; Jay M. Ver Hoef; Anthony R. Olsen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates with nonlinear relationships, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records that are spatially autocorrelated. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. A primary application is mapping MMI predictions and prediction errors at 1.1 million perennial stream reaches across the conterminous United States. For the spatial regression model, we develop a novel transformation procedure that estimates Box-Cox transformations to linearize covariate relationships and handles possibly zero-inflated covariates. We find that the spatial regression model with transformations, and a subsequent selection of significant covariates, has cross-validation performance comparable to random forests. We also find that prediction interval coverage is close to nominal for each method, but that spatial regression prediction intervals tend to be narrower and have less variability than quantile regression forest prediction intervals. A simulation study is used to generalize results and clarify advantages of each modeling approach.
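    The authors' transformation procedure is not reproduced in this record; as a hedged illustration of the general Box-Cox idea it builds on, MASS::boxcox can profile a power transform for a shifted covariate (the shift being one common guard against zero inflation; all names here are invented):

```r
library(MASS)

set.seed(1)
x   <- rexp(200)                           # skewed, zero-heavy covariate
mmi <- 2 * sqrt(x) + rnorm(200, sd = 0.1)  # response related nonlinearly to x

# Profile the Box-Cox log-likelihood over lambda for the shifted covariate
bc     <- boxcox(lm(I(x + 0.5) ~ mmi), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the estimated power transform to linearize the relationship
x_t <- if (abs(lambda) < 1e-8) log(x + 0.5) else ((x + 0.5)^lambda - 1) / lambda
```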

  3. NOAA R/V Ron Brown Fourier Transform Infrared Spectroscopy (FTIR) Data

    • data.ucar.edu
    ascii
    Updated Dec 26, 2024
    Cite
    Lynn Russell (2024). NOAA R/V Ron Brown Fourier Transform Infrared Spectroscopy (FTIR) Data [Dataset]. http://doi.org/10.26023/87N8-35T6-RE0C
    Available download formats: ascii
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Lynn Russell
    Time period covered
    Oct 21, 2008 - Nov 29, 2008
    Description

    This file contains the Fourier Transform Infrared Spectroscopy (FTIR) data collected aboard the NOAA R/V Ronald H. Brown during VOCALS-REx 2008.

  4. Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +2 more
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
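    The registry's actual lookup table ships with the pipeline; a minimal sketch of the key-value remapping idea, with a few invented rows:

```r
library(dplyr)

# Hypothetical slice of the lookup table (the real one covers ~300 subtypes)
lab_lookup <- data.frame(
  raw_name = c("Potassium", "Potassium-External",
               "Potassium(POC)", "Potassium,whole-bld"),
  dd_code  = "potassium"
)

# Hypothetical bulk EHR pull
labs <- data.frame(
  raw_name = c("Potassium(POC)", "Potassium-External", "UnknownLab"),
  value    = c(4.1, 3.8, 7.0),
  unit     = "mmol/L"
)

# Remap subtype names to the Data Dictionary code; labs not in the
# lookup table (hence not pre-defined by the registry DD) are dropped
labs_remapped <- inner_join(labs, lab_lookup, by = "raw_name")
```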

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry in each field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines such as eLAB are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field codes, formats, and relationships in the database are uniform across sites, allowing simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
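    A minimal sketch of a univariable Cox model of the kind described, using the survival package and invented variable names:

```r
library(survival)

# Hypothetical analysis frame: one baseline lab value per patient
df <- data.frame(
  os_months = c(12, 30, 7, 45, 22),
  death     = c(1, 0, 1, 0, 1),   # 1 = death event, 0 = censored
  potassium = c(4.1, 3.8, 5.2, 4.4, 4.9)
)

# Univariable Cox proportional hazards model for a single lab predictor
fit <- coxph(Surv(os_months, death) ~ potassium, data = df)
summary(fit)
```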

  5. Solar self-sufficient households as a driving factor for sustainability transformation

    • service.tib.eu
    Updated Nov 14, 2024
    Cite
    (2024). Solar self-sufficient households as a driving factor for sustainability transformation - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-solar-self-sufficient-households-as-a-driving-factor-for-sustainability-transformation
    Dataset updated
    Nov 14, 2024
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    To get the consumption model from Section 3.1, execute the file consumption_data.R: load the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transform the data, and build the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.

    To get the results for the optimization problem, execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, ...).

    To reproduce the figures, execute the file visualize_results.R, which provides the plotting functions.

    To calculate the solar radiation that is needed in the Section Production Data, follow the file calculate_total_radiation.R.

    To reproduce the radiation data from ERA5, found in data.zip, do the following steps:
    1. Download the ERA5 reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
    2. Convert GRIB to csv with the file era5toGRID.sh.
    3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.

  6. Transformations in PubChem - Full Dataset

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Zhang, Jian (Jeff) (2025). Transformations in PubChem - Full Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5644560
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Cheng, Tiejun
    Schymanski, Emma
    Thiessen, Paul
    Bolton, Evan
    Blanke, Gerd
    Zhang, Jian (Jeff)
    Helmus, Rick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an archive of the data contained in the "Transformations" section in PubChem for integration into patRoon and other workflows.

    For further details see the ECI GitLab site: README and main "tps" folder.

    Credits:

    Concepts: E Schymanski, E Bolton, J Zhang, T Cheng;

    Code (in R): E Schymanski, R Helmus, P Thiessen

    Transformations: E Schymanski, J Zhang, T Cheng and many contributors to various lists!

    PubChem infrastructure: PubChem team

    Reaction InChI (RInChI) calculations (v1.0): Gerd Blanke (previous versions of these files)

    Acknowledgements: ECI team who contributed to related efforts, especially: J. Krier, A. Lai, M. Narayanan, T. Kondic, P. Chirsir, E. Palm. All contributors to the NORMAN-SLE transformations!

    In March 2025 this was released as v0.2.0, since the dataset grew by >3000 entries. The stats as of 14 March 2025:

    Unique Transformation Entries: 10904
    Unique Reactions by CID: 9152
    Unique Reactions by IK: 9139
    Unique Reactions by IKFB: 8574
    Unique NORMAN-SLE Compounds by CID: 8207
    Unique ChEMBL Compounds by CID: 1419
    Unique Compounds (all) by CID: 9267
    Unique Predecessors (all) by CID: 3724
    Unique Successors (all) by CID: 7331
    Range of XlogP Differences: -9.9 to 10
    Range of Mass Differences: -957.97490813 to 820.227106427

  7. Data from: Solar self-sufficient households as a driving factor for sustainability transformation

    • data.uni-hannover.de
    .zip, r, rdata +2
    Updated Dec 12, 2024
    Cite
    Institut für Kartographie und Geoinformatik (2024). Solar self-sufficient households as a driving factor for sustainability transformation [Dataset]. https://data.uni-hannover.de/eu/dataset/19503682-5752-4352-97f6-511ae31d97df
    Available download formats: rdata(426), rdata(1024592), r(21968), txt(1431), rdata(408277), text/x-sh(183), .zip, r(63854), r(24773), r(3406), r(6280)
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    To get the consumption model from Section 3.1, execute the file consumption_data.R: load the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transform the data, and build the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.

    To get the results for the optimization problem, execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, ...).

    To reproduce the figures, execute the file visualize_results.R, which provides the plotting functions.

    To calculate the solar radiation that is needed in the Section Production Data, follow the file calculate_total_radiation.R.

    To reproduce the radiation data from ERA5, found in data.zip, do the following steps:
    1. Download the ERA5 reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
    2. Convert GRIB to csv with the file era5toGRID.sh.
    3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
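    For orientation, loading the files named above looks like this (the object names created by the .Rdata file are not documented in this record, hence the ls() call):

```r
# Raw per-phase consumption inputs used by consumption_data.R
pl1 <- read.csv("./data/CONSUMPTION/PL1.csv")
pl2 <- read.csv("./data/CONSUMPTION/PL2.csv")
pl3 <- read.csv("./data/CONSUMPTION/PL3.csv")

# Prepared consumption data, one file per year
load("./data/CONSUMPTION/MEGA_CONS_list.Rdata")
ls()  # inspect which objects the archive provides
```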

  8. Data and Code for "Climate impacts and adaptation in US dairy systems 1981-2018"

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 22, 2021
    + more versions
    Cite
    Maria Gisbert-Queral; Arne Henningsen; Bo Markussen; Meredith T. Niles; Ermias Kebreab; Angela J. Rigden; Nathaniel D. Mueller (2021). Data and Code for "Climate impacts and adaptation in US dairy systems 1981-2018" [Dataset]. http://doi.org/10.5281/zenodo.4818011
    Available download formats: zip
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Gisbert-Queral; Arne Henningsen; Bo Markussen; Meredith T. Niles; Ermias Kebreab; Angela J. Rigden; Nathaniel D. Mueller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.
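    A one-time setup sketch for installing the add-on packages listed above ("grid" ships with base R and is filtered out by the setdiff; some of the others may since have moved to the CRAN archive):

```r
pkgs <- c("car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra",
          "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr",
          "pracma", "quadprog", "readxl", "sandwich", "tidyr",
          "usfertilizer", "usmap")

# Install whatever is not already present
install.packages(setdiff(pkgs, rownames(installed.packages())))
```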

    This data and code archive contains the following files and folders:

    * README
    Description: text file with this description

    * flowchart.pdf
    Description: a PDF file with a flow chart that illustrates how R scripts transform the raw data files to files that contain generated data sets and intermediate results and, finally, to the tables and figures that are presented in the paper.

    * runAll.sh
    Description: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell such as most computers with MacOS, GNU/Linux, or Unix operating systems)

    * Folder "DataRaw"
    Description: folder for raw data files
    This folder contains the following files:

    - DataRaw/COWS.xlsx
    Description: MS-Excel file with the number of cows per county
    Source: USDA NASS Quickstats
    Observations: All available counties and years from 2002 to 2012

    - DataRaw/milk_state.xlsx
    Description: MS-Excel file with average monthly milk yields per cow
    Source: USDA NASS Quickstats
    Observations: All available states from 1981 to 2018

    - DataRaw/TMAX.csv
    Description: CSV file with daily maximum temperatures
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018

    - DataRaw/VPD.csv
    Description: CSV file with daily maximum vapor pressure deficits
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018

    - DataRaw/countynamesandID.csv
    Description: CSV file with county names, state FIPS codes, and county FIPS codes
    Source: US Census Bureau
    Observations: All counties

    - DataRaw/statecentroids.csv
    Description: CSV file with latitudes and longitudes of state centroids
    Source: Generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function
    Observations: All states

    * Folder "DataGenerated"
    Description: folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).

    * Folder "Results"
    Description: folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).

    * Folder "Figures"
    Description: folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.

    * Folder "Tables"
    Description: folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.

    * Folder "logFiles"
    Description: the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.

    * PrepareCowsData.R
    Description: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses

    * PrepareWeatherData.R
    Description: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses

    * PrepareMilkData.R
    Description: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses

    * CalcFrequenciesTHI_Temp.R
    Description: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state

    * CalcAvgTHI.R
    Description: R script that calculates the average THI in each state

    * PreparePanelTHI.R
    Description: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins

    * PreparePanelTemp.R
    Description: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins

    * PreparePanelFinal.R
    Description: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses

    * EstimateTrendsTHI.R
    Description: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set

    * EstimateModels.R
    Description: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications

    * CalcCoefStateYear.R
    Description: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification

    * SearchWeightMonths.R
    Description: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term

    * TestModelSpec.R
    Description: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10

    * CreateFigure1a.R
    Description: R script that creates subfigure a of Figure 1

    * CreateFigure1b.R
    Description: R script that creates subfigure b of Figure 1

    * CreateFigure2a.R
    Description: R script that creates subfigure a of Figure 2

    * CreateFigure2b.R
    Description: R script that creates subfigure b of Figure 2

    * CreateFigure2c.R
    Description: R script that creates subfigure c of Figure 2

    * CreateFigure3.R
    Description: R script that creates the subfigures of Figure 3

    * CreateFigure4.R
    Description: R script that creates the subfigures of Figure 4

    * CreateFigure5_TableS6.R
    Description: R script that creates the subfigures of Figure 5 and Table S6

    * CreateFigureS1.R
    Description: R script that creates Figure S1

    * CreateFigureS2.R
    Description: R script that creates Figure S2

    * CreateTableS2_S3_S7.R
    Description: R script that creates Tables S2, S3, and S7

    * CreateTableS4_S5.R
    Description: R script that creates Tables S4 and S5

    * CreateTableS8.R
    Description: R script that creates Table S8

    * CreateTableS9.R
    Description: R script that creates Table S9

  9. Data from: Data and scripts associated with a manuscript investigating dissolved organic matter and microbial community linkages across seven globally distributed rivers

    • osti.gov
    Updated Feb 20, 2024
    + more versions
    Cite
    Arnon, Shai; Bar-Zeev, Edo; Borton, Mikayla A.; Brooks, Scott; Chu, Rosalie; Danczak, Robert E.; Garayburu-Caruso, Vanessa A.; Goldman, Amy E.; Graham, Emily B.; Jones, Michael; Jones, Nikki; Lewandowski, Jorg; Meile, Christof; Morad, Joseph W.; Muller, Birgit M.; Powers-McCormack, Beck; Renteria, Lupita; Schalles, John; Schulz, Hanna; Stegen, James C.; Toyoda, Jason G.; Ward, Adam; Wells, Jacqueline R. (2024). Data and scripts associated with a manuscript investigating dissolved organic matter and microbial community linkages across seven globally distributed rivers [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2319037
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
    Area covered: 33.1805,35.6156 | 52.4764,13.6257 | 44.2065,-122.2566 | 31.3346,-81.4793 | 46.373,-119.272 | 46.7386,-121.9181 | 35.9662,-84.3584
    Authors
    Arnon, Shai; Bar-Zeev, Edo; Borton, Mikayla A.; Brooks, Scott; Chu, Rosalie; Danczak, Robert E.; Garayburu-Caruso, Vanessa A.; Goldman, Amy E.; Graham, Emily B.; Jones, Michael; Jones, Nikki; Lewandowski, Jorg; Meile, Christof; Morad, Joseph W.; Muller, Birgit M.; Powers-McCormack, Beck; Renteria, Lupita; Schalles, John; Schulz, Hanna; Stegen, James C.; Toyoda, Jason G.; Ward, Adam; Wells, Jacqueline R.
    Description

    This data package is associated with the publication “Meta-metabolome ecology reveals that geochemistry and microbial functional potential are linked to organic matter development across seven rivers” submitted to Science of the Total Environment. This data package includes the data necessary to replicate the analyses presented within the manuscript to investigate dissolved organic matter (DOM) development across broad spatial distances and within divergent biomes. Specifically, we included the Fourier transform ion cyclotron mass spectrometry (FTICR-MS) data, geochemistry data, annotated metagenomic data, and results from ecological null modeling analyses in this data package. Additionally, we included the scripts necessary to generate the figures from the manuscript. Complete metagenomic data associated with this data package can be found at the National Center for Biotechnology Information (NCBI) under Bioproject PRJNA946291. This dataset consists of (1) four folders; (2) a file-level metadata (flmd) file; (3) a data dictionary (dd) file; (4) a factor sheet describing samples; and (5) a readme. The FTICR Data folder contains (1) the processed Fourier transform ion cyclotron mass spectrometry (FTICR-MS) data; (2) a transformation-weighted characteristics dendrogram generated from the FTICR-MS data; and (3) the script used to generate all FTICR-MS related figures. The Geochemical Data folder contains (1) the single geochemistry data file and (2) the R script responsible for generating associated figures. The Metagenomic Data folder contains (1) annotation information across different levels; (2) carbohydrate active enzyme (CAZyme) information from the dbCAN database (Yin et al., 2012); (3) phylogenetic tree data (FASTAs, alignments, and tree file); and (4) the scripts necessary to analyze all of these data and generate figures. The Null Modeling Data folder contains (1) data generated during null modeling for each river and all rivers combined and (2) the R scripts necessary to process the data. All files are .csv, .pdf, .tsv, .tre, .faa, .afa, .tree, or .R.

  10. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
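    For concreteness, the statistic in question falls out of a k-means fit in base R; a small sketch, including the "stretching" effect the note describes (toy data, not the deposited dataset):

```r
set.seed(42)
x <- matrix(rnorm(200 * 2), ncol = 2)

km <- kmeans(x, centers = 3, nstart = 20)

# R^2 for clustering: proportion of total variance explained by cluster means
km$betweenss / km$totss

# Linearly "stretching" one axis inflates R^2 without better clusters
x_stretched <- x %*% diag(c(10, 1))
km2 <- kmeans(x_stretched, centers = 3, nstart = 20)
km2$betweenss / km2$totss
```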

  11. Data from: Copula Graphical Models for Heterogeneous Mixed Data

    • tandf.figshare.com
    pdf
    Updated Jan 16, 2024
    Cite
    Sjoerd Hermes; Joost van Heerwaarden; Pariya Behrouzi (2024). Copula Graphical Models for Heterogeneous Mixed Data [Dataset]. http://doi.org/10.6084/m9.figshare.24756095.v2
    Available download formats: pdf
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Sjoerd Hermes; Joost van Heerwaarden; Pariya Behrouzi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the iid assumption is unrealistic, and fitting a single graphical model on all data results in a network that does not accurately represent the between-group differences. In addition, real-world observational data is typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, which leads to the model being unable to adequately recover the underlying graph structure. These two problems are solved, respectively, by fitting a different graph for each group (applying the fused group penalty to fuse similar graphs together) and by treating the observed data as transformed latent Gaussian data. The proposed model outperforms related models on learning partial correlations in a simulation study. Finally, the proposed model is applied to real on-farm maize yield data, showcasing the added value of the proposed method in generating new production-ecological hypotheses. An R package containing the proposed methodology can be found on https://CRAN.R-project.org/package=heteromixgm. Supplementary materials for this article are available online.

  12. Data and R-scripts for "Land-use trajectories for sustainable land system transformations: identifying leverage points in a global biodiversity hotspot" (V2)

    • data.niaid.nih.gov
    Updated Oct 14, 2021
    Cite
    Dominic A. Martin (2021). Data and R-scripts for "Land-use trajectories for sustainable land system transformations: identifying leverage points in a global biodiversity hotspot" (V2) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4601599
    Dataset updated
    Oct 14, 2021
    Dataset authored and provided by
    Dominic A. Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sustainable land system transformations are necessary to avert biodiversity and climate collapse. However, it remains unclear where entry points for transformations exist in complex land systems. Here, we conceptualize land systems along land-use trajectories, which allows us to identify and evaluate leverage points; i.e., entry points on the trajectory where targeted interventions have particular leverage to influence land-use decisions. We apply this framework in the biodiversity hotspot Madagascar. In the Northeast, smallholder agriculture results in a land-use trajectory originating in old-growth forests, spanning forest fragments, and reaching shifting hill rice cultivation and vanilla agroforests. Integrating interdisciplinary empirical data on seven taxa, five ecosystem services, and three measures of agricultural productivity, we assess trade-offs and co-benefits of land-use decisions at three leverage points along the trajectory. These trade-offs and co-benefits differ between leverage points: two leverage points are situated at the conversion of old-growth forests and forest fragments to shifting cultivation and agroforestry, resulting in considerable trade-offs, especially between endemic biodiversity and agricultural productivity. Here, interventions enabling smallholders to conserve forests are necessary. This is urgent since ongoing forest loss threatens to eliminate these leverage points due to path-dependency. The third leverage point allows for the restoration of land under shifting cultivation through vanilla agroforests and offers co-benefits between restoration goals and agricultural productivity. The co-occurring leverage points highlight that conservation and restoration are simultaneously necessary. Methodologically, the framework shows how leverage points can be identified, evaluated, and harnessed for land system transformations under the consideration of path-dependency along trajectories.

  13. Intervals for the correlation coefficients between R and G channels, r(1), G and B channels, r(2), and R and B channels, r(3), involving approximately 68% and 95% of the images in the data set

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Laura Rebollo-Neira; Aurelien Inacio (2023). Intervals for the correlation coefficients between R and G channels, r(1), G and B channels, r(2), and R and B channels, r(3), involving approximately 68% and 95% of the images in the data set. [Dataset]. http://doi.org/10.1371/journal.pone.0279917.t002
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Laura Rebollo-Neira; Aurelien Inacio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Intervals for the correlation coefficients between R and G channels, r(1), G and B channels, r(2), and R and B channels, r(3), involving approximately 68% and 95% of the images in the data set.

  14. Data From: Influence of Hydrological Perturbations and Riverbed Sediment Characteristics on Hyporheic Zone Respiration of CO2 and N-2

    • knb.ecoinformatics.org
    • dataone.org
    • +2 more
    Updated Apr 6, 2023
    Cite
    Michelle Newcomer; Susan Hubbard (2023). Data From: Influence of Hydrological Perturbations and Riverbed Sediment Characteristics on Hyporheic Zone Respiration of CO2 and N-2, Journal of Geophysical Research-Biogeosciences [Dataset]. http://doi.org/10.21952/WTR/1508398
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Michelle Newcomer; Susan Hubbard
    Time period covered
    May 1, 2012 - Aug 1, 2016
    Description

    This data package contains pumping data (.txt), parameter matrices, and R code (.R, .RData) to perform bootstrapping for parameter selection for the bioclogging model development. The pumping data were collected from the Russian River Riverbank Filtration site located in Sonoma County, California from 2010-2017, from three riverbank collection wells located alongside the study site. The pumping data are directly correlated with water table oscillations, so the code performs these correlations and simulates stochastic versions of water table oscillations. See Metadata Description.pdf for full details on dataset production. This dataset must be used with the R programming language. This dataset and R code are associated with the publication "Influence of Hydrological Perturbations and Riverbed Sediment Characteristics on Hyporheic Zone Respiration of CO2 and N-2". This research was supported by the Jane Lewis Fellowship from the University of California, Berkeley, the Sonoma County Water Agency (SCWA), the Roy G. Post Foundation Scholarship, the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) Program, U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-AC02-05CH11231, and the UFZ-Helmholtz Centre for Environmental Research, Leipzig, Germany.
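    The archive's own bootstrap code is not shown in this record; a generic sketch of bootstrapping a fitted parameter in R, on toy data rather than the Russian River series:

```r
library(boot)

set.seed(7)
# Hypothetical data: oscillation amplitude vs. pumping rate
dat <- data.frame(pumping = runif(100, 0, 10))
dat$amplitude <- 0.8 * dat$pumping + rnorm(100)

# Statistic to bootstrap: the slope of a simple linear fit
slope_fn <- function(d, idx) coef(lm(amplitude ~ pumping, data = d[idx, ]))[2]

b <- boot(dat, slope_fn, R = 1000)
boot.ci(b, type = "perc")  # percentile confidence interval for the slope
```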

  15. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. The given dataset contains a retailer's transaction data: it records all the transactions that have happened over a period of time. The retailer will use the results to grow in its industry and to provide customers with itemset suggestions, so we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem with association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rule mining is most useful when you want to find associations between different objects in a set. It works well for finding frequent patterns in a transaction database: it can tell you which items customers frequently buy together, allowing the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
    • lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
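    The same arithmetic, checked in R:

```r
n       <- 100
p_mouse <- 10 / n  # bought a computer mouse
p_mat   <-  9 / n  # bought a mouse mat
p_both  <-  8 / n  # bought both

support    <- p_both               # 0.08
confidence <- support / p_mouse    # 0.80
lift       <- confidence / p_mat   # ~8.9
c(support = support, confidence = confidence, lift = lift)
```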

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Below I briefly describe each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


    Next, we will clean our data frame by removing missing values.


    To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
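    The remainder of the walkthrough is truncated above; as a self-contained sketch of the conversion-and-mining steps it describes (toy itemsets standing in for the retail invoices):

```r
library(arules)

# Toy transactions: each element is the itemset of one invoice
trans <- as(list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread", "milk"),
  c("butter", "jam")
), "transactions")

# Mine association rules with minimum support/confidence thresholds
rules <- apriori(trans,
                 parameter = list(supp = 0.25, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))
```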

  16. C2Metadata test files

    • openicpsr.org
    spss, zip
    Updated Aug 16, 2020
    Cite
    George Alter (2020). C2Metadata test files [Dataset]. http://doi.org/10.3886/E120642V1
    Available download formats: spss, zip
    Dataset updated
    Aug 16, 2020
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    George Alter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The C2Metadata (“Continuous Capture of Metadata”) Project automates one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. Scripts used with statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogs and codebooks and to create “variable lineages” for auditing software operations. This repository provides examples of scripts and metadata for use in testing C2Metadata tools.

  17. BIOGRID CURATED DATA FOR PUBLICATION: M-Ras/R-Ras3, a transforming ras protein regulated by Sos1, GRF1, and p120 Ras GTPase-activating protein, interacts with the putative Ras effector AF6.

    • thebiogrid.org
    zip
    Updated Aug 20, 1999
    Cite
    BioGRID Project (1999). BIOGRID CURATED DATA FOR PUBLICATION: M-Ras/R-Ras3, a transforming ras protein regulated by Sos1, GRF1, and p120 Ras GTPase-activating protein, interacts with the putative Ras effector AF6. [Dataset]. https://thebiogrid.org/5251/publication/m-rasr-ras3-a-transforming-ras-protein-regulated-by-sos1-grf1-and-p120-ras-gtpase-activating-protein-interacts-with-the-putative-ras-effector-af6.html
    Available download formats: zip
    Dataset updated
    Aug 20, 1999
    Dataset authored and provided by
    BioGRID Project
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Protein-Protein, Genetic, and Chemical Interactions for Quilliam LA (1999):M-Ras/R-Ras3, a transforming ras protein regulated by Sos1, GRF1, and p120 Ras GTPase-activating protein, interacts with the putative Ras effector AF6. curated by BioGRID (https://thebiogrid.org); ABSTRACT: M-Ras is a Ras-related protein that shares approximately 55% identity with K-Ras and TC21. The M-Ras message was widely expressed but was most predominant in ovary and brain. Similarly to Ha-Ras, expression of mutationally activated M-Ras in NIH 3T3 mouse fibroblasts or C2 myoblasts resulted in cellular transformation or inhibition of differentiation, respectively. M-Ras only weakly activated extracellular signal-regulated kinase 2 (ERK2), but it cooperated with Raf, Rac, and Rho to induce transforming foci in NIH 3T3 cells, suggesting that M-Ras signaled via alternate pathways to these effectors. Although the mitogen-activated protein kinase/ERK kinase inhibitor, PD98059, blocked M-Ras-induced transformation, M-Ras was more effective than an activated mitogen-activated protein kinase/ERK kinase mutant at inducing focus formation. These data indicate that multiple pathways must contribute to M-Ras-induced transformation. M-Ras interacted poorly in a yeast two-hybrid assay with multiple Ras effectors, including c-Raf-1, A-Raf, B-Raf, phosphoinositol-3 kinase delta, RalGDS, and Rin1. Although M-Ras coimmunoprecipitated with AF6, a putative regulator of cell junction formation, overexpression of AF6 did not contribute to fibroblast transformation, suggesting the possibility of novel effector proteins. The M-Ras GTP/GDP cycle was sensitive to the Ras GEFs, Sos1, and GRF1 and to p120 Ras GAP. Together, these findings suggest that while M-Ras is regulated by similar upstream stimuli to Ha-Ras, novel targets may be responsible for its effects on cellular transformation and differentiation.

  18. R-scripts for uncertainty analysis v01

    • gimi9.com
    • researchdata.edu.au
    • +2 more
    Updated Apr 13, 2022
    Cite
    (2022). R-scripts for uncertainty analysis v01 [Dataset]. https://gimi9.com/dataset/au_322c38ef-272f-4e77-964c-a14259abe9cf/
    Dataset updated
    Apr 13, 2022
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Abstract

    This dataset was created within the Bioregional Assessment Programme. Data has not been derived from any source datasets. Metadata has been compiled by the Bioregional Assessment Programme. This dataset contains a set of generic R scripts that are used in the propagation of uncertainty through numerical models.

    Dataset History

    The dataset contains a set of R scripts that are loaded as a library. The R scripts are used to carry out the propagation of uncertainty through numerical models. The scripts contain the functions to create the statistical emulators and do the necessary data transformations and backtransformations. The scripts are self-documenting and were created by Dan Pagendam (CSIRO) and Warren Jin (CSIRO).

    Dataset Citation

    Bioregional Assessment Programme (2016) R-scripts for uncertainty analysis v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/322c38ef-272f-4e77-964c-a14259abe9cf.
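    The CSIRO scripts themselves are not reproduced here; a generic sketch of the emulator idea (fit a cheap statistical surrogate to a set of model runs, then push input uncertainty through the surrogate), with all names and distributions invented:

```r
set.seed(3)
# Hypothetical design: sampled inputs and outputs of an expensive model
design <- data.frame(k = runif(50, 0.1, 2), s = runif(50, 0.01, 0.2))
design$output <- with(design, log(k) + 5 * s + rnorm(50, sd = 0.05))

# Cheap emulator of the model response (here a simple polynomial surface)
emu <- lm(output ~ poly(k, 2) + poly(s, 2), data = design)

# Propagate input uncertainty: sample inputs, predict through the emulator
draws <- data.frame(k = rlnorm(10000, 0, 0.3), s = rbeta(10000, 2, 20))
quantile(predict(emu, newdata = draws), c(0.05, 0.5, 0.95))
```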

  19. Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2023
    Cite
    Cresci, Stefano (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6250576
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    Cresci, Stefano
    Trujillo, Amaury
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

    An accompanying R notebook can be found at: https://github.com/amauryt/make_reddit_great_again

    If you use this dataset please cite the related article.

    The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

    The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe's upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are in base 10 (numeric integer), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another, as shown below.
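    For example, base R's strtoi handles the base-36 direction directly (it returns NA if an ID overflows a 32-bit integer, so very long IDs need a big-integer library); the reverse direction is a short helper:

```r
# Reddit/Pushshift IDs are base 36; the databases store them in base 10
strtoi("9z1ab", base = 36L)  # alphanumeric ID -> integer

# Base 10 -> base 36 (hand-rolled sketch; base R has no built-in)
to_base36 <- function(n) {
  digits <- c(0:9, letters)
  out <- ""
  while (n > 0) {
    out <- paste0(digits[n %% 36 + 1], out)
    n <- n %/% 36
  }
  out
}
to_base36(strtoi("9z1ab", base = 36L))  # round-trips to "9z1ab"
```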

    The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wise (i.e., within and without the subreddit) during the dataset timeframe. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.

    The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

    A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored in a Likert-like manner. One can associate submissions with MBFC scores by doing a join on the domain column.
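    A sketch of that join, assuming the submissions table carries a domain column as described (the other column names are illustrative):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "core_the_donald.sqlite")
submissions <- dbReadTable(con, "submissions")
dbDisconnect(con)

mbfc <- read.csv("mbfc_scores.csv")

# Attach bias / factual-reporting scores by shared domain
scored <- left_join(submissions, mbfc, by = "domain")
```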

  20. Data applied to automatic method to transform routine otolith images for a standardized otolith database using R

    • seanoe.org
    image/*
    Updated 2022
    Cite
    Nicolas Andrialovanirina; Alizee Hache; Kelig Mahe; Sébastien Couette; Emilie Poisson Caillault (2022). Data applied to automatic method to transform routine otolith images for a standardized otolith database using R [Dataset]. http://doi.org/10.17882/91023
    Available download formats: image/*
    Dataset updated
    2022
    Dataset provided by
    SEANOE
    Authors
    Nicolas Andrialovanirina; Alizee Hache; Kelig Mahe; Sébastien Couette; Emilie Poisson Caillault
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fisheries management is generally based on age structure models. Thus, fish ageing data are collected by experts who analyze and interpret calcified structures (scales, vertebrae, fin rays, otoliths, etc.) according to a visual process. The otolith, in the inner ear of the fish, is the most commonly used calcified structure because it is metabolically inert and historically one of the first proxies developed. It contains information throughout the whole life of the fish and provides age structure data for stock assessments of all commercial species. The traditional human reading method to determine age is very time-consuming. Automated image analysis can be a low-cost alternative method; however, the first step is the transformation of routinely taken otolith images into standardized images within a database, so that machine learning techniques can be applied to the ageing data. Otolith shape, resulting from the synthesis of genetic heritage and environmental effects, is a useful tool to identify stock units, therefore a database of standardized images could be used for this aim. Using the routinely measured otolith data of plaice (Pleuronectes platessa; Linnaeus, 1758) and striped red mullet (Mullus surmuletus; Linnaeus, 1758) in the eastern English Channel and north-east Arctic cod (Gadus morhua; Linnaeus, 1758), a greyscale image matrix was generated from the raw images in different formats. Contour detection was then applied to identify broken otoliths, the orientation of each otolith, and the number of otoliths per image. To finalize this standardization process, all images were resized and binarized. Several mathematical morphology tools were developed from these new images to align and orient the images, placing the otoliths in the same layout for each image. For this study, we used three databases from two different laboratories covering three species (cod, plaice and striped red mullet). The method was validated on these three species and could be applied to other species for age determination and stock identification.
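    The paper's full pipeline is not reproduced here; a sketch of the early standardization steps (greyscale, binarize, resize) using the imager package, with a placeholder file name:

```r
library(imager)

img  <- load.image("otolith_001.jpg")  # placeholder file name
grey <- grayscale(img)

# Binarize with an automatic threshold, then resize to a common frame
bin <- as.cimg(threshold(grey))
std <- resize(bin, size_x = 512, size_y = 512)

plot(std)
```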

