License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
It is always a struggle to find suitable datasets with which to teach, especially across domain expertise. Many packages include data, but finding those packages and knowing what is in them is difficult due to inadequate documentation. Here we have compiled a searchable database of dataset metadata taken from R packages on CRAN.
Packages for the R programming language often include datasets. This dataset collects information on those datasets to make them easier to find.
Rdatasets is a collection of 1072 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
This dataset was compiled by Vincent Arel-Bundock (@vincentarelbundock on GitHub). The version here was taken from GitHub on July 11, 2017 and is not actively maintained.
In addition to helping find a specific dataset, this dataset can help answer questions about what data is included in R packages. Are specific topics very popular or unpopular? How big are the datasets included in R packages? What are the naming conventions/trends for packages that include data? What are the naming conventions/trends for datasets included in packages?
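As an illustration, the index of this database can be queried directly from R. A minimal sketch, assuming the CSV index URL and column names (Package, Item, Title) published by the Rdatasets project:

# Sketch: load the Rdatasets metadata index and search it by topic.
index <- read.csv("https://vincentarelbundock.github.io/Rdatasets/datasets.csv",
                  stringsAsFactors = FALSE)

# e.g., find datasets whose title mentions "regression"
hits <- subset(index, grepl("regression", Title, ignore.case = TRUE))
hits[, c("Package", "Item", "Title")]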
This dataset is licensed under the GNU General Public License.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A list of the functions and associated descriptions for the top 10 most downloaded R packages in ecology and evolution. The R packages 'packagefinder' and 'dlstats' were used to compile these rankings and descriptions. Code is published on Zenodo: https://zenodo.org/account/settings/github/repository/cjlortie/R_package_chooser_checklist
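A minimal sketch of how such a ranking can be compiled, assuming the documented entry points packagefinder::findPackage() and dlstats::cran_stats(); exact arguments and output columns may differ between versions:

# Sketch: search CRAN by keyword, then pull monthly download statistics.
library(packagefinder)
library(dlstats)

findPackage("ecology")                  # keyword search over CRAN metadata
stats <- cran_stats(c("vegan", "ape"))  # monthly download counts per package
head(stats)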
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The Comprehensive R Archive Network (CRAN) is the central repository for software packages in the powerful R programming language for statistical computing. It describes itself as "a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R." If you install an R package in the standard way, it is served to you by one of the CRAN mirrors.
The ecosystem of R packages continues to grow at an accelerated pace, covering a multitude of aspects of statistics, machine learning, data visualisation, and many other areas. This dataset provides monthly updates of all the packages available through CRAN, as well as their release histories. Explore the evolution of the R multiverse and all of its facets through this comprehensive data.
I'm providing two CSV tables that describe the current set of R packages on CRAN, as well as the version history of these packages. To derive the data, I made use of the fantastic functionality of the tools package, via the CRAN_package_db function, and the equally wonderful packageRank package and its packageHistory function. The results from those functions were slightly adjusted and formatted. I might add further related tables over time.
See the associated blog post for how the data was derived, and for some ideas on how to explore this dataset.
These are the tables contained in this dataset:
cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package. (At the time of the creation of this Kaggle dataset there were a few packages with 2 entries and different dependencies. Feel free to contribute some EDA investigating those.) Packages are listed in alphabetical order according to their names.
cran_package_history.csv: version history of virtually all packages in the previous table. This table has one row for each combination of package name and version number, which in most cases leads to multiple rows per package. Packages are listed in alphabetical order according to their names.
I will update this dataset on a roughly monthly cadence by checking which packages have a newer version in the overview table, and then replacing the affected entries with the latest data.
Table cran_package_overview.csv: I decided to simplify the large number of columns provided by CRAN and tools::CRAN_package_db into a smaller set of more focused features. All columns are formatted as strings, except for the boolean feature needs_compilation; the date_published column can be parsed as a ymd date:
- package: package name following the official spelling and capitalisation. Table is sorted alphabetically according to this column.
- version: current version.
- depends: the packages this package depends on.
- imports: the packages this package imports.
- licence: the licence under which the package is distributed (e.g. GPL versions).
- needs_compilation: boolean feature describing whether the package needs to be compiled.
- author: package author.
- bug_reports: where to send bugs.
- url: where to read more.
- date_published: when the current version of the package was published. Note: this is not the date of the initial package release. See the package history table for that.
- description: relatively detailed description of what the package is doing.
- title: the title and tagline of the package.

Table cran_package_history.csv: The output of packageRank::packageHistory for each package from the overview table. Almost all of them have a match in this table, and can be matched by package and version. All columns are strings, and the date can again be parsed as a ymd date:

- package: package name. Joins to the feature of the same name in the overview table. Table is sorted alphabetically according to this column.
- version: historical or current package version. Also joins. Secondary sorting column within each package name.
- date: when this version was published. Should sort in the same way as the version does.
- repository: on CRAN or in the Archive.

All data is being made publicly available by the Comprehensive R Archive Network (CRAN). I'm grateful to the authors and maintainers of the packages tools and packageRank for providing the functionality to query CRAN packages smoothly and easily.
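For reference, a minimal sketch of the two core calls named above; the linked blog post documents the full pipeline, and the post-processing steps are omitted here:

# Sketch: query current CRAN metadata and one package's release history.
library(tools)
library(packageRank)

overview <- CRAN_package_db()          # data frame, one row per current CRAN package
history <- packageHistory("ggplot2")   # package, version, date, repository
head(overview[, c("Package", "Version", "License")])
head(history)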
The vignette photo is the official logo for the R language © 2016 The R Foundation. You can distribute the logo under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This example dataset is used to illustrate the usage of the R package survtd in the Supplementary Materials of the paper: Moreno-Betancur M, Carlin JB, Brilleman SL, Tanamas S, Peeters A, Wolfe R (2017). Survival analysis with time-dependent covariates subject to measurement error and missing data: Two-stage joint model using multiple imputation (submitted). The data was generated using the simjm function of the package, using the following code:

dat
R packages (~3600) from CRAN with descriptions and categories (https://cran.r-project.org/web/views/) for a multilabel classification task using NLP.
Script for scraping (03/08/2022): https://github.com/MathieuCayssol/ScrapingCRAN
Data format for R_Cran_Packages:

{
  "package_name": {"categories": ["label_1", ..., "label_n"], "description": "string"},
  "package_name": {"categories": ["label_1", ..., "label_n"], "description": "string"},
  ...
}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
License: GNU General Public License v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Data for Hilbe, J.M. 2011. Negative Binomial Regression, 2nd Edition (Cambridge University Press) and Hilbe, J.M. 2014. Modeling Count Data (Cambridge University Press).
Version: 1.3.4
CRAN: https://CRAN.R-project.org/package=COUNT
Mirror: GitHub
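A minimal sketch of accessing these data from R; the data() call prints the authoritative list of datasets bundled with the package:

# Sketch: install COUNT from CRAN and enumerate its bundled datasets.
install.packages("COUNT")
library(COUNT)
data(package = "COUNT")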
License: U.S. Government Works, https://www.usa.gov/government-works
This software code was developed to estimate the probability that individuals found at a geographic location will belong to the same genetic cluster as individuals at the nearest empirical sampling location for which ancestry is known. POPMAPS includes 5 main functions to calculate and visualize these results (see Table 1 for functions and arguments). Population assignment coefficients and a raster surface must be estimated prior to using POPMAPS functions (see Fig. 1a and b). With these data in hand, users can run a jackknife function to choose an optimal parameter combination that best reconstructs the empirical data (Figs. 2 and S2). Pertinent parameters include 1) how many empirical sampling localities should be used to estimate ancestry coefficients and 2) the influence of empirical sites on ancestry coefficient estimation as distance increases (Fig. 2). After choosing these parameters, a user can estimate the entire ancestry probability surface (Fig. 1c and d, Fig. 3). ...
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Mass spectrometry is a powerful tool for identifying and analyzing biomolecules such as metabolites and lipids in complex biological samples. Liquid chromatography and gas chromatography mass spectrometry studies quite commonly involve large numbers of samples, which can require significant time for sample preparation and analyses. To accommodate such studies, the samples are commonly split into batches. Inevitably, variations in sample handling, temperature fluctuation, imprecise timing, column degradation, and other factors result in systematic errors or biases of the measured abundances between the batches. Numerous methods are available via R packages to assist with batch correction for omics data; however, since these methods were developed by different research teams, the algorithms are available in separate R packages, each with different data input and output formats. We introduce the malbacR package, which consolidates 11 common batch effect correction methods for omics data into one place so users can easily implement and compare the following: pareto scaling, power scaling, range scaling, ComBat, EigenMS, NOMIS, RUV-random, QC-RLSC, WaveICA2.0, TIGER, and SERRF. The malbacR package standardizes data input and output formats across these batch correction methods. The package works in conjunction with the pmartR package, allowing users to seamlessly include the batch effect correction in a pmartR workflow without needing any additional data manipulation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Additional file 2:
- Table S1. General information of three real datasets downloaded from TCGA.
- Table S2. Top 20 rules identified from the BRCA mRNA dataset.
- Table S3. Top 20 rules identified from the BRCA DNA methylation dataset.
- Table S4. Top 20 rules identified from the ESCA mRNA dataset.
- Table S5. Top 20 rules identified from the ESCA DNA methylation dataset.
- Table S6. Top 20 rules identified from the LUAD mRNA dataset.
- Table S7. Top 20 rules identified from the LUAD DNA methylation dataset.
- Table S8. Top 20 rules identified from the combined BRCA mRNA and DNA methylation datasets.
- Table S9. Top 20 rules identified from the combined ESCA mRNA and DNA methylation datasets.
- Table S10. Top 20 rules identified from the combined LUAD mRNA and DNA methylation datasets.
Ecological processes and biodiversity patterns are strongly affected by how animals move through the landscape. However, it remains challenging to predict animal movement and space use. Here we present our new R package enerscape to quantify and predict animal movement in real landscapes based on energy expenditure.
Enerscape integrates a general locomotory model for terrestrial animals with GIS tools in order to map energy costs of movement in a given environment, resulting in energy landscapes that reflect how energy expenditures may shape habitat use. Enerscape only requires topographic data (elevation) and the body mass of the studied animal. To illustrate the potential of enerscape, we analyze the energy landscape for the Marsican bear (Ursus arctos marsicanus) in a protected area in central Italy in order to identify least-cost paths and high-connectivity areas with low energy costs of travel.
Enerscape allowed us to identify travel routes for the bear that minimize...
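A hypothetical sketch of such an analysis; the main function name matches the package, but the exact signature and raster backend are assumptions that may differ across enerscape versions:

# Hypothetical sketch: compute an energy landscape from elevation and body mass.
library(enerscape)
library(terra)

dem <- rast("dem.tif")          # elevation raster; illustrative file name
en <- enerscape(dem, m = 140)   # ~140 kg, roughly an adult brown bear
plot(en)                        # map of energy costs of movement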
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Dataset for the paper titled "Self-Admitted Technical Debt in R Packages: An Exploratory Study" (Vidoni, 2021), appearing at: https://2021.msrconf.org/track/msr-2021-technical-papers#Accepted-Papers-
marianna13/R-packages: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
This folder contains all the data used in the manuscript. The file also includes the R code used to estimate the burden of disease, economic losses and regional incidences. Relevant shape files used to generate the maps and associated R code are also provided. (ZIP)
License: GNU Affero General Public License v3.0, http://www.gnu.org/licenses/agpl-3.0.html
Data for Hilbe, J.M. 2015. Practical Guide to Logistic Regression (Chapman and Hall/CRC Press).
Version: 1.3
CRAN: https://CRAN.R-project.org/package=LOGIT (removed)
CRAN archive: https://cran.r-project.org/src/contrib/Archive/LOGIT (archived on 2018-05-10)
Mirror: GitHub
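Because the package was removed from CRAN, plain install.packages() will fail; a minimal sketch of installing the archived release instead, using the remotes package:

# Sketch: install the archived CRAN release of LOGIT.
install.packages("remotes")
remotes::install_version("LOGIT", version = "1.3")
library(LOGIT)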
Functions and data tables for simulation and statistical analysis of chemical toxicokinetics ("TK") as in Pearce et al. (2017). Chemical-specific in vitro data have been obtained from relatively high throughput experiments. Both physiologically-based ("PBTK") and empirical (e.g., one compartment) "TK" models can be parameterized for several hundred chemicals and multiple species. These models are solved efficiently, often using compiled (C-based) code. This dataset is associated with the following publication: Pearce, R., C. Strope, W. Setzer, N. Sipes, and J. Wambaugh (2017). HTTK: R Package for High-Throughput Toxicokinetics. Journal of Statistical Software, 79(4): 1-26.
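A minimal sketch of the kind of simulation the package supports; solve_pbtk() is a documented httk entry point, though the chemical and arguments here are illustrative:

# Sketch: simulate a PBTK time course for one chemical with httk.
library(httk)

out <- solve_pbtk(chem.name = "bisphenol a", days = 5)  # illustrative chemical/duration
head(out)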
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Each R script replicates all of the example code from one chapter of the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
R, a programming language, is an attractive tool for data visualization because it is free and open source. However, learning R can be intimidating and cumbersome for many. In this report, we introduce an R package called "smplot" for easy and elegant data visualization. The R package "smplot" generates graphs with defaults that are visually pleasing and informative. Although it requires basic knowledge of R and ggplot2, it significantly simplifies the process of plotting a bar graph, a violin plot, a correlation plot, a slope chart, a Bland-Altman plot and a raincloud plot. The aesthetics of the plots generated by the package are elegant, highly customisable and adhere to important practices of data visualization. The functions from smplot can be used in a modular fashion, thereby allowing the user to further customise the aesthetics. The smplot package is open source under the MIT license and available on GitHub (https://github.com/smin95/smplot), where updates will be posted. All the example figures in this report are reproducible, and the code and data are provided for the reader in a separate online guide (https://smin95.github.io/dataviz/).
Comparison of selected metrics across R packages using an example dataset.