100+ datasets found
  1. Explore data formats and ingestion methods

    • kaggle.com
    Updated Feb 12, 2021
    Cite
Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset
    Available formats: Croissant, a format for machine-learning datasets (learn more at mlcommons.org/croissant).
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Preda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Why this Dataset

    This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).

    You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that ingest all these formats:

    Iris Dataset

    Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

    Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

    Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

    The file downloaded is iris.data and is formatted as a comma-delimited file.

    This small data collection was created to help you test your skills with ingesting various data formats.

    Content

    This file was processed to convert the data into the following formats:

    • csv - comma separated values format
    • tsv - tab separated values format
    • parquet - parquet format
    • feather - feather format
    • parquet.gzip - compressed parquet format
    • h5 - hdf5 format
    • pickle - Python binary object file - pickle format
    • xlsx - Excel format
    • npy - NumPy (Python library) binary format
    • npz - NumPy (Python library) binary compressed format
    • rds - Rds (R-specific data format) binary format
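
    As a quick sketch of the kind of ingestion test this collection enables, here is how a few of these formats can be read in R; the file names are assumed to follow the pattern iris.<extension>, and the packages named here are common choices rather than necessarily the ones used in the accompanying notebooks:

        # Read the same Iris data from several of the provided formats
        iris_csv <- read.csv("iris.csv")
        iris_tsv <- read.delim("iris.tsv")
        library(arrow)            # parquet and feather
        iris_parquet <- read_parquet("iris.parquet")
        iris_feather <- read_feather("iris.feather")
        library(readxl)           # Excel
        iris_xlsx <- read_excel("iris.xlsx")
        iris_rds <- readRDS("iris.rds")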

    Acknowledgements

    I would like to acknowledge the work of the creator of the dataset, R. A. Fisher, and of the donor, Michael Marshall.

    Inspiration

    Use these data formats to test your skills in ingesting data in various formats.

  2. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication: Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  3. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
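
    For readers who want to run the protocol end to end, here is a minimal self-contained sketch of the kind of script the protocol describes, assuming the three-column input from Step 1 (Replicate, Condition, Value); the original script distributed on the PowerPoint slide may differ in its details:

        # Load ggplot2 (see Note 1)
        library(ggplot2)
        # Step 2: choose the input .csv file via a dialog box and read it
        data <- read.csv(file.choose())
        # Categorical scatterplot: boxplots with jittered dots coloured by replicate
        graph <- ggplot(data, aes(x=Condition, y=Value))
        graph + geom_boxplot(outlier.colour='black', colour='black') +
          geom_jitter(aes(col=Replicate)) + theme_bw()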

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  4. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.

  5. Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research

    • figshare.com
    txt
    Updated Dec 4, 2023
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Available download formats: txt
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    figshare
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is open-source software and an object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.

    For all intents and purposes, this book serves as both textbook and manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. It is, in short, a congruence of statistics and computer programming for research.

  6. R object containing study data in Phyloseq format

    • figshare.com
    application/gzip
    Updated Apr 26, 2023
    Cite
    Michael Cox (2023). R object containing study data in Phyloseq format [Dataset]. http://doi.org/10.6084/m9.figshare.22702000.v1
    Available download formats: application/gzip
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Michael Cox
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R object containing OTU tables and metadata from throat swabs of children in Ecuador.
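
    A minimal sketch for loading such an object in R (the file name is hypothetical, and this assumes the object was saved with saveRDS; if it was saved with save(), use load() instead):

        library(phyloseq)  # Bioconductor package
        ps <- readRDS("study_data_phyloseq.rds")
        # Inspect the OTU table and sample metadata bundled in the phyloseq object
        otu_table(ps)
        sample_data(ps)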

  7. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
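
    As an illustration of this kind of fuzzy title matching, here is a minimal sketch using the stringdist R package; the titles are invented, and the actual packages, thresholds, and column names used in “r_2_scrape_matches” may differ:

        library(stringdist)
        # Hypothetical titles: one from the core dataset, three scraped IMDb candidates
        core_title <- "The Grand Voyage"
        candidates <- c("The Grand Voyage", "Grand Voyage, The", "The Grnad Voyage")
        # Cosine similarity on character 2-grams catches strongly similar titles
        1 - stringdist(core_title, candidates, method = "cosine", q = 2)
        # OSA (optimal string alignment) tolerates typos and transpositions
        stringdist(core_title, candidates, method = "osa")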

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  8. Census block internal point coordinates and weights formatted specifically for use in R code of the Environmental Justice Analysis Multisite (EJAM) tool

    • catalog.data.gov
    Updated Sep 8, 2023
    Cite
    OP,ORPM (2023). Census block internal point coordinates and weights formatted specifically for use in R code of the Environmental Justice Analysis Multisite (EJAM) tool, USA, 2020, EPA, EPA AO OP ORPM [Dataset]. https://catalog.data.gov/dataset/census-block-internal-point-coordinates-and-weights-formatted-specifically-for-use-in-r-co
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    OP,ORPM
    Area covered
    United States
    Description

    This is Census 2020 block data specifically formatted for use by the Environmental Protection Agency's (EPA) in-development Environmental Justice Analysis Multisite (EJAM) tool, which uses R code to find which block centroids are within X miles of each specified point (e.g., a regulated facility), and to find those distances. The datasets have the latitude and longitude of each block's internal point, as provided by the Census Bureau, and the FIPS code of the block and its parent block group. The datasets also include a weight for each block, representing this block's Census 2020 population count as a fraction of the count for the parent block group overall, for use in estimating how much of a given block group is within X miles of a specified point or inside a polygon of interest. The datasets also have an effective radius of each block, which is what the radius would be in miles if the block covered the same area in square miles but were circular. The datasets also have coordinates in units that facilitate building a quadtree index of locations. They are in R data.table format, saved as .rda or .arrow files to be read by R code.
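
    As a sketch of how such a file might be used, here is a minimal example of finding block centroids within X miles of a point; the file name, restored object name, and column names are hypothetical, so consult the dataset documentation for the actual ones:

        library(data.table)
        load("ejam_blocks.rda")  # hypothetical .rda file restoring a data.table 'blocks'
        site_lat <- 40.0; site_lon <- -75.0; radius_miles <- 3
        # Great-circle distance in miles via the haversine formula
        hav <- function(lat1, lon1, lat2, lon2) {
          p <- pi / 180  # degrees to radians
          a <- sin((lat2 - lat1) * p / 2)^2 +
            cos(lat1 * p) * cos(lat2 * p) * sin((lon2 - lon1) * p / 2)^2
          2 * 3959 * asin(sqrt(a))  # 3959 = Earth radius in miles
        }
        blocks[, dist_miles := hav(lat, lon, site_lat, site_lon)]
        # Share of each parent block group within the radius, using the block weights
        blocks[dist_miles <= radius_miles, .(share = sum(blockwt)), by = blockgroup_fips]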

  9. Code for dealing with data format CARIBIC_NAmes_v02

    • edmond.mpdl.mpg.de
    • edmond.mpg.de
    txt, type/x-r-syntax
    Updated Mar 3, 2022
    Cite
    Walter, David (2022). Code for dealing with data format CARIBIC_NAmes_v02 [Dataset]. http://doi.org/10.17617/3.WDVSU7
    Available download formats: type/x-r-syntax (75684), txt (76894), txt (128015), txt (132902)
    Dataset updated
    Mar 3, 2022
    Dataset provided by
    Edmond
    Authors
    Walter, David
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    R and Igor code for reading and writing data files of format "CARIBIC_NAmes_v02". See https://doi.org/10.17617/3.10 for the file format description. That file format has been used predominantly within the CARIBIC and ATTO projects, for example in https://doi.org/10.17617/3.3r. The code files of this dataset can be used with the software R (r-project.org) or Igor Pro (https://www.wavemetrics.com/).

  10. SMART R-1 Radar Data, DORADE format

    • data.ucar.edu
    netcdf
    Updated Dec 26, 2024
    Cite
    Conrad L. Ziegler; Gordon D. Carrie; Michael I. Biggerstaff (2024). SMART R-1 Radar Data, DORADE format [Dataset]. http://doi.org/10.5065/D6C53J0T
    Available download formats: netcdf
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Conrad L. Ziegler; Gordon D. Carrie; Michael I. Biggerstaff
    Time period covered
    Jun 2, 2015 - Jul 9, 2015
    Description

    This dataset contains SMART R-1 Radar data collected during the Plains Elevated Convection at Night (PECAN) project from 2 June 2015 to 9 July 2015. The data are in DORADE format and are available as daily tar files. Each tar file contains an operator log and documentation file. An example readme file is linked below. The original data were in SIGMET format; manufacturer info is available by following the link to SIGMET manuals below.

  11. Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry

    • data.niaid.nih.gov
    • explore.openaire.eu
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
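
    To make the expected input concrete, here is a minimal sketch of an ‘untidy’ one-row data frame of the shape described above; the column names and cell contents are invented for illustration, and the mock dataset on GitHub shows the real format:

        # Hypothetical 4-column input: several lab panels can sit in one results cell
        dt <- data.frame(
          name_mrn        = "DOE,JANE (0001234)",
          collection_date = "2020-01-15",
          collection_time = "08:30",
          lab_results     = "Potassium 4.1 mmol/L; Sodium 139 mmol/L",
          stringsAsFactors = FALSE
        )
        # After sourcing eLAB (https://github.com/TheMillerLab/eLAB):
        # ehr_format(dt)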

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  12. Bellabeat Case Study

    • kaggle.com
    Updated Nov 23, 2023
    Cite
    Sierra Klimek (2023). Bellabeat Case Study [Dataset]. https://www.kaggle.com/datasets/sierraklimek/bellabeat-case-study/discussion
    Available formats: Croissant, a format for machine-learning datasets (learn more at mlcommons.org/croissant).
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sierra Klimek
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About the Company:

    Bellabeat is a small company manufacturing high-tech products, focused on bringing health-focused smart devices and other wellness products to women around the world. Since Urška Sršen and Sando Mur founded the company in 2013, they have seen it grow tremendously. Now they have asked for an analysis of non-Bellabeat smart device usage and how this data can be used to create new campaign strategies and drive future growth.

    Questions and Objectives

    Questions:

    • What are some trends in smart device usage?
    • How could these trends apply to Bellabeat customers?
    • How could these trends help influence Bellabeat marketing strategy?

    Objectives:

    • Utilize RStudio to clean and format the data
    • Visualize trends in the data, showing your findings
    • Identify opportunities for growth and recommendations for the Bellabeat marketing team

    R Programming Showcase

    Loading packages

        library(tidyverse)
        library(lubridate)
        library(dplyr)
        library(ggplot2)
        library(tidyr)

    Importing the datasets

    I utilized Fitbit Fitness tracker data, located here, for this project.

        activity <- read.csv("Fitabase_Data/dailyActivity_merged.csv")
        calories <- read.csv("Fitabase_Data/dailyCalories_merged.csv")
        sleep <- read.csv("Fitabase_Data/sleepDay_merged.csv")
        weight <- read.csv("Fitabase_Data/weightLogInfo_merged.csv")

    Viewing the data

    Using the View function, I'm able to skim through the datasets and make sure everything is imported correctly. I will also use this time to see if I need to clean the data in any way or format the data differently.

        View(activity)
        View(calories)
        View(sleep)
        View(weight)

    Formatting the data

    After viewing the datasets I see that I will need to format the dates and times to matching formats across all the datasets.

        sleep$SleepDay <- as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
        sleep$date <- format(sleep$SleepDay, format = "%m/%d/%y")
        activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
        activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
        weight$Date <- as.POSIXct(weight$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
        weight$time <- format(weight$Date, format = "%H:%M:%S")
        weight$date <- format(weight$Date, format = "%m/%d/%y")
        calories$date <- format(calories$ActivityDay, format = "%m/%d/%y")

    Summarizing the data

    Here I will be using the summary function to gather information about minimums, medians, averages, and maximums for certain columns in the datasets (i.e., Total Steps, Calories, Active Minutes, Minutes Asleep, Sedentary Minutes).

        activity %>% select(TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>% summary()
        activity %>% select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>% summary()
        calories %>% select(Calories) %>% summary()
        sleep %>% select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% summary()
        weight %>% select(WeightKg, BMI) %>% summary()

    Discoveries I made from summarizing the data:

    • Most participants in this dataset are lightly active (on a scale of light, moderate, and high)
    • Average sleep time is 7 hours
    • Average steps per day is 7638
    • Average weight is 72 kg, or 158 lbs

    Visualizing the data

    Now it's time to visualize our data with some scatter plots. I chose this form of visualization because it easily shows correlation and trends.

    [Scatter plot: Total Steps vs. Calories]

    • The first scatter plot shows a positive correlation between Total Steps and Calories, which shows that the more active we are, the more calories we burn

        ggplot(data=activity, aes(x=TotalSteps, y=Calories)) + geom_point(color='purple') + geom_smooth() + labs(title="Total Steps vs. Calories")

    [Scatter plot: Minutes Asleep vs. Sedentary Minutes]

    • The second scatter plot showcas...

  13. R scripts

    • figshare.com
    txt
    Updated May 10, 2018
    Cite
    Xueying Han (2018). R scripts [Dataset]. http://doi.org/10.6084/m9.figshare.5513170.v3
    Available download formats: txt
    Dataset updated
    May 10, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xueying Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R scripts in this fileset are those used in the PLOS ONE publication "A snapshot of translational research funded by the National Institutes of Health (NIH): A case study using behavioral and social science research awards and Clinical and Translational Science Awards funded publications." The article can be accessed here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196545

    This consists of all R scripts used for data cleaning, data manipulation, and statistical analysis in the publication. There are eleven files in total:

    1. "Step1a.bBSSR.format.grants.and.publications.data.R" combines all bBSSR 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
    2. "Step1b.BSSR.format.grants.and.publications.data.R" combines all BSSR-only 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
    3. "Step2a.bBSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated bBSSR publication data.
    4. "Step2b.BSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated BSSR-only publication data.
    5. "Step3.summary.stats.R" performs summary statistics.
    6. "Step4.time.to.first.publication.R" performs the time-to-first-publication analysis.
    7. "Step5.time.to.citation.analysis.R" performs the time-to-first-citation and time-to-overall-citation analyses.
    8. "Step6.combine.NIH.iCite.data.R" combines NIH iCite citation data.
    9. "Step7.iCite.data.analysis.R" performs citation analysis on the combined iCite data.
    10. "Step8.MeSH.descriptors.R" queries PubMed and pulls down all MeSH descriptors for all publications.
    11. "Step9.CTSA.publications.R" compares the percent of translational publications among bBSSR, BSSR-only, and CTSA publications.

  14. Modeling data and data for figures and text

    • datasets.ai
    • catalog.data.gov
    10, 57
    Updated Aug 27, 2024
    Cite
    U.S. Environmental Protection Agency (2024). Modeling data and data for figures and text [Dataset]. https://datasets.ai/datasets/modeling-data-and-data-for-figures-and-text
    Available download formats: 10, 57
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    U.S. Environmental Protection Agency
    Description

    The data in this archive is in a zipped R data binary format (https://cran.r-project.org/doc/manuals/r-release/R-data.html). These data can be read using the open-source and free-to-use statistical software package R (https://www.r-project.org/). The data are organized following the figure numbering in the manuscript, e.g. Figure 1a is fig1a, and contain the same labeling as the figures, including units and variable names. For a full explanation of each figure, please see the captions in the manuscript.

    To open this data file, use the following command in R:

        load("JKelly_NH4NO3_JGR_2018.rdata")

    To list the contents of the file, use the following command in R:

    ls()

    The data for each figure are contained in the data object with the figure's name. To list the data, simply type the name of the figure returned by the ls() command.

    The original model output and emissions used for this study are located on the ASM archived storage at /asm/ROMO/finescale/sjv2013. These data are in NetCDF format with self-contained metadata, including descriptive headers with variable names, units, and simulation times.

    This dataset is associated with the following publication: Kelly, J., C. Parworth, Q. Zhang, D. Miller, K. Sun, M. Zondlo , K. Baker, A. Wisthaler, J. Nowak , S. Pusede , R. Cohen , A. Weinheimer , A. Beyersdorf , G. Tonnesen, J. Bash, L. Valin, J. Crawford, A. Fried , and J. Walega. Modeling NH4NO3 Over the San Joaquin Valley During the 2013 DISCOVER‐AQ Campaign. JOURNAL OF GEOPHYSICAL RESEARCH-ATMOSPHERES. American Geophysical Union, Washington, DC, USA, 123(9): 4727-4745, (2018).

  15. Data from: High-Resolution Seismic-Reflection Boomer Profiles in SEG-Y and JPEG Formats From Cruise RAFA08034 off Edgartown, Massachusetts

    • catalog.data.gov
    • search.dataone.org
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). High-Resolution Seismic-Reflection Boomer Profiles in SEG-Y and JPEG Formats From Cruise RAFA08034 off Edgartown, Massachusetts (08034_BOOMERPROFILES) [Dataset]. https://catalog.data.gov/dataset/high-resolution-seismic-reflection-boomer-profiles-in-seg-y-and-jpeg-formats-from-cruise-r
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Massachusetts, Edgartown
    Description

    The U.S. Geological Survey (USGS), in cooperation with the National Oceanic and Atmospheric Administration (NOAA) and the Massachusetts Office of Coastal Zone Management (MA CZM), is producing detailed geologic maps of the coastal sea floor. Imagery, originally collected by NOAA for charting purposes, provides a fundamental framework for research and management activities along this part of the Massachusetts coastline, shows the composition and terrain of the seabed, and provides information on sediment transport and benthic habitat. Interpretive data layers were derived from the combined single-beam and multibeam echo-sounder data and sidescan-sonar data collected in the vicinity of Edgartown Harbor, Massachusetts. During August 2008 seismic-reflection profiles (Boomer and Chirp) were acquired, and during September 2008 bottom photographs and surficial sediment data were acquired as part of two ground-truth reconnaissance surveys.

  16. GAL Predictions of receptor impact variables v01

    • data.gov.au
    • cloud.csiss.gmu.edu
    zip
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). GAL Predictions of receptor impact variables v01 [Dataset]. https://data.gov.au/data/dataset/67e0aec1-be25-46f5-badc-b4d895a934aa
    Available download formats: zip
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    Receptor impact models (RIMs) are developed for specific landscape classes. The prediction of receptor impact variables is a multi-stage process. It relies on the runs from surface water and groundwater models at nodes within the analysis extent. These outputs derive directly from the hydrological model. For a given node, there is a value for each combination of hydrological response variable, future, and replicate or run number. Not all variables may be available or appropriate at every node. This differs from the quantile summary information that is otherwise used to summarise the HRV output and is also registered.

    Dataset History

    There is a key look-up table (Excel file) that lists the assessment units (AUIDs) by landscape class (or landscape group if appropriate) and notes the groundwater modelling node and runs, and the surface water modelling node and runs, that should be used for that AUID. In some cases the AUID is mapped to only one set of hydrological modelling output. This look-up table represents the AUIDs that require RIV predictions. For NAM and GAL there is a single look-up table. For GLO and HUN, surface water and groundwater are provided separately.

    Receptor impact models (RIMs) are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised into an array by the R script RIM_Prediction_CreateArray.R. The formatted data are available in an R data file format called RDS and can be read directly into R.

    The R script IMIA_XXX_RIM_predictions.R applies the receptor model functions (RDS object as part of Data set 1: Ecological expert elicitation and receptor impact models for the XXX subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact variables (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the XXX_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS.
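
    As a minimal sketch of the two R patterns described above (the file name is hypothetical, and the quantile call is applied to the array values only for illustration):

        # Read a formatted RDS file, e.g. the HRV array built by RIM_Prediction_CreateArray.R
        hrv_array <- readRDS("HRV_array_landscape_class.rds")
        # Summarise predictions by the 5th-95th percentiles in 5% increments,
        # the same summary reported in XXX_RIV_quantiles_IMIA.csv
        quantile(as.numeric(hrv_array), probs = seq(0.05, 0.95, by = 0.05))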

    Dataset Citation

    Bioregional Assessment Programme (2018) GAL Predictions of receptor impact variables v01. Bioregional Assessment Derived Dataset. Viewed 10 December 2018, http://data.bioregionalassessments.gov.au/dataset/67e0aec1-be25-46f5-badc-b4d895a934aa.

    Dataset Ancestors

  17. maestro in WebDataset Format

    • zenodo.org
    tar
    Updated Feb 12, 2025
    Cite
    Niu Yadong (2025). maestro in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14858022
    Available download formats: tar
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the maestro dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf maestro_train_000000.tar|head       
    -r--r--r-- bigdata/bigdata 327458 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.json
    -r--r--r-- bigdata/bigdata 120375940 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.wav
    -r--r--r-- bigdata/bigdata  625054 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.json
    -r--r--r-- bigdata/bigdata 137713368 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.wav
    -r--r--r-- bigdata/bigdata  356393 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.json
    -r--r--r-- bigdata/bigdata 132159804 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.wav
    -r--r--r-- bigdata/bigdata  255210 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    -r--r--r-- bigdata/bigdata 58523088 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.wav
    -r--r--r-- bigdata/bigdata  1190145 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.json
    -r--r--r-- bigdata/bigdata 390151460 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.wav
    

    $ cat ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    [
      ...
      {"start": 323.546875, "end": 323.5859375, "note": 51}, 
      {"start": 323.703125, "end": 323.74869791666663, "note": 51}, 
      {"start": 323.8450520833333, "end": 323.8919270833333, "note": 51}, 
      {"start": 324.00390625, "end": 324.0442708333333, "note": 51},
      ...
    ]
  18. SLAFEEL: R scripts and reformatted data analyzed by Alamil et al. (2019)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Samuel Soubeyrand (2020). SLAFEEL: R scripts and reformatted data analyzed by Alamil et al. (2019) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1410438
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Samuel Soubeyrand
    Gaël Thébaud
    Joseph Hughes
    Cécile Desbiez
    Karine Berthier
    Maryam Alamil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SLAFEEL: Statistical Learning Approach For Estimating Epidemiological Links from deep sequencing data

    This archive contains R scripts for running analyses proposed by Alamil et al. (2019; Inferring epidemiological links from deep sequencing data: a statistical learning approach for human, animal and plant diseases), namely:

    • functions.R, which contains the R functions required for computations;
    • influenza.R, ebola.R and potyvirus.R, where the analyses are implemented for each case study; and
    • influenza-format-genomic-data.R, giving an example of how to format data to be used in the statistical learning approach.

    This archive also contains the reformatted data analyzed by Alamil et al. (2019). The datasets provided concern swine influenza virus (reformatted from Murcia et al., 2012), Ebola virus (reformatted from Gire et al., 2014) and a wild salsify potyvirus. Two rds files are provided for swine influenza: the first for the naive chain, the second for the vaccinated chain. The Ebola rds files are compressed into the archive ebolaRDS.zip. The rds files can be loaded in the R statistical software with the command readRDS(filename), which returns a list. The list contains a "readme" item describing the contents of the list, as well as a "host.table" item providing metadata about host units and a "set.of.sequences" item providing sequencing data formatted as numeric matrices.
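
    As a quick sketch (the file name here is hypothetical; use the name of whichever rds file you downloaded), loading and inspecting one of these lists looks like:

        # Load one of the provided rds files
        d <- readRDS("influenza_naive_chain.rds")
        cat(d$readme)              # describes the contents of the list
        head(d$host.table)         # metadata about host units
        str(d$set.of.sequences)    # sequencing data as numeric matrices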

    Murcia PR, Hughes J, Battista P, Lloyd L, Baillie GJ, Ramirez-Gonzalez RH, et al. Evolution of an Eurasian avian-like influenza virus in naive and vaccinated pigs. PLoS Pathogens. 2012;8(5):e1002730.

    Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345:1369–1372

    Funded by the ANR - Project name: SMITID (2016-2020) - Grant number: ANR-16-CE35-0006

  19. Data from: Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/decision-support-framework-for-linking-regional-scale-management-actions-to-continental-sc
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data release presents the data, JAGS models, and R code used to manipulate data and to produce the results and figures presented in the USGS Open File Report "Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species" (https://doi.org/10.5066/P93YTR3X). The zip folder is provided so that others can reproduce results from the integrated population model (IPM), inspect model structure and posterior simulations, conduct analyses not presented in the report, and use and modify the code. Raw source data can be obtained from the USGS Bird Banding Laboratory, the USFWS Surveys and Monitoring Branch, the National Oceanic and Atmospheric Administration, and Ducks Unlimited Canada. The zip file contains the following objects when extracted:

    * Readme.txt: A plain text file describing each file in this directory.
    * Figures-Pintail-IPM.r: R code that generates the report figures in png, pdf, and eps format. Generates Figures 2-11 and calls source code for Figures 12 and 13 found in other files.
    * get pintail IPM data.r: R source code that must be run to format data for the IPM code file.
    * getbandrecovs.r: R code that takes Bird Banding Lab data for pintail band releases and recoveries and formats it for analysis. This file is called by 'get pintail IPM data.r'. The file was originally written by Scott Boomer (USFWS) and modified by Erik Osnas for use in the IPM.
    * Model_1_post.txt: Text representation of the posterior simulations from Model 1. This file can be read by the R function dget() to produce an R list object that contains the posterior draws from Model 1. The list is the BUGSoutput$sims.list object from a call to rjags::jags.
    * Model_2_post.txt: As above, but for Model 2.
    * Model_S1_post.txt: As above, but for Model S1.
    * Pintail IPM.r: The main file that defines the IPM models in JAGS, structures the data for JAGS, defines initial values, and runs the models. Outputs are text files containing the JAGS model files and R workspaces containing all data, models, and results, including the output from the jags() function. From this, the BUGSoutput$sims.list object was written to text for each model.
    * MSY_metrics.txt: Summary of results produced by running the code in source_figure_12.R. This table is a text representation of a summary of the maximum sustained yield analysis at various mean rainfall levels, used for Table 1 of the report; it can be reproduced by running the code in source_figure_12.R. To understand the structure of this file, consult the code file and the structure of the R objects it creates; otherwise, consult Figure 12 and Table 1 in the report.
    * source_figure_12.R: R code to produce Figure 12. The code is written to work with the R workspace output from Model 1, but it can be modified to use the Model_1_post.txt file without re-running the model, which allows use of the same posterior realizations as in the report.
    * source_figure_13.R: The code used to produce the results for Figure 13. Required here are the posterior from Model 1 and data for the Prairie Parkland Model based on Jim Devries/Ducks Unlimited data; these are described in the report text.
    * Data: A directory that contains the raw data used for this report.
    * Data/2015_LCC_Networks_shapefile: A directory that contains ESRI shapefiles used in Figure 1 and to define the boundaries of the Landscape Conservation Cooperatives. Found at https://www.sciencebase.gov/catalog/item/55b943ade4b09a3b01b65d78
    * Data/bndg_1430_yr1960up_DBISC_03042014.csv: A comma-delimited file of banded pintail from 1960 to 2014, obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce an 'm-array' used in the IPM. A data dictionary describing the codes for each field is available at https://www.pwrc.usgs.gov/BBL/manual/summary.cfm
    * Data/cponds.csv: A comma-delimited file of estimated Canadian ponds based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given are the year, point estimate, and estimated standard error.
    * Data/enc_1430_yr1960up_DBISC_03042014.csv: A comma-delimited file of encounters of banded pintail, obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce an 'm-array' used in the IPM. A data dictionary describing the codes for each field is available at https://www.pwrc.usgs.gov/BBL/manual/enc.cfm
    * Data/nopiBPOP19552014.csv: A comma-delimited file of estimated northern pintail based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given are the year, the pintail point estimate (bpop) and estimated standard error (bpopSE), the mean latitude (lat) and latitude variance (latVAR) of the pintail population, and the mean longitude (lon) and longitude variance (lonVAR).
    * Data/Summary Climate Data California CV 2.csv: Rainfall data for the California Central Valley downloaded from the National Climate Data Center (www.ncdc.noaa.gov/cdo-web/), as described in the report text (https://doi.org/10.5066/P93YTR3X) and in the publication at https://doi.org/10.1002/jwmg.21124. Used in 'get pintail IPM data.r' for the IPM.
    * Data/Summary data MAV.csv: Rainfall data for the Mississippi Alluvial Valley downloaded from the National Climate Data Center (www.ncdc.noaa.gov/cdo-web/), as described in the report text and in the publication at https://doi.org/10.1002/jwmg.21124. Used in 'get pintail IPM data.r' for the IPM.
    * Data/Wing data 1961 2011 NOPI.txt: Comma-delimited text file of pintail wing age data for 1961 to 2011 from the Parts Collection Survey. Each row is an individual wing, with sex cohorts 4 = male and 5 = female and age cohorts 1 = After Hatch Year and 2 = Hatch Year. Wt is a weighting factor that determines how many harvested pintails each wing represents; see USFWS documentation for the Parts Collection Survey for descriptions. Summing Wt for each age, sex, and year gives an estimate of the number of pintail harvested. Used in 'get pintail IPM data.r' for the IPM.
    * Data/Wing data 2012 2013 NOPI.csv: Same as 'Wing data 1961 2011 NOPI.txt' but for years 2012 and 2013.
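
    For example, the posterior simulations can be loaded back into R without re-running JAGS. A minimal sketch; the parameter name 'Nhat' is hypothetical, so check names(post) for the parameters actually monitored:

    # dget() reads the text file back into an R list of posterior draws
    # (the BUGSoutput$sims.list object described above)
    post <- dget("Model_1_post.txt")
    names(post)                            # parameters monitored in the model
    draws <- post$Nhat                     # hypothetical parameter name
    quantile(draws, c(0.025, 0.5, 0.975))  # posterior median and 95% credible interval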

  20. Data from: Prediction models based on 28 point samples, link to model output in text format

    • datadiscoverystudio.org
    • doi.pangaea.de
    • +1more
    Updated 2014
    Cite
    Martínez Arbizu, Pedro; Schnurr, Sarah; Ostmann, Alexandra (2014). Prediction models based on 28 point samples, link to model output in text format [Dataset]. http://doi.org/10.1594/PANGAEA.831942
    Dataset updated
    2014
    Authors
    Martínez Arbizu, Pedro; Schnurr, Sarah; Ostmann, Alexandra
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Technical Information: The prediction models based on 28 point samples were produced using random forest regression and Multivariate Adaptive Regression Splines (MARS), implemented in the 'randomForest' and 'earth' packages of the R software (http://www.r-project.org).
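
    As a rough illustration of this modelling approach (not the authors' actual code), both model types can be fit in R as follows; the data frame and variable names are hypothetical:

    library(randomForest)
    library(earth)
    # samples: hypothetical data frame of the 28 point samples, with a
    # response 'y' and environmental predictors 'x1' and 'x2'
    rf   <- randomForest(y ~ x1 + x2, data = samples, ntree = 500)
    mars <- earth(y ~ x1 + x2, data = samples)
    # predict over a grid of new predictor values (hypothetical data frame
    # 'grid') to produce the prediction surfaces
    pred_rf   <- predict(rf,   newdata = grid)
    pred_mars <- predict(mars, newdata = grid)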
