14 datasets found
  1. Data and R-script for a tutorial that explains how to convert spreadsheet...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 19, 2024
    Cite
    Joachim Goedhart; Joachim Goedhart (2024). Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. [Dataset]. http://doi.org/10.5281/zenodo.4056966
    Explore at:
    Available download formats: bin, csv, png
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joachim Goedhart; Joachim Goedhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
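The core move such a tutorial teaches, reshaping a wide spreadsheet into tidy (long) form, can be sketched in R with tidyr. This is a minimal sketch: the column names below are hypothetical, not taken from the tutorial's own files.

```r
library(tidyr)

# Hypothetical wide-format spreadsheet: one column per condition
wide <- data.frame(
  sample  = c("s1", "s2", "s3"),
  control = c(1.0, 1.2, 0.9),
  treated = c(2.1, 2.3, 1.8)
)

# Pivot to tidy form: one observation per row
tidy <- pivot_longer(wide, cols = c(control, treated),
                     names_to = "condition", values_to = "value")
```

Each of the three samples now contributes one row per condition, so downstream grouping and plotting need no column-wise special cases.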

  2. R-tidy-base

    • huggingface.co
    Cite
    zhao, R-tidy-base [Dataset]. https://huggingface.co/datasets/zixiao/R-tidy-base
    Explore at:
    Authors
    zhao
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    zixiao/R-tidy-base dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. R Downloads from Tidy Tuesday

    • kaggle.com
    Updated Jan 4, 2019
    Cite
    Ángela Castillo-Gill (2019). R Downloads from Tidy Tuesday [Dataset]. https://www.kaggle.com/adcastillogill/r_downloads/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ángela Castillo-Gill
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ángela Castillo-Gill

    Released under CC0: Public Domain


  4. R codes for datasets derived from “Kamus Bahasa Enggano”, the printed...

    • osf.io
    Updated Jan 23, 2025
    Cite
    Gede Primahadi Wijaya Rajeg; Charlotte Hemmings; Engga Zakaria Sangian; Dendi Wijaya; I Wayan Arka (2025). R codes for datasets derived from “Kamus Bahasa Enggano”, the printed learner’s dictionary of Contemporary Enggano [Dataset]. http://doi.org/10.17605/OSF.IO/JM4FN
    Explore at:
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Center For Open Science
    Authors
    Gede Primahadi Wijaya Rajeg; Charlotte Hemmings; Engga Zakaria Sangian; Dendi Wijaya; I Wayan Arka
    Area covered
    Enggano Island
    Description

    This repository tracks and documents the R codes used to transform the .lift XML export of the Enggano learner's dictionary FLEx project into a tidy tabular form in three file formats: .rds (R data file), .csv, and .tsv. This repository is synced from its original GitHub repository at https://github.com/engganolang/eno-learner-lift. Check that GitHub repository for further updates.
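The general shape of such an XML-to-tidy transformation can be sketched with xml2. This is a hedged illustration against a minimal, made-up LIFT-style fragment, not the project's actual code or schema.

```r
library(xml2)

# Tiny made-up LIFT-style fragment for illustration only
doc <- read_xml('
  <lift>
    <entry id="e1">
      <lexical-unit><form lang="eng"><text>house</text></form></lexical-unit>
    </entry>
  </lift>')

# One row per <entry>: pull the id attribute and the headword text
entries <- xml_find_all(doc, ".//entry")
tidy <- data.frame(
  id       = xml_attr(entries, "id"),
  headword = xml_text(xml_find_first(entries, ".//lexical-unit/form/text"))
)
# From here, saveRDS(), write.csv(), and write.table(..., sep = "\t")
# would cover the three output formats mentioned above.
```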

  5. Brisbane Library Checkout Data

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Nicholas Tierney; Nicholas Tierney (2020). Brisbane Library Checkout Data [Dataset]. http://doi.org/10.5281/zenodo.2437860
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Tierney; Nicholas Tierney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brisbane
    Description

    This has been copied from the README.md file

    bris-lib-checkout

    This provides tidied up data from the Brisbane library checkouts

    Retrieving and cleaning the data

    The script for retrieving and cleaning the data is made available in scrape-library.R.

    The data

    • The data/ folder contains the tidy data
    • The data-raw/ folder contains the raw data

    data/

    This contains four tidied up dataframes:

    • tidy-brisbane-library-checkout.csv
    • metadata_branch.csv
    • metadata_heading.csv
    • metadata_item_type.csv

    tidy-brisbane-library-checkout.csv contains the following columns, with the metadata file metadata_heading containing the description of these columns.

    knitr::kable(readr::read_csv("data/metadata_heading.csv"))
    #> Parsed with column specification:
    #> cols(
    #> heading = col_character(),
    #> heading_explanation = col_character()
    #> )

    heading            heading_explanation
    Title              Title of Item
    Author             Author of Item
    Call Number        Call Number of Item
    Item id            Unique Item Identifier
    Item Type          Type of Item (see next column)
    Status             Current Status of Item
    Language           Published language of item (if not English)
    Age                Suggested audience
    Checkout Library   Checkout branch
    Date               Checkout date

    We also added year, month, and day columns.
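Date parts like these can be derived in base R alone; the following is an illustration, not the repository's own scrape-library.R code.

```r
# Split a checkout date into year/month/day components (base R only)
d <- as.Date("2018-03-05")
year  <- as.integer(format(d, "%Y"))
month <- as.integer(format(d, "%m"))
day   <- as.integer(format(d, "%d"))
```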

    The remaining data are all metadata files that contain meta information on the columns in the checkout data:

    library(tidyverse)
    #> ── Attaching packages ────────────── tidyverse 1.2.1 ──
    #> ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
    #> ✔ tibble 1.4.99.9006 ✔ dplyr 0.7.8
    #> ✔ tidyr 0.8.2 ✔ stringr 1.3.1
    #> ✔ readr 1.3.0 ✔ forcats 0.3.0
    #> ── Conflicts ───────────────── tidyverse_conflicts() ──
    #> ✖ dplyr::filter() masks stats::filter()
    #> ✖ dplyr::lag() masks stats::lag()
    knitr::kable(readr::read_csv("data/metadata_branch.csv"))
    #> Parsed with column specification:
    #> cols(
    #> branch_code = col_character(),
    #> branch_heading = col_character()
    #> )

    branch_code   branch_heading
    ANN           Annerley
    ASH           Ashgrove
    BNO           Banyo
    BRR           BrackenRidge
    BSQ           Brisbane Square Library
    BUL           Bulimba
    CDA           Corinda
    CDE           Chermside
    CNL           Carindale
    CPL           Coopers Plains
    CRA           Carina
    EPK           Everton Park
    FAI           Fairfield
    GCY           Garden City
    GNG           Grange
    HAM           Hamilton
    HPK           Holland Park
    INA           Inala
    IPY           Indooroopilly
    MBG           Mt. Coot-tha
    MIT           Mitchelton
    MTG           Mt. Gravatt
    MTO           Mt. Ommaney
    NDH           Nundah
    NFM           New Farm
    SBK           Sunnybank Hills
    SCR           Stones Corner
    SGT           Sandgate
    VAN           Mobile Library
    TWG           Toowong
    WND           West End
    WYN           Wynnum
    ZIL           Zillmere

    knitr::kable(readr::read_csv("data/metadata_item_type.csv"))
    #> Parsed with column specification:
    #> cols(
    #> item_type_code = col_character(),
    #> item_type_explanation = col_character()
    #> )

    item_type_code   item_type_explanation
    AD-FICTION       Adult Fiction
    AD-MAGS          Adult Magazines
    AD-PBK           Adult Paperback
    BIOGRAPHY        Biography
    BSQCDMUSIC       Brisbane Square CD Music
    BSQCD-ROM        Brisbane Square CD Rom
    BSQ-DVD          Brisbane Square DVD
    CD-BOOK          Compact Disc Book
    CD-MUSIC         Compact Disc Music
    CD-ROM           CD Rom
    DVD              DVD
    DVD_R18+         DVD Restricted - 18+
    FASTBACK         Fastback
    GAYLESBIAN       Gay and Lesbian Collection
    GRAPHICNOV       Graphic Novel
    ILL              InterLibrary Loan
    JU-FICTION       Junior Fiction
    JU-MAGS          Junior Magazines
    JU-PBK           Junior Paperback
    KITS             Kits
    LARGEPRINT       Large Print
    LGPRINTMAG       Large Print Magazine
    LITERACY         Literacy
    LITERACYAV       Literacy Audio Visual
    LOCSTUDIES       Local Studies
    LOTE-BIO         Languages Other than English Biography
    LOTE-BOOK        Languages Other than English Book
    LOTE-CDMUS       Languages Other than English CD Music
    LOTE-DVD         Languages Other than English DVD
    LOTE-MAG         Languages Other than English Magazine
    LOTE-TB          Languages Other than English Taped Book
    MBG-DVD          Mt Coot-tha Botanical Gardens DVD
    MBG-MAG          Mt Coot-tha Botanical Gardens Magazine
    MBG-NF           Mt Coot-tha Botanical Gardens Non Fiction
    MP3-BOOK         MP3 Audio Book
    NONFIC-SET       Non Fiction Set
    NONFICTION       Non Fiction
    PICTURE-BK       Picture Book
    PICTURE-NF       Picture Book Non Fiction
    PLD-BOOK         Public Libraries Division Book
    YA-FICTION       Young Adult Fiction
    YA-MAGS          Young Adult Magazine
    YA-PBK           Young Adult Paperback

    Example usage

    Let’s explore the data

    bris_libs <- readr::read_csv("data/bris-lib-checkout.csv")
    #> Parsed with column specification:
    #> cols(
    #> title = col_character(),
    #> author = col_character(),
    #> call_number = col_character(),
    #> item_id = col_double(),
    #> item_type = col_character(),
    #> status = col_character(),
    #> language = col_character(),
    #> age = col_character(),
    #> library = col_character(),
    #> date = col_double(),
    #> datetime = col_datetime(format = ""),
    #> year = col_double(),
    #> month = col_double(),
    #> day = col_character()
    #> )
    #> Warning: 20 parsing failures.
    #> row col expected actual file
    #> 587795 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    #> 590579 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    #> 590597 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    #> 595774 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    #> 597567 item_id a double REFRESH 'data/bris-lib-checkout.csv'
    #> ...... ....... ........ ....... ............................
    #> See problems(...) for more details.

    We can count the number of titles, item types, suggested age, and the library given:

    library(dplyr)
    count(bris_libs, title, sort = TRUE)
    #> # A tibble: 121,046 x 2
    #> title n
    #>

    License

    This data is provided under a CC BY 4.0 license

    It has been downloaded from Brisbane library checkouts, and tidied up using the code in data-raw.

  6. Beach Volleyball

    • kaggle.com
    Updated May 18, 2020
    Cite
    Jesse Mostipak (2020). Beach Volleyball [Dataset]. https://www.kaggle.com/jessemostipak/beach-volleyball/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 18, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jesse Mostipak
    Description

    Beach Volleyball

    The data this week comes from Adam Vagnar, who also blogged about this dataset. There's a LOT of data here: match-level results, player details, and match-level statistics for some matches. In this dataset all matches are played 2 vs 2, so there are columns for two winners (one team) and two losers (one team). The data is relatively clean and ready for analysis, although there are some duplicated columns and the data is wide because there are two players per team.

    Check out the data dictionary, or Wikipedia for some longer-form details around what the various match statistics mean.

    Most of the data is from the international FIVB tournaments but about 1/3 is from the US-centric AVP.

    The FIVB Beach Volleyball World Tour (known between 2003 and 2012 as the FIVB Beach Volleyball Swatch World Tour for sponsorship reasons) is the worldwide professional beach volleyball tour for both men and women organized by the Fédération Internationale de Volleyball (FIVB). The World Tour was introduced for men in 1989 while the women first competed in 1992.

    Winning the World Tour is considered to be one of the highest honours in international beach volleyball, being surpassed only by the World Championships, and the Beach Volleyball tournament at the Summer Olympic Games.

    FiveThirtyEight examined the disadvantage of serving in beach volleyball, although they used Olympic-level data. Again, Adam Vagnar also covered this data on his blog.

    What is Tidy Tuesday?

    TidyTuesday is a weekly data project aimed at the R ecosystem. As this project was borne out of the R4DS Online Learning Community and the R for Data Science textbook, an emphasis was placed on understanding how to summarize and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem. However, any code-based methodology is welcome - just please remember to share the code used to generate the results.

    Join the R4DS Online Learning Community in the weekly #TidyTuesday event! Every week we post a raw dataset, a chart or article related to that dataset, and ask you to explore the data. While the dataset will be “tamed”, it will not always be tidy!

    We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.

    The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data.

  7. Keep Wales Tidy: Blue Flag Awards (3rd Party Data)

    • metadata.naturalresources.wales
    Cite
    Keep Wales Tidy, Keep Wales Tidy: Blue Flag Awards (3rd Party Data) [Dataset]. https://metadata.naturalresources.wales/geonetwork/srv/api/records/EXT_DS119113
    Explore at:
    Description

    This is a spatial dataset showing the location of Blue Flag beaches across Wales. 2018 marked the 30th year of the Blue Flag Award in Wales, which is generally considered the 'gold standard' for beaches across the world. The Blue Flag Programme is owned by the non-governmental, non-profit organisation 'Foundation for Environmental Education' (FEE). The Blue Flag Programme was started in France in 1985. It has been operating in Europe since 1987 and in areas outside of Europe since 2001. The programme is currently in operation in 46 countries across the world. In Wales, the award is managed by Keep Wales Tidy.

  8. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Explore at:
    Dataset updated
    Oct 15, 2019
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.

  9. Code and data from simulations that apply multiple regression analysis...

    • zenodo.org
    bin, csv
    Updated Jun 12, 2025
    Cite
    Takeharu SEKI; Takeharu SEKI (2025). Code and data from simulations that apply multiple regression analysis models to biased occurrence data to detect thermophilization. [Dataset]. http://doi.org/10.5281/zenodo.13431533
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Takeharu SEKI; Takeharu SEKI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    READ ME

    Description of this repository

    This repository houses the code and data for simulations that apply multiple regression analysis models to biased occurrence data to detect thermophilization.

    Explanation of each file

    SimulationCode.R

    This R code simulates the application of a multiple regression analysis model to biased occurrence data to detect thermophilization.

    Note: To reduce running time, we used a parallel computation approach (run time of approximately 30 minutes). Since seven CPUs were used, an equal or greater number of CPUs is required to reproduce the same results.
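A snowfall-based parallel run of this kind generally follows the pattern below. This is a sketch of the approach only: the toy squaring function stands in for the actual simulation, and the cluster setup shown is an assumption, not the repository's code.

```r
library(snowfall)

# Start a local cluster (the paper used cpus = 7)
sfInit(parallel = TRUE, cpus = 2)

# In the real code, sfExport()/sfLibrary() would push data and
# packages to the workers before the parallel apply.
res <- sfLapply(1:4, function(i) i^2)  # toy stand-in for one iteration

sfStop()
```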

    01_GeneratedDistributionData.csv

    Simulation-generated distribution data of fictitious biota species. The column names are explained below.

    Column Name        Explanation
    IndID              Unique individual identification number
    SpeciesID          Unique identification number for the species to which the individual belongs
    Step               Step in which the individual exists
    LTI                Local Temperature Index (LTI) of the location where the individual occurred
    SpeciesLTICenter   Central value of the species-specific LTI at the time of its Step
    Prob.BiasToWarm    Value of weighting sampled when Bias to Warm is present
    Prob.BiasToCold    Value of weighting sampled when Bias to Cold is present

    02_ExtractedBiasedOccurrenceData.csv

    The result of extracting 2,000 biased occurrence records from the distribution data.

    Column Name   Explanation
    IndID         Unique identification number of the extracted individual
    SpeciesID     Unique identification number for the species to which the individual belongs
    Step          Step in which the individual is extracted
    LTI           Local Temperature Index (LTI) of the location where the individual occurred
    EstSTI        Species Temperature Index (STI) of the recorded species, calculated on the basis of the occurrence data
    BiasType      The type of bias
    iter          The iteration number

    Reference

    This simulation code uses the following packages.

    {tidyverse} package,

     Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
    

    {broom} package,

    Robinson D, Hayes A, Couch S (2024). _broom: Convert Statistical Objects into Tidy Tibbles_. R package version 1.0.7, <https://github.com/tidymodels/broom>.

    {rlist} package.

    Ren K (2021). _rlist: A Toolbox for Non-Tabular Data Manipulation_. R package version 0.4.6.2, <https://CRAN.R-project.org/package=rlist>.

    {data.table} package

    Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). _data.table: Extension of `data.frame`_. R package version 1.15.4, <https://CRAN.R-project.org/package=data.table>.
    

    {snowfall} package

    Knaus J (2023). _snowfall: Easier Cluster Computing (Based on 'snow')_. R package version 1.84-6.3, <https://CRAN.R-project.org/package=snowfall>.

    {magrittr} package

    Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.

    {ggpmisc} package

    Aphalo P (2024). _ggpmisc: Miscellaneous Extensions to 'ggplot2'_. R package version 0.5.6, <https://CRAN.R-project.org/package=ggpmisc>.

    {effsize} package

    Torchiano M (2020). _effsize: Efficient Effect Size Computation_. doi:10.5281/zenodo.1480624 <https://doi.org/10.5281/zenodo.1480624>, R package version 0.8.1, <https://CRAN.R-project.org/package=effsize>.

    {conflicted} package

    Wickham H (2023). _conflicted: An Alternative Conflict Resolution Strategy_. R package version 1.2.0, <https://CRAN.R-project.org/package=conflicted>.

  10. Iris Flower Data Set Cleaned

    • kaggle.com
    Updated Mar 27, 2020
    Cite
    Data-Science Sean (2020). Iris Flower Data Set Cleaned [Dataset]. https://www.kaggle.com/larsen0966/iris-flower-data-set-cleaned/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data-Science Sean
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. British statistician Ronald Fisher introduced the Iris flower data set in 1936. Fisher published a paper that described the use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

  11. Data from: Do Current Language Models Support Code Intelligence for R...

    • zenodo.org
    zip
    Updated Oct 1, 2024
    Cite
    Zixiao Zhao; Zixiao Zhao (2024). Do Current Language Models Support Code Intelligence for R Programming Language? [Dataset]. http://doi.org/10.5281/zenodo.13871742
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zixiao Zhao; Zixiao Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 18, 2013
    Description

    This is the dataset used in the paper: Do Current Language Models Support Code Intelligence for R Programming Language?

    This dataset contains code snippets from R programming language repositories on GitHub, paired with their corresponding natural language (NL) descriptions. It was created for research in software engineering tasks like code summarization and code search. The data was collected using the GitHub REST API and includes over 1,500 public R repositories. To ensure quality, only active, well-structured R packages with proper documentation were included. Roxygen2, a popular documentation framework, was used to extract both the code and its matching NL descriptions.

    The dataset is organized into three parts: base R functions (Base), functions from the tidyverse (Tidy), and a combined set (RCombine). The dataset follows the CodeSearchNet format, with a split for training, validation, and testing data, ensuring no duplicate functions.

  12. Data from: The regressinator: A simulation tool for teaching regression...

    • tandf.figshare.com
    txt
    Updated Jun 18, 2025
    Cite
    Alex Reinhart (2025). The regressinator: A simulation tool for teaching regression assumptions and diagnostics in R [Dataset]. http://doi.org/10.6084/m9.figshare.29361136.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Alex Reinhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When students learn linear regression, they must learn to use diagnostics to check and improve their models. Model-building is an expert skill requiring the interpretation of diagnostic plots, an understanding of model assumptions, the selection of appropriate changes to remedy problems, and an intuition for how potential problems may affect results. Simulation offers opportunities to practice these skills, and is already widely used to teach important concepts in sampling, probability, and statistical inference. Visual inference, which uses simulation, has also recently been applied to regression instruction. This article presents the regressinator, an R package designed to facilitate simulation and visual inference in regression settings. Simulated regression problems can be easily defined with minimal programming, using the same modeling and plotting code students may already learn. The simulated data can then be used for model diagnostics, visual inference, and other activities, with the package providing functions to facilitate common tasks with a minimum of programming. Example activities covering model diagnostics, statistical power, and model selection are shown for both advanced undergraduate and Ph.D.-level regression courses.

  13. Tidy Data for Swelling Manuscript

    • figshare.com
    txt
    Updated Aug 13, 2020
    Cite
    Nate Richbourg (2020). Tidy Data for Swelling Manuscript [Dataset]. http://doi.org/10.6084/m9.figshare.12442730.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nate Richbourg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All of the input and output data processed by the R files for the swelling manuscript.

  14. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label                  Data type  Description
    isogramy               int        The order of isogramy, e.g. "2" is a second-order isogram
    length                 int        The length of the word in letters
    word                   text       The actual word/isogram in ASCII
    source_pos             text       The Part of Speech tag from the original corpus
    count                  int        Token count (total number of occurrences)
    vol_count              int        Volume count (number of different sources which contain the word)
    count_per_million      int        Token count per million words
    vol_count_as_percent   int        Volume count as percentage of the total number of volumes
    is_palindrome          bool       Whether the word is a palindrome (1) or not (0)
    is_tautonym            bool       Whether the word is a tautonym (1) or not (0)
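The "order of isogramy" column can be illustrated with a short Python function. This is an illustrative reimplementation of the definition above (an order-n isogram has every letter occurring exactly n times), not the repository's own isograms.py.

```python
from collections import Counter

def isogram_order(word):
    """Return n if every letter occurs exactly n times, else 0."""
    counts = set(Counter(word.lower()).values())
    return counts.pop() if len(counts) == 1 else 0

print(isogram_order("cat"))   # 1: first-order isogram
print(isogram_order("deed"))  # 2: second-order isogram
print(isogram_order("book"))  # 0: not an isogram ('o' repeats, others do not)
```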

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label               Data type  Description
    !total_1grams       int        The total number of words in the corpus
    !total_volumes      int        The total number of volumes (individual sources) in the corpus
    !total_isograms     int        The total number of isograms found in the corpus (before compacting)
    !total_palindromes  int        How many of the isograms found are palindromes
    !total_tautonyms    int        How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database, which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the “create-database.sql” script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called “isograms.db”.

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.

