Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
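As a hedged illustration of the tutorial's theme (not its actual R-script), a wide spreadsheet-style table can be reshaped into tidy long format with tidyr; the column names below are hypothetical:

library(tidyr)
# Hypothetical wide table: one row per sample, one column per measurement day
wide <- data.frame(sample = c("A", "B"),
                   day1 = c(1.2, 0.9),
                   day2 = c(1.8, 1.4))
# Tidy version: one row per observation
tidy <- pivot_longer(wide, cols = starts_with("day"),
                     names_to = "day", values_to = "value")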
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
zixiao/R-tidy-base dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ángela Castillo-Gill
Released under CC0: Public Domain
This repository tracks and documents the R code used to transform the .lift XML export of the Enggano learner's dictionary FLEx project into a tidy tabular form in three file formats: .rds (R data file), .csv, and .tsv. This repository is synced from its original GitHub repository at https://github.com/engganolang/eno-learner-lift. Check that GitHub repository for further updates.
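For orientation, a hedged sketch (not the repository's code) of how a .lift export can be read into a tidy table with xml2. LIFT stores each entry as an <entry> element with a <lexical-unit><form><text> headword, but the file name and XPaths here are assumptions:

library(xml2)
library(tibble)
lift <- read_xml("eno-learner.lift")   # hypothetical file name
entries <- xml_find_all(lift, ".//entry")
dict <- tibble(
  id = xml_attr(entries, "id"),        # entry identifier attribute
  headword = xml_text(xml_find_first(entries, ".//lexical-unit/form/text"))
)
saveRDS(dict, "eno-learner.rds")       # the repository also writes .csv and .tsv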
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This has been copied from the README.md file
bris-lib-checkout
This provides tidied up data from the Brisbane library checkouts
Retrieving and cleaning the data
The script for retrieving and cleaning the data is made available in scrape-library.R.
The data
data/
This contains four tidied up dataframes:
tidy-brisbane-library-checkout.csv contains the following columns, with the metadata file metadata_heading.csv describing each column.
knitr::kable(readr::read_csv("data/metadata_heading.csv"))
#> Parsed with column specification:
#> cols(
#> heading = col_character(),
#> heading_explanation = col_character()
#> )
heading | heading_explanation
--- | ---
Title | Title of Item
Author | Author of Item
Call Number | Call Number of Item
Item id | Unique Item Identifier
Item Type | Type of Item (see next column)
Status | Current Status of Item
Language | Published language of item (if not English)
Age | Suggested audience
Checkout Library | Checkout branch
Date | Checkout date
We also added year, month, and day columns.
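A hedged sketch of how such columns can be derived (not necessarily the code in scrape-library.R; note the parsing output further below shows day as character, so the real column may hold a weekday name rather than a day-of-month):

library(dplyr)
library(lubridate)
checkouts <- readr::read_csv("data/bris-lib-checkout.csv")
checkouts <- checkouts %>%
  mutate(year = year(datetime),                # calendar year of checkout
         month = month(datetime),              # month number
         day = wday(datetime, label = TRUE))   # weekday name (assumption)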
The remaining data are all metadata files that contain meta information on the columns in the checkout data:
library(tidyverse)
#> ── Attaching packages ────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
#> ✔ tibble 1.4.99.9006 ✔ dplyr 0.7.8
#> ✔ tidyr 0.8.2 ✔ stringr 1.3.1
#> ✔ readr 1.3.0 ✔ forcats 0.3.0
#> ── Conflicts ───────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
knitr::kable(readr::read_csv("data/metadata_branch.csv"))
#> Parsed with column specification:
#> cols(
#> branch_code = col_character(),
#> branch_heading = col_character()
#> )
branch_code | branch_heading
--- | ---
ANN | Annerley
ASH | Ashgrove
BNO | Banyo
BRR | BrackenRidge
BSQ | Brisbane Square Library
BUL | Bulimba
CDA | Corinda
CDE | Chermside
CNL | Carindale
CPL | Coopers Plains
CRA | Carina
EPK | Everton Park
FAI | Fairfield
GCY | Garden City
GNG | Grange
HAM | Hamilton
HPK | Holland Park
INA | Inala
IPY | Indooroopilly
MBG | Mt. Coot-tha
MIT | Mitchelton
MTG | Mt. Gravatt
MTO | Mt. Ommaney
NDH | Nundah
NFM | New Farm
SBK | Sunnybank Hills
SCR | Stones Corner
SGT | Sandgate
VAN | Mobile Library
TWG | Toowong
WND | West End
WYN | Wynnum
ZIL | Zillmere
knitr::kable(readr::read_csv("data/metadata_item_type.csv"))
#> Parsed with column specification:
#> cols(
#> item_type_code = col_character(),
#> item_type_explanation = col_character()
#> )
item_type_code | item_type_explanation
--- | ---
AD-FICTION | Adult Fiction
AD-MAGS | Adult Magazines
AD-PBK | Adult Paperback
BIOGRAPHY | Biography
BSQCDMUSIC | Brisbane Square CD Music
BSQCD-ROM | Brisbane Square CD Rom
BSQ-DVD | Brisbane Square DVD
CD-BOOK | Compact Disc Book
CD-MUSIC | Compact Disc Music
CD-ROM | CD Rom
DVD | DVD
DVD_R18+ | DVD Restricted - 18+
FASTBACK | Fastback
GAYLESBIAN | Gay and Lesbian Collection
GRAPHICNOV | Graphic Novel
ILL | InterLibrary Loan
JU-FICTION | Junior Fiction
JU-MAGS | Junior Magazines
JU-PBK | Junior Paperback
KITS | Kits
LARGEPRINT | Large Print
LGPRINTMAG | Large Print Magazine
LITERACY | Literacy
LITERACYAV | Literacy Audio Visual
LOCSTUDIES | Local Studies
LOTE-BIO | Languages Other than English Biography
LOTE-BOOK | Languages Other than English Book
LOTE-CDMUS | Languages Other than English CD Music
LOTE-DVD | Languages Other than English DVD
LOTE-MAG | Languages Other than English Magazine
LOTE-TB | Languages Other than English Taped Book
MBG-DVD | Mt Coot-tha Botanical Gardens DVD
MBG-MAG | Mt Coot-tha Botanical Gardens Magazine
MBG-NF | Mt Coot-tha Botanical Gardens Non Fiction
MP3-BOOK | MP3 Audio Book
NONFIC-SET | Non Fiction Set
NONFICTION | Non Fiction
PICTURE-BK | Picture Book
PICTURE-NF | Picture Book Non Fiction
PLD-BOOK | Public Libraries Division Book
YA-FICTION | Young Adult Fiction
YA-MAGS | Young Adult Magazine
YA-PBK | Young Adult Paperback
Example usage
Let’s explore the data
bris_libs <- readr::read_csv("data/bris-lib-checkout.csv")
#> Parsed with column specification:
#> cols(
#> title = col_character(),
#> author = col_character(),
#> call_number = col_character(),
#> item_id = col_double(),
#> item_type = col_character(),
#> status = col_character(),
#> language = col_character(),
#> age = col_character(),
#> library = col_character(),
#> date = col_double(),
#> datetime = col_datetime(format = ""),
#> year = col_double(),
#> month = col_double(),
#> day = col_character()
#> )
#> Warning: 20 parsing failures.
#> row col expected actual file
#> 587795 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 590579 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 590597 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 595774 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 597567 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> ...... ....... ........ ....... ............................
#> See problems(...) for more details.
We can count the number of titles, item types, suggested ages, and checkout libraries:
library(dplyr)
count(bris_libs, title, sort = TRUE)
#> # A tibble: 121,046 x 2
#> title n
#>
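The metadata files can be joined on to decode the branch and item-type codes. A hedged example (assuming the library column holds the branch code):

branches <- readr::read_csv("data/metadata_branch.csv")
bris_libs %>%
  dplyr::left_join(branches, by = c("library" = "branch_code")) %>%
  count(branch_heading, sort = TRUE)   # checkouts per branch, by full name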
License
This data is provided under a CC BY 4.0 license. It has been downloaded from Brisbane library checkouts and tidied up using the code in data-raw.
The data this week comes from Adam Vagnar, who also blogged about this dataset. There's a LOT of data here: match-level results, player details, and match-level statistics for some matches. In this dataset all matches are played 2 vs 2, so there are columns for 2 winners (1 team) and 2 losers (1 team). The data is relatively clean and ready for analysis, although there are some duplicated columns and the data is wide because each team has 2 players.
Check out the data dictionary, or Wikipedia for some longer-form details around what the various match statistics mean.
Most of the data is from the international FIVB tournaments, but about one-third is from the US-centric AVP.
The FIVB Beach Volleyball World Tour (known between 2003 and 2012 as the FIVB Beach Volleyball Swatch World Tour for sponsorship reasons) is the worldwide professional beach volleyball tour for both men and women organized by the Fédération Internationale de Volleyball (FIVB). The World Tour was introduced for men in 1989 while the women first competed in 1992.
Winning the World Tour is considered to be one of the highest honours in international beach volleyball, being surpassed only by the World Championships, and the Beach Volleyball tournament at the Summer Olympic Games.
FiveThirtyEight examined the disadvantage of serving in beach volleyball, although they used Olympic-level data. Again, Adam Vagnar also covered this data on his blog.
TidyTuesday: A weekly data project aimed at the R ecosystem. As this project was borne out of the R4DS Online Learning Community and the R for Data Science textbook, an emphasis was placed on understanding how to summarize and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem. However, any code-based methodology is welcome - just please remember to share the code used to generate the results.
Join the R4DS Online Learning Community in the weekly #TidyTuesday event! Every week we post a raw dataset, a chart or article related to that dataset, and ask you to explore the data. While the dataset will be “tamed”, it will not always be tidy!
We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.
The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data.
This is a spatial dataset showing the location of Blue Flag beaches across Wales. 2018 marked the 30th year of the Blue Flag Award in Wales, which is generally considered the 'gold standard' for beaches across the world. The Blue Flag Programme is owned by the non-governmental, non-profit organisation 'Foundation for Environmental Education' (FEE). The Blue Flag Programme was started in France in 1985. It has been operating in Europe since 1987 and in areas outside of Europe since 2001. The programme is currently in operation in 46 countries across the world. In Wales, the award is managed by Keep Wales Tidy.
When working with data, you often spend most of your time cleaning it. Learn how to write more efficient data-cleaning code using the tidyverse in R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository houses the code and data for simulations that apply multiple regression analysis models to biased occurrence data to detect thermophilization.
This R code simulates the application of a multiple regression analysis model to biased occurrence data to detect thermophilization.
Note: to save running time, we used a parallel computation approach (run time of approximately 30 minutes). Since seven CPUs were used, an equal or greater number of CPUs would be required to reproduce the same results.
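A minimal sketch of the snowfall pattern implied above (not the repository's actual code; the per-iteration function is a placeholder):

library(snowfall)
sfInit(parallel = TRUE, cpus = 7)            # matches the seven CPUs noted above
run_iter <- function(i) mean(rnorm(1000))    # placeholder for one simulation run
sfExport("run_iter")                         # make the function visible to workers
results <- sfLapply(seq_len(100), run_iter)  # run iterations across the cluster
sfStop()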
Simulation-generated distribution data for fictitious species. The column names are explained below.
Column Names | Explanation
--- | ---
IndID | Unique individual identification number
SpeciesID | Unique identification number for the species to which the individual belongs
Step | Step in which the individual exists
LTI | Local Temperature Index (LTI) of the location where the individual occurred
SpeciesLTICenter | Central value of the species-specific LTI at that Step
Prob.BiasToWarm | Weighting value used in sampling when Bias to Warm is present
Prob.BiasToCold | Weighting value used in sampling when Bias to Cold is present
The result of extracting 2,000 biased occurrence records from the distribution data.
Column Names | Explanation
--- | ---
IndID | Unique identification number of the extracted individual
SpeciesID | Unique identification number for the species to which the individual belongs
Step | Step in which the individual was extracted
LTI | Local Temperature Index (LTI) of the location where the individual occurred
EstSTI | Species Temperature Index (STI) of the recorded species, calculated on the basis of the occurrence data
BiasType | The type of bias
iter | The iteration number
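For illustration only, a hedged sketch of the kind of per-species regression the description implies, using the columns above (the file name is hypothetical and this is not the repository's code):

library(tidyverse)
library(broom)
occ <- read_csv("extracted_occurrences.csv")   # hypothetical file name
occ %>%
  group_by(SpeciesID, BiasType, iter) %>%
  group_modify(~ tidy(lm(LTI ~ Step, data = .x))) %>%  # slope of LTI over Steps
  filter(term == "Step")   # a positive slope would be consistent with thermophilization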
This simulation code uses the following packages:

{tidyverse} package:
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

{broom} package:
Robinson D, Hayes A, Couch S (2024). _broom: Convert Statistical Objects into Tidy Tibbles_. R package version 1.0.7, <https://github.com/tidymodels/broom>.

{rlist} package:
Ren K (2021). _rlist: A Toolbox for Non-Tabular Data Manipulation_. R package version 0.4.6.2, <https://CRAN.R-project.org/package=rlist>.

{data.table} package:
Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). _data.table: Extension of `data.frame`_. R package version 1.15.4, <https://CRAN.R-project.org/package=data.table>.

{snowfall} package:
Knaus J (2023). _snowfall: Easier Cluster Computing (Based on 'snow')_. R package version 1.84-6.3, <https://CRAN.R-project.org/package=snowfall>.

{magrittr} package:
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.

{ggpmisc} package:
Aphalo P (2024). _ggpmisc: Miscellaneous Extensions to 'ggplot2'_. R package version 0.5.6, <https://CRAN.R-project.org/package=ggpmisc>.

{effsize} package:
Torchiano M (2020). _effsize: Efficient Effect Size Computation_. doi:10.5281/zenodo.1480624 <https://doi.org/10.5281/zenodo.1480624>, R package version 0.8.1, <https://CRAN.R-project.org/package=effsize>.

{conflicted} package:
Wickham H (2023). _conflicted: An Alternative Conflict Resolution Strategy_. R package version 1.2.0, <https://CRAN.R-project.org/package=conflicted>.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. The British statistician Ronald Fisher introduced the Iris flower data set in 1936. Fisher published a paper that described the use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used in the paper: Do Current Language Models Support Code Intelligence for Programming Language?
This dataset contains code snippets from R programming language repositories on GitHub, paired with their corresponding natural language (NL) descriptions. It was created for research in software engineering tasks like code summarization and code search. The data was collected using the GitHub REST API and includes over 1,500 public R repositories. To ensure quality, only active, well-structured R packages with proper documentation were included. Roxygen2, a popular documentation framework, was used to extract both the code and its matching NL descriptions.
The dataset is organized into three parts: base R functions (Base), functions from the tidyverse (Tidy), and a combined set (RCombine). The dataset follows the CodeSearchNet format, with a split for training, validation, and testing data, ensuring no duplicate functions.
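A hedged sketch of loading one split in R: the file name and the code/docstring field names follow the usual CodeSearchNet convention of JSON-Lines records, and are assumptions not verified against this dataset:

library(jsonlite)
train <- stream_in(file("train.jsonl"), verbose = FALSE)  # one JSON record per line
head(train$docstring)   # the NL descriptions paired with each R function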
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When students learn linear regression, they must learn to use diagnostics to check and improve their models. Model-building is an expert skill requiring the interpretation of diagnostic plots, an understanding of model assumptions, the selection of appropriate changes to remedy problems, and an intuition for how potential problems may affect results. Simulation offers opportunities to practice these skills, and is already widely used to teach important concepts in sampling, probability, and statistical inference. Visual inference, which uses simulation, has also recently been applied to regression instruction. This article presents the regressinator, an R package designed to facilitate simulation and visual inference in regression settings. Simulated regression problems can be easily defined with minimal programming, using the same modeling and plotting code students may already learn. The simulated data can then be used for model diagnostics, visual inference, and other activities, with the package providing functions to facilitate common tasks with a minimum of programming. Example activities covering model diagnostics, statistical power, and model selection are shown for both advanced undergraduate and Ph.D.-level regression courses.
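A hedged base-R illustration of the simulation idea the abstract describes (this is deliberately not the regressinator's own API):

# Simulate a regression problem that violates linearity, then inspect residuals
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 3)   # true relationship is quadratic
fit <- lm(y ~ x)                          # but a linear model is fitted
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # curvature exposes the misfit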
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the input and output data processed by the R scripts for the swelling manuscript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label | Data type | Description
--- | --- | ---
isogramy | int | The order of isogramy, e.g. "2" is a second-order isogram
length | int | The length of the word in letters
word | text | The actual word/isogram in ASCII
source_pos | text | The Part of Speech tag from the original corpus
count | int | Token count (total number of occurrences)
vol_count | int | Volume count (number of different sources which contain the word)
count_per_million | int | Token count per million words
vol_count_as_percent | int | Volume count as percentage of the total number of volumes
is_palindrome | bool | Whether the word is a palindrome (1) or not (0)
is_tautonym | bool | Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label | Data type | Description
--- | --- | ---
!total_1grams | int | The total number of words in the corpus
!total_volumes | int | The total number of volumes (individual sources) in the corpus
!total_isograms | int | The total number of isograms found in the corpus (before compacting)
!total_palindromes | int | How many of the isograms found are palindromes
!total_tautonyms | int | How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
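A hedged sketch of querying the database from R with RSQLite (the table name "ngrams" is an assumption; check create-database.sql for the actual schema):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
# Ten most frequent palindromic isograms, using the columns described above
top_palindromes <- dbGetQuery(con,
  "SELECT word, length, count FROM ngrams
   WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10")
dbDisconnect(con)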