Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
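As a hedged illustration of the tutorial's theme (not its actual R-script), a wide spreadsheet-style table can be reshaped into tidy long format with tidyr; the column names below are hypothetical:

library(tidyr)
# Hypothetical wide table: one row per sample, one column per measurement day
wide <- data.frame(sample = c("A", "B"),
                   day1 = c(1.2, 0.9),
                   day2 = c(1.8, 1.4))
# Tidy version: one row per observation
tidy <- pivot_longer(wide, cols = starts_with("day"),
                     names_to = "day", values_to = "value")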
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
zixiao/R-tidy-base dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ángela Castillo-Gill
Released under CC0: Public Domain
This repository tracks and documents the R code used to transform the .lift XML export of the Enggano learner's dictionary FLEx project into a tidy tabular form in three file formats: .rds (R data file), .csv, and .tsv. This repository is synced from its original GitHub repository at https://github.com/engganolang/eno-learner-lift. Check that GitHub repository for further updates.
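For orientation, a hedged sketch (not the repository's code) of how a .lift export can be read into a tidy table with xml2. LIFT stores each entry as an <entry> element with a <lexical-unit><form><text> headword, but the file name and XPaths here are assumptions:

library(xml2)
library(tibble)
lift <- read_xml("eno-learner.lift")   # hypothetical file name
entries <- xml_find_all(lift, ".//entry")
dict <- tibble(
  id = xml_attr(entries, "id"),        # entry identifier attribute
  headword = xml_text(xml_find_first(entries, ".//lexical-unit/form/text"))
)
saveRDS(dict, "eno-learner.rds")       # the repository also writes .csv and .tsv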
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This has been copied from the README.md file
bris-lib-checkout
This provides tidied up data from the Brisbane library checkouts
Retrieving and cleaning the data
The script for retrieving and cleaning the data is made available in scrape-library.R.
The data
data/
This contains four tidied up dataframes:
tidy-brisbane-library-checkout.csv contains the following columns, with the metadata file metadata_heading.csv describing each column.
knitr::kable(readr::read_csv("data/metadata_heading.csv"))
#> Parsed with column specification:
#> cols(
#> heading = col_character(),
#> heading_explanation = col_character()
#> )
heading | heading_explanation
--- | ---
Title | Title of Item
Author | Author of Item
Call Number | Call Number of Item
Item id | Unique Item Identifier
Item Type | Type of Item (see next column)
Status | Current Status of Item
Language | Published language of item (if not English)
Age | Suggested audience
Checkout Library | Checkout branch
Date | Checkout date
We also added year, month, and day columns.
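A hedged sketch of how such columns can be derived (not necessarily the code in scrape-library.R; note the parsing output further below shows day as character, so the real column may hold a weekday name rather than a day-of-month):

library(dplyr)
library(lubridate)
checkouts <- readr::read_csv("data/bris-lib-checkout.csv")
checkouts <- checkouts %>%
  mutate(year = year(datetime),                # calendar year of checkout
         month = month(datetime),              # month number
         day = wday(datetime, label = TRUE))   # weekday name (assumption)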
The remaining data are all metadata files that contain meta information on the columns in the checkout data:
library(tidyverse)
#> ── Attaching packages ────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
#> ✔ tibble 1.4.99.9006 ✔ dplyr 0.7.8
#> ✔ tidyr 0.8.2 ✔ stringr 1.3.1
#> ✔ readr 1.3.0 ✔ forcats 0.3.0
#> ── Conflicts ───────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
knitr::kable(readr::read_csv("data/metadata_branch.csv"))
#> Parsed with column specification:
#> cols(
#> branch_code = col_character(),
#> branch_heading = col_character()
#> )
branch_code | branch_heading
--- | ---
ANN | Annerley
ASH | Ashgrove
BNO | Banyo
BRR | BrackenRidge
BSQ | Brisbane Square Library
BUL | Bulimba
CDA | Corinda
CDE | Chermside
CNL | Carindale
CPL | Coopers Plains
CRA | Carina
EPK | Everton Park
FAI | Fairfield
GCY | Garden City
GNG | Grange
HAM | Hamilton
HPK | Holland Park
INA | Inala
IPY | Indooroopilly
MBG | Mt. Coot-tha
MIT | Mitchelton
MTG | Mt. Gravatt
MTO | Mt. Ommaney
NDH | Nundah
NFM | New Farm
SBK | Sunnybank Hills
SCR | Stones Corner
SGT | Sandgate
VAN | Mobile Library
TWG | Toowong
WND | West End
WYN | Wynnum
ZIL | Zillmere
knitr::kable(readr::read_csv("data/metadata_item_type.csv"))
#> Parsed with column specification:
#> cols(
#> item_type_code = col_character(),
#> item_type_explanation = col_character()
#> )
item_type_code | item_type_explanation
--- | ---
AD-FICTION | Adult Fiction
AD-MAGS | Adult Magazines
AD-PBK | Adult Paperback
BIOGRAPHY | Biography
BSQCDMUSIC | Brisbane Square CD Music
BSQCD-ROM | Brisbane Square CD Rom
BSQ-DVD | Brisbane Square DVD
CD-BOOK | Compact Disc Book
CD-MUSIC | Compact Disc Music
CD-ROM | CD Rom
DVD | DVD
DVD_R18+ | DVD Restricted - 18+
FASTBACK | Fastback
GAYLESBIAN | Gay and Lesbian Collection
GRAPHICNOV | Graphic Novel
ILL | InterLibrary Loan
JU-FICTION | Junior Fiction
JU-MAGS | Junior Magazines
JU-PBK | Junior Paperback
KITS | Kits
LARGEPRINT | Large Print
LGPRINTMAG | Large Print Magazine
LITERACY | Literacy
LITERACYAV | Literacy Audio Visual
LOCSTUDIES | Local Studies
LOTE-BIO | Languages Other than English Biography
LOTE-BOOK | Languages Other than English Book
LOTE-CDMUS | Languages Other than English CD Music
LOTE-DVD | Languages Other than English DVD
LOTE-MAG | Languages Other than English Magazine
LOTE-TB | Languages Other than English Taped Book
MBG-DVD | Mt Coot-tha Botanical Gardens DVD
MBG-MAG | Mt Coot-tha Botanical Gardens Magazine
MBG-NF | Mt Coot-tha Botanical Gardens Non Fiction
MP3-BOOK | MP3 Audio Book
NONFIC-SET | Non Fiction Set
NONFICTION | Non Fiction
PICTURE-BK | Picture Book
PICTURE-NF | Picture Book Non Fiction
PLD-BOOK | Public Libraries Division Book
YA-FICTION | Young Adult Fiction
YA-MAGS | Young Adult Magazine
YA-PBK | Young Adult Paperback
Example usage
Let’s explore the data
bris_libs <- readr::read_csv("data/bris-lib-checkout.csv")
#> Parsed with column specification:
#> cols(
#> title = col_character(),
#> author = col_character(),
#> call_number = col_character(),
#> item_id = col_double(),
#> item_type = col_character(),
#> status = col_character(),
#> language = col_character(),
#> age = col_character(),
#> library = col_character(),
#> date = col_double(),
#> datetime = col_datetime(format = ""),
#> year = col_double(),
#> month = col_double(),
#> day = col_character()
#> )
#> Warning: 20 parsing failures.
#> row col expected actual file
#> 587795 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 590579 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 590597 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 595774 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> 597567 item_id a double REFRESH 'data/bris-lib-checkout.csv'
#> ...... ....... ........ ....... ............................
#> See problems(...) for more details.
We can count the number of titles, item types, suggested ages, and checkout libraries:
library(dplyr)
count(bris_libs, title, sort = TRUE)
#> # A tibble: 121,046 x 2
#> title n
#>
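The metadata files can be joined on to decode the branch and item-type codes. A hedged example (assuming the library column holds the branch code):

branches <- readr::read_csv("data/metadata_branch.csv")
bris_libs %>%
  dplyr::left_join(branches, by = c("library" = "branch_code")) %>%
  count(branch_heading, sort = TRUE)   # checkouts per branch, by full name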
License
This data is provided under a CC BY 4.0 license. It has been downloaded from Brisbane library checkouts and tidied up using the code in data-raw.
The data this week comes from Adam Vagnar, who also blogged about this dataset. There's a LOT of data here: match-level results, player details, and match-level statistics for some matches. In this dataset all matches are played 2 vs 2, so there are columns for 2 winners (1 team) and 2 losers (1 team). The data is relatively clean and ready for analysis, although there are some duplicated columns and the data is wide because each team has 2 players.
Check out the data dictionary, or Wikipedia for some longer-form details around what the various match statistics mean.
Most of the data is from the international FIVB tournaments, but about one-third is from the US-centric AVP.
The FIVB Beach Volleyball World Tour (known between 2003 and 2012 as the FIVB Beach Volleyball Swatch World Tour for sponsorship reasons) is the worldwide professional beach volleyball tour for both men and women organized by the Fédération Internationale de Volleyball (FIVB). The World Tour was introduced for men in 1989 while the women first competed in 1992.
Winning the World Tour is considered to be one of the highest honours in international beach volleyball, being surpassed only by the World Championships, and the Beach Volleyball tournament at the Summer Olympic Games.
FiveThirtyEight examined the disadvantage of serving in beach volleyball, although they used Olympic-level data. Again, Adam Vagnar also covered this data on his blog.
TidyTuesday: A weekly data project aimed at the R ecosystem. As this project was borne out of the R4DS Online Learning Community and the R for Data Science textbook, an emphasis was placed on understanding how to summarize and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem. However, any code-based methodology is welcome - just please remember to share the code used to generate the results.
Join the R4DS Online Learning Community in the weekly #TidyTuesday event! Every week we post a raw dataset, a chart or article related to that dataset, and ask you to explore the data. While the dataset will be “tamed”, it will not always be tidy!
We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.
The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data.
This is a spatial dataset showing the location of Blue Flag beaches across Wales. 2018 marked the 30th year of the Blue Flag Award in Wales, which is generally considered the 'gold standard' for beaches across the world. The Blue Flag Programme is owned by the non-governmental, non-profit organisation 'Foundation for Environmental Education' (FEE). The Blue Flag Programme was started in France in 1985. It has been operating in Europe since 1987 and in areas outside of Europe since 2001. The programme is currently in operation in 46 countries across the world. In Wales, the award is managed by Keep Wales Tidy.
When working with data, you often spend most of your time cleaning it. Learn how to write more efficient data-cleaning code using the tidyverse in R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository houses the code and data for simulations that apply multiple regression analysis models to biased occurrence data to detect thermophilization.
This R code simulates the application of a multiple regression analysis model to biased occurrence data to detect thermophilization.
Note: to save running time, we used a parallel computation approach (run time of approximately 30 minutes). Since seven CPUs were used, an equal or greater number of CPUs would be required to reproduce the same results.
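A minimal sketch of the snowfall pattern implied above (not the repository's actual code; the per-iteration function is a placeholder):

library(snowfall)
sfInit(parallel = TRUE, cpus = 7)            # matches the seven CPUs noted above
run_iter <- function(i) mean(rnorm(1000))    # placeholder for one simulation run
sfExport("run_iter")                         # make the function visible to workers
results <- sfLapply(seq_len(100), run_iter)  # run iterations across the cluster
sfStop()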
Simulation-generated distribution data for fictitious species. The column names are explained below.
Column Names | Explanation
--- | ---
IndID | Unique individual identification number
SpeciesID | Unique identification number for the species to which the individual belongs
Step | Step in which the individual exists
LTI | Local Temperature Index (LTI) of the location where the individual occurred
SpeciesLTICenter | Central value of the species-specific LTI at that Step
Prob.BiasToWarm | Weighting value used in sampling when Bias to Warm is present
Prob.BiasToCold | Weighting value used in sampling when Bias to Cold is present
The result of extracting 2,000 biased occurrence records from the distribution data.
Column Names | Explanation
--- | ---
IndID | Unique identification number of the extracted individual
SpeciesID | Unique identification number for the species to which the individual belongs
Step | Step in which the individual was extracted
LTI | Local Temperature Index (LTI) of the location where the individual occurred
EstSTI | Species Temperature Index (STI) of the recorded species, calculated on the basis of the occurrence data
BiasType | The type of bias
iter | The iteration number
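For illustration only, a hedged sketch of the kind of per-species regression the description implies, using the columns above (the file name is hypothetical and this is not the repository's code):

library(tidyverse)
library(broom)
occ <- read_csv("extracted_occurrences.csv")   # hypothetical file name
occ %>%
  group_by(SpeciesID, BiasType, iter) %>%
  group_modify(~ tidy(lm(LTI ~ Step, data = .x))) %>%  # slope of LTI over Steps
  filter(term == "Step")   # a positive slope would be consistent with thermophilization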
This simulation code uses the following packages:

{tidyverse} package:
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

{broom} package:
Robinson D, Hayes A, Couch S (2024). _broom: Convert Statistical Objects into Tidy Tibbles_. R package version 1.0.7, <https://github.com/tidymodels/broom>.

{rlist} package:
Ren K (2021). _rlist: A Toolbox for Non-Tabular Data Manipulation_. R package version 0.4.6.2, <https://CRAN.R-project.org/package=rlist>.

{data.table} package:
Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). _data.table: Extension of `data.frame`_. R package version 1.15.4, <https://CRAN.R-project.org/package=data.table>.

{snowfall} package:
Knaus J (2023). _snowfall: Easier Cluster Computing (Based on 'snow')_. R package version 1.84-6.3, <https://CRAN.R-project.org/package=snowfall>.

{magrittr} package:
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.

{ggpmisc} package:
Aphalo P (2024). _ggpmisc: Miscellaneous Extensions to 'ggplot2'_. R package version 0.5.6, <https://CRAN.R-project.org/package=ggpmisc>.

{effsize} package:
Torchiano M (2020). _effsize: Efficient Effect Size Computation_. doi:10.5281/zenodo.1480624 <https://doi.org/10.5281/zenodo.1480624>, R package version 0.8.1, <https://CRAN.R-project.org/package=effsize>.

{conflicted} package:
Wickham H (2023). _conflicted: An Alternative Conflict Resolution Strategy_. R package version 1.2.0, <https://CRAN.R-project.org/package=conflicted>.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. The British statistician Ronald Fisher introduced the Iris flower data set in 1936. Fisher published a paper that described the use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used in the paper: Do Current Language Models Support Code Intelligence for Programming Language?
This dataset contains code snippets from R programming language repositories on GitHub, paired with their corresponding natural language (NL) descriptions. It was created for research in software engineering tasks like code summarization and code search. The data was collected using the GitHub REST API and includes over 1,500 public R repositories. To ensure quality, only active, well-structured R packages with proper documentation were included. Roxygen2, a popular documentation framework, was used to extract both the code and its matching NL descriptions.
The dataset is organized into three parts: base R functions (Base), functions from the tidyverse (Tidy), and a combined set (RCombine). The dataset follows the CodeSearchNet format, with a split for training, validation, and testing data, ensuring no duplicate functions.
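A hedged sketch of loading one split in R: the file name and the code/docstring field names follow the usual CodeSearchNet convention of JSON-Lines records, and are assumptions not verified against this dataset:

library(jsonlite)
train <- stream_in(file("train.jsonl"), verbose = FALSE)  # one JSON record per line
head(train$docstring)   # the NL descriptions paired with each R function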
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When students learn linear regression, they must learn to use diagnostics to check and improve their models. Model-building is an expert skill requiring the interpretation of diagnostic plots, an understanding of model assumptions, the selection of appropriate changes to remedy problems, and an intuition for how potential problems may affect results. Simulation offers opportunities to practice these skills, and is already widely used to teach important concepts in sampling, probability, and statistical inference. Visual inference, which uses simulation, has also recently been applied to regression instruction. This article presents the regressinator, an R package designed to facilitate simulation and visual inference in regression settings. Simulated regression problems can be easily defined with minimal programming, using the same modeling and plotting code students may already learn. The simulated data can then be used for model diagnostics, visual inference, and other activities, with the package providing functions to facilitate common tasks with a minimum of programming. Example activities covering model diagnostics, statistical power, and model selection are shown for both advanced undergraduate and Ph.D.-level regression courses.
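A hedged base-R illustration of the simulation idea the abstract describes (this is deliberately not the regressinator's own API):

# Simulate a regression problem that violates linearity, then inspect residuals
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 3)   # true relationship is quadratic
fit <- lm(y ~ x)                          # but a linear model is fitted
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # curvature exposes the misfit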
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the input and output data processed by the R scripts for the swelling manuscript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label | Data type | Description
--- | --- | ---
isogramy | int | The order of isogramy, e.g. "2" is a second-order isogram
length | int | The length of the word in letters
word | text | The actual word/isogram in ASCII
source_pos | text | The Part of Speech tag from the original corpus
count | int | Token count (total number of occurrences)
vol_count | int | Volume count (number of different sources which contain the word)
count_per_million | int | Token count per million words
vol_count_as_percent | int | Volume count as percentage of the total number of volumes
is_palindrome | bool | Whether the word is a palindrome (1) or not (0)
is_tautonym | bool | Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label | Data type | Description
--- | --- | ---
!total_1grams | int | The total number of words in the corpus
!total_volumes | int | The total number of volumes (individual sources) in the corpus
!total_isograms | int | The total number of isograms found in the corpus (before compacting)
!total_palindromes | int | How many of the isograms found are palindromes
!total_tautonyms | int | How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
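A hedged sketch of querying the database from R with RSQLite (the table name "ngrams" is an assumption; check create-database.sql for the actual schema):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
# Ten most frequent palindromic isograms, using the columns described above
top_palindromes <- dbGetQuery(con,
  "SELECT word, length, count FROM ngrams
   WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10")
dbDisconnect(con)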