48 datasets found
  1. Chapter 1: Introduction to R and RStudio

    • qubeshub.org
    Updated Dec 23, 2020
    Cite
    Raisa Hernández-Pacheco; Alexis Diaz (2020). Chapter 1: Introduction to R and RStudio [Dataset]. http://doi.org/10.25334/FRPR-2J11
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    QUBES
    Authors
    Raisa Hernández-Pacheco; Alexis Diaz
    Description

    Biostatistics Using R: A Laboratory Manual was created with the goals of providing biological content for lab sessions using authentic research data and introducing the R programming language. Chapter 1 introduces R and RStudio.

  2. Python and R Basics for Environmental Data Sciences

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Tao Wen (2021). Python and R Basics for Environmental Data Sciences [Dataset]. https://search.dataone.org/view/sha256%3Aa4a66e6665773400ae76151d376607edf33cfead15ffad958fe5795436ff48ff
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Tao Wen
    Description

    This resource collects teaching materials originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor) and later refined/revised by Tao Wen for use in the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.

    This resource includes both R Notebooks and Python Jupyter Notebooks to teach the basics of R and Python coding, data analysis and data visualization, as well as building machine learning models in both programming languages by using authentic research data and questions. All of these R/Python scripts can be executed either on the CUAHSI JupyterHub or on your local machine.

    This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.

  3. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
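
    As a flavor of what the module covers, here is a minimal sketch (not taken from the module itself) of importing and plotting a time series from a .csv file in R; the file and column names are hypothetical:

    library(ggplot2)

    # Hypothetical file with 'date' and 'airTemp' columns
    temps <- read.csv("NEON_temperature.csv")
    temps$date <- as.Date(temps$date)  # parse dates for a proper time axis

    ggplot(temps, aes(x = date, y = airTemp)) +
      geom_line() +
      labs(x = "Date", y = "Air temperature (C)")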

  4. Data from: Working with a linguistic corpus using R: An introductory note with Indonesian Negating Construction

    • researchdata.edu.au
    Updated May 5, 2022
    Cite
    Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia (2022). Working with a linguistic corpus using R: An introductory note with Indonesian Negating Construction [Dataset]. http://doi.org/10.4225/03/5a7ee2ac84303
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is a repository for codes and datasets for the open-access paper in Linguistik Indonesia, the flagship journal for the Linguistic Society of Indonesia (Masyarakat Linguistik Indonesia [MLI]) (cf. the link in the references below).


    To cite the paper (in APA 6th style):

    Rajeg, G. P. W., Denistia, K., & Rajeg, I. M. (2018). Working with a linguistic corpus using R: An introductory note with Indonesian negating construction. Linguistik Indonesia, 36(1), 1–36. doi: 10.26499/li.v36i1.71


    To cite this repository:
    Click on Cite (the dark-pink button at the top left) and select a citation style from the dropdown (the default is the DataCite option, on the right-hand side).

    This repository consists of the following files:
    1. Source R Markdown Notebook (.Rmd file) used to write the paper and containing the R codes to generate the analyses in the paper.
    2. Tutorial to download the Leipzig Corpus file used in the paper. It is freely available on the Leipzig Corpora Collection Download page.
    3. Accompanying datasets as images and .rds format so that all code-chunks in the R Markdown file can be run.
    4. BibLaTeX and .csl files for the referencing and bibliography (with APA 6th style).
    5. A snippet of the R session info after running all codes in the R Markdown file.
    6. RStudio project file (.Rproj). Double click on this file to open an RStudio session associated with the content of this repository. See here and here for details on Project-based workflow in RStudio.
    7. A .docx template file following the basic stylesheet for Linguistik Indonesia

    Put all these files in the same folder (including the downloaded Leipzig corpus file)!

    To render the R Markdown into MS Word document, we use the bookdown R package (Xie, 2018). Make sure this package is installed in R.

    Yihui Xie (2018). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.6.
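
    As an illustration of that rendering step, a minimal call might look like this (a sketch, not code from the repository; the .Rmd filename is hypothetical):

    install.packages("bookdown")
    # word_document2() adds bookdown's cross-referencing to rmarkdown's Word output
    rmarkdown::render("paper.Rmd", output_format = bookdown::word_document2())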


  5. [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects

    • data.europa.eu
    • recerca.uoc.edu
    • +1 more
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7357747?locale=cs
    Available download formats: unknown (10386572)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explanation/Overview: Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for CollabTech 2022: Collaboration Technologies and Social Computing, published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.

    Purpose: The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

    Relatedness: The data of the different projects was derived from the forums of 7 Zooniverse projects with similar discussion board features: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

    Content: In this Zenodo entry, several files can be found. The structure is as follows (files, folders, and descriptions):

    corresponding_calculations.html - Quarto notebook, to view in the browser
    corresponding_calculations.qmd - Quarto notebook, to view in RStudio
    assets/data/
      annotations/annotations.csv - list of annotations made per day for each of the analysed projects
      comments/comments.csv - total list of comments with several data fields (i.e., comment id, text, reply_user_id)
      rolechanges/478_rolechanges.csv, 1104_rolechanges.csv, ... - list of roles per user, to determine the number of role changes
      totalnetworkdata/Edges/478_edges.csv, 1104_edges.csv, ... - network data (edge sets) for the given projects (without time slices)
      totalnetworkdata/Nodes/478_nodes.csv, 1104_nodes.csv, ... - network data (node sets) for the given projects (without time slices)
      trajectories/478/Edges/edges_4782016_q1.csv ... edges_4782016_q4.csv, trajectories/478/Nodes/nodes_4782016_q1.csv ... nodes_4782016_q4.csv, trajectories/1104/..., ... - network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
    scripts/datavizfuncs.R - script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
    scripts/import.R - script for the import of data, automatically executed from within corresponding_calculations.qmd
    corresponding_calculations_files - files for the html/qmd view in the browser/RStudio

    Grouping: The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure.
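
    As an illustration of how the node and edge sets could be loaded for SNA in R, here is a minimal sketch; the igraph package and the column layout of the CSV files are assumptions, not part of the dataset's documentation:

    library(igraph)

    # File names as listed above; the first columns are assumed to identify
    # the edge endpoints (edges) and the node ids (nodes)
    edges <- read.csv("assets/data/totalnetworkdata/Edges/478_edges.csv")
    nodes <- read.csv("assets/data/totalnetworkdata/Nodes/478_nodes.csv")

    g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
    sort(degree(g), decreasing = TRUE)[1:10]  # e.g. the ten best-connected users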

  6. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    isogramy (int) - The order of isogramy, e.g. "2" is a second-order isogram
    length (int) - The length of the word in letters
    word (text) - The actual word/isogram in ASCII
    source_pos (text) - The Part of Speech tag from the original corpus
    count (int) - Token count (total number of occurrences)
    vol_count (int) - Volume count (number of different sources which contain the word)
    count_per_million (int) - Token count per million words
    vol_count_as_percent (int) - Volume count as a percentage of the total number of volumes
    is_palindrome (bool) - Whether the word is a palindrome (1) or not (0)
    is_tautonym (bool) - Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    !total_1grams (int) - The total number of words in the corpus
    !total_volumes (int) - The total number of volumes (individual sources) in the corpus
    !total_isograms (int) - The total number of isograms found in the corpus (before compacting)
    !total_palindromes (int) - How many of the isograms found are palindromes
    !total_tautonyms (int) - How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
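
    Since the statistics script accesses the database via RSQLite, a minimal query from R could look like this (a sketch; the table name is hypothetical, see the database description above for the actual layout):

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "isograms.db")
    dbListTables(con)  # inspect which tables the database actually contains

    # Hypothetical table name; columns follow the layout documented above
    res <- dbGetQuery(con, "
      SELECT word, length, count
      FROM ngrams_isograms
      WHERE is_palindrome = 1
      ORDER BY count DESC
      LIMIT 10")
    dbDisconnect(con)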

  7. Data from: Highlighting health consequences of racial disparities sparks support for action

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Dec 6, 2023
    Cite
    Riana M. Brown; Pia Dietze; Maureen A. Craig (2023). Highlighting health consequences of racial disparities sparks support for action [Dataset]. http://doi.org/10.5061/dryad.cz8w9gj8t
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Riana M. Brown; Pia Dietze; Maureen A. Craig
    Time period covered
    Jan 1, 2023
    Description

    Racial disparities arise across many vital areas of American life, including employment, health, and interpersonal treatment. For example, 1 in 3 Black children live in poverty (vs. 1 in 9 White children) and, on average, Black Americans live 4 fewer years than White Americans. Which disparity is more likely to spark reduction efforts? We find that highlighting disparities in health-related (vs. economic) outcomes spurs greater social media engagement and support for disparity-mitigating policy. Further, reading about racial health disparities elicits greater support for action (e.g., protesting) than economic or belonging-based disparities. This occurs, in part, because people view health disparities as violating morally-sacred values, which enhances perceived injustice. This work elucidates which manifestations of racial inequality are most likely to prompt Americans to action. The data from Studies 1a, 1b, 3, 4a, and 4b were collected via online platforms (i.e., MTurk.com, Prolific Academic, and NORC's AmeriSpeak Panel). All analyses were run in R with the R code provided (title: Health_Disparities_Syntax.R).

    Highlighting Health Consequences of Racial Disparities Sparks Support for Action

    There are a total of 5 datasets available (Studies 1a, 1b, 3, 4a, 4b), each collected by the researchers from online survey platforms. All data files are .sav files. We recommend using SPSS or RStudio to work with the data. We provide our code using RStudio and a codebook with the names of all variables in each dataset.
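
    For R users, the .sav files can be read with the haven package (an assumption about tooling; the authors' own syntax file is Health_Disparities_Syntax.R). A minimal sketch with a hypothetical file name:

    library(haven)

    study1a <- read_sav("Study1a.sav")  # hypothetical file name
    head(study1a)  # variable names are documented in the codebook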

    Description of the data and file structure

    Study 1a and Study 1b utilized a within-subjects experimental design (S1a: N=191; S1b, preregistered: N=337, 50% White participants, 50% Black participants) where samples of U.S. citizens recruited from MTurk.com and Prolific Academic read nine examples of racial disparities, three each from the domains of health, economics, and belonging. After each example, participants reported whether the disparity was unjust and fair (reverse-coded; 2 items averaged to create a perceived injustice scale). Participants also indicated their agreement (1=s...

  8. Vehicle CAN bus data (with GPS)

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jan 24, 2020
    Cite
    Christian Kaiser (2020). Vehicle CAN bus data (with GPS) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2661315
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Christian Kaiser
    Alexander Stocker
    Andreas Festl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 20Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction.

    In contrast to the dataset published at https://zenodo.org/record/2658168#.XMw2m6JS9PY, this version includes GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math'). However, as the data was exported with single precision, we lost some precision for those GPS values.

    We are currently looking for a solution and will update the records if possible.

    For data analysis we use R and RStudio (https://www.rstudio.com/) and the library h5.

    For example, to check a file with R code:

    library(h5)
    # open one trip recording (HDF5 format)
    f <- h5file("file path/20181113_Driver1_Trip1.hdf")
    # summarise the yaw-rate signal from the 'CAN' group
    summary(f["CAN/Yawrate1"][,])
    # summarise the IMU latitude signal from the 'Math' group
    summary(f["Math/Latitude_IMU"][,])
    # close the file handle
    h5close(f)

  9. Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones

    • su.figshare.com
    • researchdata.se
    Updated May 30, 2023
    Cite
    Stefan Wiens; Malina Szychowska (2023). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones [Dataset]. http://doi.org/10.17045/sthlmuni.12582002.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Malina Szychowska
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The main results files are saved separately:
    - ASSR2.html: R output of the main analyses (N = 33)
    - ASSR2_subset.html: R output of the main analyses for the smaller sample (N = 25)

    FIGSHARE METADATA
    Categories: Biological psychology; Neuroscience and physiological psychology; Sensory processes, perception, and performance
    Keywords: crossmodal attention; electroencephalography (EEG); early-filter theory; task difficulty; envelope following response
    References:
    - https://doi.org/10.17605/OSF.IO/6FHR8
    - https://github.com/stamnosslin/mn
    - https://doi.org/10.17045/sthlmuni.4981154.v3
    - https://biosemi.com/
    - https://www.python.org/
    - https://mne.tools/stable/index.html#
    - https://www.r-project.org/
    - https://rstudio.com/products/rstudio/

    GENERAL INFORMATION
    1. Title of Dataset: Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones
    2. Author Information
       A. Principal Investigator: Stefan Wiens, Department of Psychology, Stockholm University, Sweden; https://www.su.se/profiles/swiens-1.184142; sws@psychology.su.se
       B. Associate or Co-investigator: Malina Szychowska, Department of Psychology, Stockholm University, Sweden; https://www.researchgate.net/profile/Malina_Szychowska; malina.szychowska@psychology.su.se
    3. Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.
    4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden
    5. Funding sources that supported the collection of the data: Swedish Research Council (Vetenskapsrådet) 2015-01181

    SHARING/ACCESS INFORMATION
    1. Licenses/restrictions placed on the data: CC BY 4.0
    2. Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript. The study was preregistered: https://doi.org/10.17605/OSF.IO/6FHR8
    3. Links to other publicly accessible locations of the data: N/A
    4. Links/relationships to ancillary data sets: N/A
    5. Was data derived from another source? No
    6. Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002

    DATA & FILE OVERVIEW
    The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.
    - ASSR2_experiment_scripts.zip: the Python files to run the experiment.
    - ASSR2_rawdata.zip: raw datafiles for each subject
      - data_EEG: EEG data in bdf format (generated by Biosemi)
      - data_log: logfiles of the EEG session (generated by Python)
    - ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG data
    - ASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with the Python-MNE scripts
    - ASSR2_R_scripts.zip: R scripts to analyze the data, together with the main datafiles. The main files in the folder are:
      - ASSR2.html: R output of the main analyses
      - ASSR2_subset.html: R output of the main analyses but after excluding eight subjects who were recorded as pilots before preregistering the study
    - ASSR2_results.zip: all figures and tables that are created by Python-MNE and R.

    METHODOLOGICAL INFORMATION
    1. Description of methods used for collection/generation of data: The auditory stimuli were amplitude-modulated tones with a carrier frequency (fc) of 500 Hz and modulation frequencies (fm) of 20.48 Hz, 40.96 Hz, or 81.92 Hz. The experiment was programmed in Python (https://www.python.org/) and used extra functions from https://github.com/stamnosslin/mn. The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see the linked publication.
    2. Methods for processing the data: We conducted frequency analyses and computed event-related potentials. See the linked publication.
    3. Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#; RStudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/; Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
    4. Standards and calibration information: see the linked publication.
    5. Environmental/experimental conditions: see the linked publication.
    6. Quality-assurance procedures performed on the data: see the linked publication.
    7. People involved with sample collection, processing, analysis and/or submission:
    - Data collection: Malina Szychowska with assistance from Jenny Arctaedius.
    - Data processing, analysis, and submission: Malina Szychowska and Stefan Wiens

    DATA-SPECIFIC INFORMATION
    All relevant information can be found in the MNE-Python and R scripts (in the EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.

  10. Data from: Macroevolutionary patterns in marine hermaphroditism

    • datadryad.org
    • data.niaid.nih.gov
    Updated Sep 21, 2022
    Cite
    George Colebrook Jarvis; Craig Robert White; Dustin J. Marshall (2022). Macroevolutionary patterns in marine hermaphroditism [Dataset]. http://doi.org/10.5061/dryad.76hdr7t0v
    Available download formats: zip
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Dryad
    Authors
    George Colebrook Jarvis; Craig Robert White; Dustin J. Marshall
    Time period covered
    Sep 9, 2022
    Description

    Data for many of the species in our dataset came from previously published meta-analyses on various marine invertebrate life-history traits (e.g., Marshall et al. 2012, Monro & Marshall 2015), supplemented with additional data on hermaphroditism and body size from the literature. Data were compiled in Microsoft Excel, and phylogenetic logistic regressions testing the covariance between hermaphroditism and life-history/latitude were analyzed with the 'phylolm' package v. 2.6.2 (Ho and Ané 2014) in RStudio v. 1.4.1717 (R Core Team 2021). We extracted our phylogenies from the Open Tree of Life (Hinchliff et al. 2015) with the package 'rotl' v. 3.0.11 (Michonneau et al. 2016) and constructed phylogenetic trees with the package 'phytools' v. 0.7-80 (Revell 2012). Branch lengths for the phylogeny were unknown, so we scaled branch lengths using Grafen's method (Grafen 1989) in the 'ape' package v. 5.6-1 (Paradis and Schliep 2019).
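
    A minimal sketch of that workflow (not the authors' script; the data frame and column names are hypothetical, and in practice the tree's tip labels must be matched to the data):

    library(rotl)     # Open Tree of Life
    library(ape)      # Grafen branch lengths
    library(phylolm)  # phylogenetic logistic regression

    # dat: hypothetical data frame with columns species, herm (0/1), body_size;
    # row names must match the tree's tip labels for phyloglm()
    taxa <- tnrs_match_names(unique(dat$species))
    tree <- tol_induced_subtree(ott_ids = ott_id(taxa))
    tree <- compute.brlen(tree, method = "Grafen")  # scale unknown branch lengths
    fit <- phyloglm(herm ~ log(body_size), data = dat, phy = tree)
    summary(fit)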

    Literature Cited

    Grafen, A. 1989. The phylogenetic regression....

  11. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

    • zenodo.org
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    GNU General Public License v2.0: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  12. ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-winter-olympics-prediction-fantasy-draft-picks-2684/07d15ca8/?iid=004-753&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Winter Olympics Prediction - Fantasy Draft Picks’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ericsbrown/winter-olympics-prediction-fantasy-draft-picks on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Olympic Draft Predictive Model

    Our family runs an Olympic Draft - similar to fantasy football or baseball - for each Olympic cycle. The purpose of this case study is to identify trends in medal count / point value to create a predictive analysis of which teams should be selected in which order.

    There are a few assumptions that will impact the final analysis:
    - Point value: each medal is worth the following: Gold - 6 points, Silver - 4 points, Bronze - 3 points.
    - The analysis reviews the last 10 Olympic cycles, Winter Olympics only.

    All GDP numbers are in USD

    My initial hypothesis is that larger GDP per capita and size of contingency are correlated with better points values for the Olympic draft.

    All Data pulled from the following Datasets:

    Winter Olympics Medal Count - https://www.kaggle.com/ramontanoeiro/winter-olympic-medals-1924-2018 Worldwide GDP History - https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2020&start=1984&view=chart

    GDP data was a wide format when downloaded from the World Bank. Opened file in Excel, removed irrelevant years, and saved as .csv.

    Process

    In RStudio utilized the following code to convert wide data to long:

    install.packages("tidyverse")
    library(tidyverse)
    library(tidyr)

    Converting to long data from wide

    long <- newgdpdata %>% gather(year, value, -c("Country Name","Country Code"))

    Completed these same steps for GDP per capita.

    Primary Key Creation

    The two databases use differing types of data, and there is not a good primary key to utilize. Used CONCAT to create a new key column in both, combining the year and country code into a unique identifier that matches between the datasets.

    SELECT *, CONCAT(year,country_code) AS "Primary" FROM medal_count

    Saved as new table "medals_w_primary"

    Utilized Excel to concatenate the primary key for GDP and GDP per capita utilizing:

    =CONCAT()

    Saved as new csv files.

    Uploaded all to SSMS.

    Contingent Size

    Next need to add contingent size.

    No existing database had this information. Pulled data from Wikipedia.

    2018 - no problem, pulled the existing table. 2014 - the table was not created. Pulled the information into Excel and needed to convert the country NAMES into country CODES.

    Created excel document with all ISO Country Codes. Items were broken down between both formats, either 2 or 3 letters. Example:

    AF/AFG

    Used =RIGHT(C1,3) to extract only the country codes.

    For the country participants list in 2014, copied source data from Wikipedia and pasted as plain text (not HTML).

    Items then showed as: Albania (2)

    Broke cells using "(" as the delimiter to separate country names and numbers, then find and replace to remove all parenthesis from this data.

    We were left with: Albania 2

    Used VLOOKUP to create correct country code: =VLOOKUP(A1,'Country Codes'!A:D,4,FALSE)

    This worked for almost all items with a few exceptions that didn't match. Based on nature and size of items, manually checked on which items were incorrect.

    Chinese Taipei 3 #N/A
    Great Britain 56 #N/A
    Virgin Islands 1 #N/A

    This was relatively easy to fix by adding corresponding line items to the Country Codes sheet to account for future variability in the country code names.

    Copied over to main sheet.

    Repeated this process for additional years.

    Once complete created sheet with all 10 cycles of data. In total there are 731 items.

    Data Cleaning

    Filtered by Country Code since this was an issue early on.

    Found a number of N/A Country Codes:

    Serbia and Montenegro, FR Yugoslavia, FR Yugoslavia, Czechoslovakia, Unified Team, Yugoslavia, Czechoslovakia, East Germany, West Germany, Soviet Union, Yugoslavia, Czechoslovakia, East Germany, West Germany, Soviet Union, Yugoslavia

    Appears to be issues with older codes, Soviet Union block countries especially. Referred to historical data and filled in these country codes manually. Codes found on iso.org.

    Filled all in. One issue that was more difficult is the Unified Team of 1992 and the Soviet Union. For simplicity, used the code for Russia: the GDP data does not recognize the Soviet Union and breaks the union down into constituent countries. Using Russia is a reasonable approximation for the analysis and the attempt to find trends.

    From here created a filter and scanned through the country names to ensure there were no obvious outliers. Found the following:

    Olympic Athletes from Russia[b] -- This is a one-off due to the recent PED controversy for Russia. Amended the Country Code to RUS to more accurately reflect the trends.

    Korea[a] and South Korea -- both were listed in 2018. This is due to the unified Korean team that competed. This is an outlier and does not warrant standing on its own as the 2022 Olympics will not have this team (as of this writing on 01/14/2022). Removed the COR country code item.

    Confirmed Primary Key was created for all entries.

    Ran minimum and maximum years, no unexpected values. Ran minimum and maximum Athlete numbers, no unexpected values. Confirmed length of columns for Country Code and Primary Key.

    No NULL values in any columns. Ready to import to SSMS.

    SQL work

    We now have 4 tables, joined together to create the master table:

    SELECT
      [OlympicDraft].[dbo].[medals_w_primary].[year],
      host_country,
      host_city,
      [OlympicDraft].[dbo].[medals_w_primary].[country_name],
      [OlympicDraft].[dbo].[medals_w_primary].[country_code],
      Gold, Silver, Bronze,
      [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
      [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
      Atheletes
    FROM medals_w_primary
    INNER JOIN gdp_w_primary
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
    INNER JOIN contingency_cleaned
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
    INNER JOIN convertedgdpdatapercapita
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
    ORDER BY year DESC

    This left us with the following table:

    [Screenshot of the joined table: https://i.imgur.com/tpNhiNs.png]

    Performed some basic cleaning tasks to ensure no outliers:

    Checked GDP numbers: 1992 North Korea shows as null. Updated this row with information from countryeconomy.com - $12,458,000,000

    Checked GDP per capita:

    1992 North Korea again missing. Updated this to $595, utilized same source.

    UPDATE [OlympicDraft].[dbo].[gdp_w_primary] SET [OlympicDraft].[dbo].[gdp_w_primary].[value] = 12458000000 WHERE [OlympicDraft].[dbo].[gdp_w_primary].[year_country] = '1992PRK'

    UPDATE [OlympicDraft].[dbo].[convertedgdpdatapercapita] SET [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita] = 595 WHERE [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year_country] = '1992PRK'

    Liechtenstein showed as an outlier, with GDP per capita at 180,366 in 2018. Confirmed this number is correct per the World Bank; it appears Liechtenstein does not often have athletes in the Winter Olympics. A quick SQL check shows that they fielded 3 athletes in 2018, with a Bronze medal being won. Initially this appears to be a good win/loss ratio.

    Finally, need to create a column that shows the total point value for each of these rows based on the above formula (6 points for Gold, 4 points for Silver, 3 points for Bronze).

    Updated query as follows:

    SELECT
      [OlympicDraft].[dbo].[medals_w_primary].[year],
      host_country,
      host_city,
      [OlympicDraft].[dbo].[medals_w_primary].[country_name],
      [OlympicDraft].[dbo].[medals_w_primary].[country_code],
      Gold, Silver, Bronze,
      [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
      [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
      Atheletes,
      (Gold*6) + (Silver*4) + (Bronze*3) AS 'Total_Points'
    FROM [OlympicDraft].[dbo].[medals_w_primary]
    INNER JOIN gdp_w_primary
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
    INNER JOIN contingency_cleaned
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
    INNER JOIN convertedgdpdatapercapita
      ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
    ORDER BY [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year]

    Spot checked, calculating correctly.

    Saved result as winter_olympics_study.csv.

    We can now see that all relevant information is in this table:

    [Screenshot of the final table: https://i.imgur.com/ceZvqCA.png]

    RStudio Work

    To continue our analysis, opened this CSV in RStudio.

    install.packages("tidyverse")
    library(tidyverse)
    library(ggplot2)
    install.packages("forecast")
    library(forecast)
    install.packages("GGally")
    library(GGally)
    install.packages("modelr")
    library(modelr)

    View(winter_olympic_study)

    Finding correlation between gdp_per_capita and Total_Points

    ggplot(data = winter_olympic_study) +
      geom_point(aes(x = gdp_per_capita, y = Total_Points, color = country_name)) +
      facet_wrap(~country_name)

    cor(winter_olympic_study$gdp_per_capita, winter_olympic_study$Total_Points, method = c("pearson"))

    Result is .347, showing a moderate correlation between these two figures.

    Looked next at GDP vs. Total_Points:

    ggplot(data = winter_olympic_study) +
      geom_point(aes(x = GDP, y = Total_Points, color = country_name)) +
      facet_wrap(~country_name)

    cor(winter_olympic_study$GDP, winter_olympic_study$Total_Points, method = c("pearson"))

    This resulted in 0.35, not a statistically significant difference from the GDP-per-capita correlation.

    Next looked at contingent size vs. total points:

    ggplot(data = winter_olympic_study) +
      geom_point(aes(x = Atheletes, y = Total_Points, color = country_name)) +

  13. Data and code for: Microalgae-blend tilapia feed eliminates fishmeal and fish oil, improves growth, and is cost viable

    • explore.openaire.eu
    • search.dataone.org
    • +1 more
    Updated Oct 19, 2020
    Cite
    Pallab Sarker; Anne Kapuscinski; Brandi McKuin; Devin Fitzgerald; Hannah Nash; Connor Greenwood (2020). Data and code for: Microalgae-blend tilapia feed eliminates fishmeal and fish oil, improves growth, and is cost viable [Dataset]. http://doi.org/10.6071/m3vd5v
    Dataset updated
    Oct 19, 2020
    Authors
    Pallab Sarker; Anne Kapuscinski; Brandi McKuin; Devin Fitzgerald; Hannah Nash; Connor Greenwood
    Description

    Code for bootstrap analysis of commodity and market prices

    We conducted non-parametric bootstraps in RStudio (v.1.2.5033) based on 10000 replicates, using the adjusted bootstrap percentile method to estimate the median and 95% confidence intervals of commodity and market prices for the formulated tilapia feed ingredients that we used in feed trials, drawn from a variety of sources (FAO, 2020; USDA, 2020a; USDA, 2020b; Alibaba, 2019; and Sigma Aldrich vitamin and mineral mixes created in lab; see Supplementary Methods and Tables S5 and S12 for more details). For the non-parametric bootstrap code, see the attached RStudio file entitled "bootstrap_confidence_intervals_4_24_2020.R" and supporting .csv files entitled "Corn_gluten_meal_annual_price.csv", "Fish_meal_annual_price.csv", "Soybean_meal_annual_price.csv", "Wheat_flour_annual_price.csv", "Fish_oil_annual_price.csv", "Lysine_annual_price.csv", "Choline_chloride_annual_price.csv", "DCP_annual_price.csv", "Nanno_meal_annual_price.csv", "Whole_schizo_annual_price.csv".

    Code for hedonic analysis of defatted N. oculata meal

    We conducted a hedonic analysis in RStudio to estimate the price of defatted N. oculata meal. The general methodology of hedonic analysis is described in Maisashvili et al. (2015). We used mixed-effects linear models using maximum likelihood methods (Bates et al., 2015; 2020). We selected crude protein, ether extract, methionine, and lysine as the key input variables in our defatted N. oculata meal model (see Eq. 3 for more details). For the code used in the hedonic analysis, see the attached RStudio file entitled "Nanno_meal_model_4_20_2020.R" and supporting dataframe .csv entitled "df_meal_CP_EE_plus_aminos.csv".

    Code for hedonic analysis of Schizochytrium sp.

    We conducted a hedonic analysis in RStudio to estimate the price of whole-cell Schizochytrium sp. We selected the top fatty acids (e.g. eicosapentaenoic acid, 20:5n-3; myristic acid, 14:0; palmitoleic acid, 16:1n-7; and palmitic acid, 16:0) present in both the commodity oils (vegetable and fish) and in Schizochytrium sp. that did not require an extrapolation (see Eq. 4 for more details). For the code used in the hedonic analysis, see the attached RStudio file entitled "Schizo_oil_model_4_23_2020.R" and supporting dataframe file entitled "df_scaled_oil_4_23_2020.csv".

    Code for Fig. 2

    For the code used to produce Fig. 2, see the attached RStudio file entitled "Fig_2_4_24_2020.R" and supporting dataframe file entitled "Feed_price.csv".

    References

    Alibaba, Product searches, Web accessed Nov. 4, 2019. Available at: https://www.alibaba.com/.
    Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 67, 1–48 (2015).
    Bates, D. et al. Linear Mixed-Effects Models using 'Eigen' and S4. (2020).
    FAO, GIEWS FMPA Tool (v.3.8.0): monitoring and analysis of food prices (Food and Agriculture Organization (FAO) of the United Nations, 2020). Available at: https://fpma.apps.fao.org/giews/food-prices/tool/public/#/dataset/international.
    Maisashvili, A. et al. The values of whole algae and lipid extracted algae meal for aquaculture. Algal Res. 9, 133–142 (2015).
    USDA, Agricultural marketing service, Custom reports (United States Department of Agriculture, 2020a). Available at: https://marketnews.usda.gov/mnp/ls-report-config.
    USDA, Wheat Data, Economic Research Service (United States Department of Agriculture, 2020b). Available at: https://www.ers.usda.gov/data-products/wheat-data/.

    Aquafeed manufacturers have reduced, but not fully eliminated, fishmeal and fish oil and are seeking cost-competitive replacements. We combined two commercially available microalgae to produce a high-performing fish-free feed for Nile tilapia (Oreochromis niloticus), the world's second largest group of farmed fish. We substituted protein-rich defatted biomass of Nannochloropsis oculata (left over after oil extraction for nutraceuticals) for fishmeal, and whole cells of docosahexaenoic acid (DHA)-rich Schizochytrium sp. for fish oil. Here, we provide the datasets and code that we used to estimate the price of fish-free experimental and reference diets of tilapia in the Scientific Reports manuscript entitled "Microalgae-blend tilapia feed eliminates fishmeal and fish oil, improves growth, and is cost viable". We include the RStudio and supporting .csv files for a hedonic analysis of defatted N. oculata meal and whole-cell Schizochytrium sp., non-parametric bootstraps to estimate the median and 95% confidence intervals of commodity and market prices for the formulated tilapia feed ingredients, and for Fig. 2 in the manuscript. The attached RStudio files and supporting dataframes are provided to ensure reproducibility of our study. There are no missing values in the input files. Please see the embedded comments in the code provided with the RStudio files....
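
    The adjusted bootstrap percentile method corresponds to BCa intervals in R's boot package (an assumption about the implementation; the authors' own code is in bootstrap_confidence_intervals_4_24_2020.R). A minimal sketch, with an assumed column name:

    library(boot)

    # One of the supporting files listed above; the price column name is assumed
    prices <- read.csv("Fish_meal_annual_price.csv")$price

    med <- function(x, i) median(x[i])  # statistic recomputed on each resample
    b <- boot(prices, med, R = 10000)   # 10000 replicates, as in the study
    boot.ci(b, type = "bca")            # adjusted (BCa) percentile intervals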

  14. Automotive CAN bus data: An Example Dataset from the AEGIS Big Data Project

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jul 8, 2020
    Cite
    Kaiser, Christian (2020). Automotive CAN bus data: An Example Dataset from the AEGIS Big Data Project [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3267183
    Dataset updated
    Jul 8, 2020
    Dataset provided by
    Festl, Andreas
    Stocker, Alexander
    Kaiser, Christian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you find an example research dataset for the automotive demonstrator within the "AEGIS - Advanced Big Data Value Chain for Public Safety and Personal Security" big data project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 732189. The time series data has been collected during trips conducted by three drivers driving the same vehicle in Austria.

    The dataset contains 20Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction.

    GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math') are included. However, as it had to be exported with single-precision, we lost some precision for those GPS values.

    For data analysis we use R and RStudio (https://www.rstudio.com/) and the library h5.

    For example, to check a file with R code:

    library(h5)
    # open one trip recording (HDF5 format)
    f <- h5file("file path/20181113_Driver1_Trip1.hdf")
    # summarise the yaw-rate signal from the 'CAN' group
    summary(f["CAN/Yawrate1"][,])
    # summarise the IMU latitude signal from the 'Math' group
    summary(f["Math/Latitude_IMU"][,])
    # close the file handle
    h5close(f)

  15. Population Pyramid Data and R Script for the US, States, and Counties 1970 - 2017

    • openicpsr.org
    Updated Jan 6, 2020
    + more versions
    Cite
    Nathanael Rosenheim (2020). Population Pyramid Data and R Script for the US, States, and Counties 1970 - 2017 [Dataset]. http://doi.org/10.3886/E117081V1
    Available download formats: delimited, zip
    Dataset updated
    Jan 6, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Nathanael Rosenheim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Counties, States
    Description

    Population pyramids provide a way to visualize the age and sex composition of a geographic region, such as a nation, state, or county. A standard population pyramid has two bar charts or histograms, one for the male population and one for the female population. The two charts mirror each other and are divided into 5-year age cohorts. The shape of a population pyramid provides insights into a region's fertility, mortality, and migration patterns. When a region has high fertility and mortality but low migration, the visualization will look like a pyramid. In many regions fertility and mortality have decreased between 1970 and 2017, as people live longer and women have fewer children. With lower fertility and mortality, population pyramids are shaped more like a pillar. When interpreting population pyramids for smaller areas (like counties), the most important force that shapes the pyramid is in- and out-migration (Wang and vom Hofe, 2006, p. 65). For smaller regions population pyramids can have unique shapes.

    This data archive provides the resources needed to generate population pyramids for the United States, individual states, and any county within the United States. Population pyramids usually require significant data cleaning and graph making skills to generate one pyramid. With this data archive the data cleaning has been completed and the R script provides reusable code to quickly generate graphs. The final output is an image file with six graphs on one page. The final layout makes it easy to compare changes in population age and sex composition for any state and any county in the US for 1970, 1980, 1990, 2000, 2010, and 2017.
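
    The archive's R script handles this; as a rough illustration of the underlying plotting trick (a sketch, not the archive's code; the data layout is hypothetical), one sex's counts are negated so the two bar charts mirror each other:

    library(ggplot2)

    # Hypothetical long-format input: one row per age cohort and sex
    pyr <- read.csv("county_age_sex.csv")  # columns: age, sex, pop
    pyr$pop <- ifelse(pyr$sex == "Male", -pyr$pop, pyr$pop)  # mirror males left

    ggplot(pyr, aes(x = age, y = pop, fill = sex)) +
      geom_col(width = 1) +
      coord_flip() +                      # age cohorts on the vertical axis
      scale_y_continuous(labels = abs) +  # hide the negation on the axis labels
      labs(x = "Age cohort", y = "Population")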

  16. Data from: Fertilization mode covaries with body size

    • datadryad.org
    • search.dataone.org
    • +1 more
    Updated Apr 5, 2023
    Cite
    George Jarvis; Dustin Marshall (2023). Fertilization mode covaries with body size [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc00
    Available download formats: zip
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    Dryad
    Authors
    George Jarvis; Dustin Marshall
    Time period covered
    Feb 11, 2023
    Description

    The evolution of internal fertilization has occurred repeatedly and independently across the tree of life. As it has evolved, internal fertilization has reshaped sexual selection and the covariances among sexual traits such as testes size and gamete traits. But it is unclear whether fertilization mode also shows evolutionary associations with traits other than primary sex traits. Theory predicts that fertilization mode and body size should covary, but formal tests with phylogenetic control are lacking. We used a phylogenetically-controlled approach to test the covariance between fertilization mode and adult body size (while accounting for latitude, offspring size, and offspring developmental mode) among 1,232 species of marine invertebrates from 3 phyla. Within all phyla, external fertilizers are consistently larger than internal fertilizers: the consequences of fertilization mode extend to traits that are only indirectly related to reproduction. We suspect that other traits may a...

  17. d

    Data from: Plant-pollinator specialization: Origin and measurement of...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Oct 26, 2021
    + more versions
    Cite
    Mannfred Boehm; Jill Jankowski; Quentin Cronk (2021). Plant-pollinator specialization: Origin and measurement of curvature [Dataset]. http://doi.org/10.5061/dryad.g1jwstqrr
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 26, 2021
    Dataset provided by
    Dryad
    Authors
    Mannfred Boehm; Jill Jankowski; Quentin Cronk
    Time period covered
    Sep 24, 2021
    Description

    This dataset is primarily an RStudio project. To run the .R files, we recommend first opening the .RProj file in RStudio and installing the here package. This will allow you to run all of the .R scripts without changing any of the working directories.
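    Although the project's scripts are not reproduced here, the here package (assuming that is the package meant) resolves file paths relative to the project root, which is why no working-directory changes are needed. A minimal sketch with hypothetical file names:

        # Minimal sketch of a here-based workflow: paths resolve relative
        # to the project root (the directory containing the .RProj file),
        # so no setwd() calls are needed. File names are hypothetical.
        library(here)

        curvature <- read.csv(here("data", "curvature_measurements.csv"))
        source(here("R", "fit_curvature_models.R"))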

  18. s

    Dataset supporting thesis "Effects of cognitive behavioural therapy,...

    • eprints.soton.ac.uk
    Updated Jun 27, 2025
    Cite
    Swanton, James; Bellato, Alessio; Lawrence, Pete (2025). Dataset supporting thesis "Effects of cognitive behavioural therapy, dialectical behavioural therapy, and paternal anxiety on youth emotional and behavioural dysregulation: A systematic review, meta-analysis, and longitudinal study" [Dataset]. http://doi.org/10.5258/SOTON/D3585
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    University of Southampton
    Authors
    Swanton, James; Bellato, Alessio; Lawrence, Pete
    Description

    This dataset contains:

    • A spreadsheet titled "MA_Intervention_Studies.xlsx" containing effect size data extracted from eligible studies for the systematic review and meta-analysis.
    • A spreadsheet titled "alspac_dataset_cleanedFZ.xlsx" containing cleaned and derived variables used in the longitudinal analysis of the Avon Longitudinal Study of Parents and Children (ALSPAC), including father anxiety scores and child emotional/behavioural outcomes.
    • R scripts for conducting a multilevel random-effects meta-analysis and generating funnel and forest plots.

    Dates of data collection: systematic review/meta-analysis, February 2024 – May 2025; longitudinal analysis, ALSPAC data originally collected between 1990 and 2010, with the analysis performed between January and June 2025.

    Data were manually extracted from peer-reviewed publications identified in the systematic review using predefined inclusion criteria. Effect sizes were calculated as standardised mean differences (Hedges' g). Composite effect sizes were computed for studies reporting multiple emotional dysregulation outcomes (e.g., anger and sadness), following Borenstein et al. (2009). Longitudinal data from ALSPAC were accessed through approved application procedures and cleaned using R.

    Software required to view/use the data: Microsoft Excel (or compatible spreadsheet software) for the spreadsheets; R and RStudio (with metafor, readxl, dplyr, and ggplot2) for running the analysis scripts and reproducing the meta-analysis figures.
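    The dataset's own scripts are not reproduced in this listing. As a rough sketch of a multilevel random-effects meta-analysis of the kind described, using the metafor and readxl packages named above (the column names g, g_var, study_id, and effect_id are hypothetical):

        # Illustrative sketch only: multilevel random-effects meta-analysis
        # on Hedges' g, with effects nested within studies. Column names
        # are hypothetical, not taken from the actual spreadsheet.
        library(readxl)
        library(metafor)

        es <- read_excel("MA_Intervention_Studies.xlsx")

        m <- rma.mv(yi = g, V = g_var,
                    random = ~ 1 | study_id / effect_id,  # effect within study
                    data = es)
        summary(m)

        forest(m)  # forest plot of effects and the pooled estimate
        funnel(m)  # funnel plot to inspect small-study effects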

  19. e

    Replication Data for: A Corpus Based Analysis of V2 Variation in West...

    • b2find.eudat.eu
    Updated Nov 2, 2023
    + more versions
    Cite
    (2023). Replication Data for: A Corpus Based Analysis of V2 Variation in West Flemish and French Flemish Dialects - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/dbfbd743-f9f6-5e06-ba7f-768ebdcf3f2c
    Explore at:
    Dataset updated
    Nov 2, 2023
    Area covered
    French
    Description

    Dataset abstract: The dataset includes an annotated dataset of N = 1,413 sentences (or parts thereof) taken from authentic spoken corpus data from West Flemish and French Flemish (dialects of Dutch). The sentences are annotated for V2 variation (subject-verb inversion, the outcome variable of the associated study) and seven predictor variables, including city, region, prosodic integration, form and function of the topicalized constituent, form of the subject, and the number of constituents in the prefield. The dataset also includes geographical data to create a dialect map showing the relative frequencies of V2 variation. An R Notebook with the data analysis is provided.

    Article abstract: This paper explores V2 variation in West Flemish and French Flemish dialects of Dutch based on an extensive corpus of authentic spoken data. After taking stock of the existing literature, we probe into the effect of region, prosodic integration, form and function of the topicalized constituent, form of the subject, and the number of constituents in the prefield on (non)inverted word order. This is the first study that carries out regression analysis on the combined impact of these variables in the entire West Flemish and French Flemish region, with additional visualization of effect sizes. The results show that noninversion is generally more widespread than originally anticipated, with unexpectedly high occurrence of noninversion in continental West Flemish and lower frequencies in western West Flemish. With the exception of the number of constituents in the prefield, all variables had a significant impact on word order: clausal topicalized elements, elements with peripheral functions, and elements that lack prosodic integration all favor noninverted word order. The form of the subject also affected word order, but its effect is sometimes overruled by discourse considerations.

    Software: MS Excel (Microsoft Office Professional Plus 2016); R, version 4.0.5; RStudio, version 1.4.1106.

    Data for the present study were gathered from the dialect recordings collected by Ghent University and the Meertens Institute in Amsterdam in the 1960s and 1970s; see Dialectloket: Stemmen uit het verleden, http://www.dialectloket.be/geluid/stemmen-uit-het-verleden. The purpose of these recordings was to capture authentic local dialects affected as little as possible by Standard Dutch or other dialects. Recorded speakers had to meet several criteria: they had to be born and raised in the same place, be relatively old (older than 60), and have a low level of education. Ideally both their parents and their partner spoke the same dialect. Most of the dialect speakers who met these criteria were farmers born around 1900. All of the dialect speakers were born and raised well before the democratisation of education and the introduction of the mass media, which enhanced the spread of Standard Dutch in Flanders. The authentic local dialects were collected through what Mesthrie et al. (2009:90) refer to as sociolinguistic interviews: an interviewer asks questions about the interviewee's youth, profession, experiences in times of war, and so on. To minimize the distance between the "middle-class researcher versus the subject" (Mesthrie et al. 2009:90) and the impact of age or class differences between interviewer and interviewee, the interviews took place in an informal environment, with the interviewer taking on the role of a student.
References: Mesthrie, Rajend, Joan Swann, Ana Deumert, & William L. Leap. 2009. Introducing sociolinguistics. 2nd edn. Edinburgh: Edinburgh University Press.
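    The provided R Notebook is not reproduced in this listing. As an illustration of the kind of regression the abstract describes, a mixed-effects logistic model of (non)inversion on the listed predictors could be sketched as follows; all column names are hypothetical.

        # Illustrative sketch only: logistic regression of (non)inverted
        # word order on the predictors named in the abstract, with a
        # random intercept for city. Column names are hypothetical.
        library(lme4)

        v2 <- read.csv("v2_variation.csv")

        m <- glmer(inversion ~ topic_form + topic_function +
                     prosodic_integration + subject_form +
                     n_prefield_constituents + (1 | city),
                   data = v2, family = binomial)
        summary(m)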

  20. o

    Data from: Patterns, predictors, and consequence of dominance in hybrids

    • explore.openaire.eu
    • borealisdata.ca
    • +4more
    Updated Jan 1, 2020
    Cite
    Kenneth Thompson; Mackenzie Urquhart-Cronish; Kenneth D. Whitney; Loren H. Rieseberg; Dolph Schluter (2020). Patterns, predictors, and consequence of dominance in hybrids [Dataset]. http://doi.org/10.5683/sp2/qrlrc9
    Explore at:
    Dataset updated
    Jan 1, 2020
    Authors
    Kenneth Thompson; Mackenzie Urquhart-Cronish; Kenneth D. Whitney; Loren H. Rieseberg; Dolph Schluter
    Description

    Compared to those of their parents, are the traits of first-generation (F1) hybrids typically intermediate, biased toward one parent, or mismatched for alternative parental phenotypes? And how does hybrid trait expression affect fitness? To address this empirical gap, we compiled data from 198 studies in which traits were measured in a common environment for two parent taxa and their F1 hybrids. We find that individual traits in F1s are, on average, halfway between the parental midpoint and one parental value (e.g., hybrid trait values are 0.75 if parents' values are 0 and 1). When considering pairs of traits together, a hybrid's bivariate phenotype tends to resemble one parent (pairwise parent-bias) about 50% more than the other, while also exhibiting a similar magnitude of mismatch due to different traits having dominance in conflicting directions. We detect no phylogenetic signal and no effect of parental genetic distance on dominance or mismatch. Using data from an experimental field planting of recombinant hybrid sunflowers, we illustrate that pairwise parent-bias improves fitness whereas pairwise mismatch reduces fitness. In sum, our study has three major conclusions. First, hybrids between ecologically divergent natural populations are not phenotypically intermediate but rather exhibit substantial mismatch while also resembling one parent more than the other. Second, dominance and mismatch do not seem to be governed by general rules but rather by the idiosyncratic evolutionary trajectories of individual traits in individual populations or species. Finally, selection against hybrids likely results from selection against both intermediate and mismatched phenotypes.

    Everything should work if directories are established using the .Rproj file in RStudio; see the metadata and readme file for further instructions. Email me (kthomp1063@gmail.com) with any queries. Data collection for follow-up projects is ongoing; if you are interested in the most current dataset, please contact me. I'd be happy to collaborate and/or share the updated dataset. The systematic review data were compiled from the literature (many datasets came from emailing authors). The sunflower data were collected by Ken Whitney & co. All details are given in the text. Data are raw, as collected, in the 'raw_data.'
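    To make the trait scaling in the abstract concrete, the following sketch (not the authors' code) places each hybrid trait on a 0-1 axis defined by the two parental means, so that 0.5 is the parental midpoint and values near 0 or 1 indicate dominance toward one parent.

        # Illustrative sketch only: scale a hybrid trait so the parents
        # sit at 0 and 1; 0.5 is the parental midpoint.
        scale_hybrid <- function(hybrid, parent_a, parent_b) {
          (hybrid - parent_a) / (parent_b - parent_a)
        }

        # Parents at 10 and 20, hybrid at 17.5 -> 0.75, i.e., halfway
        # between the midpoint (0.5) and one parent (1).
        scale_hybrid(hybrid = 17.5, parent_a = 10, parent_b = 20)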
