Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the national health and nutrition examination survey (nhanes) with r

nhanes is this fascinating survey where doctors and dentists accompany survey interviewers in a little mobile medical center that drives around the country. while the survey folks are interviewing people, the medical professionals administer laboratory tests and conduct a real doctor's examination. the blood work and medical exam allow researchers like you and me to answer tough questions like, "how many people have diabetes but don't know they have diabetes?" conducting the lab tests and the physical isn't cheap, so a new nhanes data set becomes available once every two years and only includes about twelve thousand respondents. since the number of respondents is so small, analysts often pool multiple years of data together. the replication scripts below give a few different examples of how multiple years of data can be pooled with r. the survey gets conducted by the centers for disease control and prevention (cdc), and generalizes to the united states non-institutional, non-active duty military population. most of the data tables produced by the cdc include only a small number of variables, so importation with the foreign package's read.xport function is pretty straightforward. but that makes merging the appropriate data sets trickier, since it might not be clear what to pull for which variables. for every analysis, start with the table with 'demo' in the name -- this file includes basic demographics, weighting, and complex sample survey design variables. since it's quick to download the files directly from the cdc's ftp site, there's no massive ftp download automation script.

this new github repository contains five scripts:

2009-2010 interview only - download and analyze.R: download, import, and save the demographics and health insurance files onto your local computer; load both files, limit them to the variables needed for the analysis, and merge them together; perform a few example variable recodes; create the complex sample survey object using the interview weights; run a series of pretty generic analyses on the health insurance questions.

2009-2010 interview plus laboratory - download and analyze.R: download, import, and save the demographics and cholesterol files onto your local computer; load both files, limit them to the variables needed for the analysis, and merge them together; perform a few example variable recodes; create the complex sample survey object using the mobile examination component (mec) weights; perform a direct-method age-adjustment and match figure 1 of this cdc cholesterol brief.

replicate 2005-2008 pooled cdc oral examination figure.R: download, import, save, pool, recode, create a survey object, and run some basic analyses; replicate figure 3 from this cdc oral health databrief - the whole barplot.

replicate cdc publications.R: download, import, save, pool, merge, and recode the demographics file plus cholesterol laboratory, blood pressure questionnaire, and blood pressure laboratory files; match the cdc's example sas and sudaan syntax file's output for descriptive means, descriptive proportions, and descriptive percentiles.

replicate human exposure to chemicals report.R (user-contributed): download, import, save, pool, merge, and recode the demographics file plus urinary bisphenol a (bpa) laboratory files; log-transform some of the columns to calculate the geometric means and quantiles; match the 2007-2008 statistics shown on pdf page 21 of the cdc's fourth edition of the report.

click here to view these five scripts

for more detail about the national health and nutrition examination survey (nhanes), visit: the cdc's nhanes homepage; the national cancer institute's page of nhanes web tutorials.

notes: nhanes includes interview-only weights and interview + mobile examination component (mec) weights. if you only use questions from the basic interview in your analysis, use the interview-only weights (the sample size is a bit larger). i haven't really figured out a use for the interview-only weights -- nhanes draws most of its power from the combination of the interview and the mobile examination component variables. if you're only using variables from the interview, see if you can use a data set with a larger sample size like the current population survey (cps), national health interview survey (nhis), or medical expenditure panel survey (meps) instead.

confidential to sas, spss, stata, sudaan users: why are you still riding around on a donkey after we've invented the internal combustion engine? time to transition to r. :D
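As a hedged illustration of the basic workflow described above (the file name, design-variable names, and example variable below are assumptions based on standard NHANES conventions, not taken from the scripts themselves):

# minimal sketch: import one NHANES demographics file and build an interview-weighted design
library(foreign)   # read.xport() reads SAS transport (.xpt) files
library(survey)

demo <- read.xport("DEMO_F.XPT")   # assumes the 2009-2010 demographics file was downloaded beforehand

# SDMVSTRA / SDMVPSU / WTINT2YR are the usual strata, psu, and interview-weight variables,
# but verify them against the codebook for your cycle
nhanes_design <-
  svydesign(
    id = ~SDMVPSU,
    strata = ~SDMVSTRA,
    weights = ~WTINT2YR,
    nest = TRUE,
    data = demo
  )

svymean(~RIDAGEYR, nhanes_design, na.rm = TRUE)   # example: mean age (variable name assumed)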
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column, ‘Condition’, indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
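The full script itself is distributed as a PowerPoint slide; a minimal equivalent, assuming the three-column layout (Replicate, Condition, Value) described in Step 1, might look like this:

library(ggplot2)                                     # see Note 1
dataset <- read.csv(file.choose())                   # select the input .csv file from Step 1
dataset$Replicate <- factor(dataset$Replicate)       # treat replicates as categories, not numbers
graph <- ggplot(dataset, aes(x = Condition, y = Value))
graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
  geom_jitter(aes(col = Replicate)) + theme_bw()     # add scale_y_log10() for a log y-axis (Note 2)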
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).
We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4), which are available from CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available from CRAN (https://cran.r-project.org/package=micEconDistRay).
This replication package contains the following files and folders:
README This file
MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:
museum: Name of the museum.
type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).
munic: Municipality, in which the museum is located.
yr: Year of the observation.
units: Number of visit sites.
resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).
vis: Number of (physical) visitors.
aarc: Number of articles published (archeology).
ach: Number of articles published (cultural history).
aah: Number of articles published (art history).
anh: Number of articles published (natural history).
exh: Number of temporary exhibitions.
edu: Number of primary school classes on educational visits to the museum.
ev: Number of events other than exhibitions.
ftesc: Scientific labor (full-time equivalents).
ftensc: Non-scientific labor (full-time equivalents).
expProperty: Running and maintenance costs [1,000 DKK].
expCons: Conservation expenditure [1,000 DKK].
ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).
prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.
DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.
make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.
IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.
IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates ray-based Translog input distance functions and imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as (a huge) R object allOrderings.rds.
allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.
IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.
/tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.
/figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
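For orientation, a minimal sketch of the kind of preparation step performed by prepare_data.R (this is an illustration based on the variable list above, not the authors' code; the variable names are as documented, but the specific transformations shown are assumptions):

# sketch: import the raw data and deflate nominal expenditures with the CPI (ipc, 2014 = 1)
dat <- read.csv("MuseumsDk.csv")
dat$expPropertyReal <- dat$expProperty / dat$ipc   # running and maintenance costs in real 1,000 DKK
dat$expConsReal     <- dat$expCons / dat$ipc       # conservation expenditure in real 1,000 DKK
# drop observations with missing labor inputs (one plausible notion of "unsuitable")
dat <- subset(dat, !is.na(ftesc) & !is.na(ftensc))
write.csv(dat, "DataPrepared.csv", row.names = FALSE)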
analyze the health and retirement study (hrs) with r

the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

this new github repository contains five scripts:

1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party.

import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk; load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram).

longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create two database-backed complex sample survey objects, using a taylor-series linearization design; perform a mountain of analysis examples with wave weights from two different points in the panel.

import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html); parse through the IF block at the bottom of the sas importation script, blank out a number of variables; save the file as an R data file (.rda) for fast loading later.

replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create a database-backed complex sample survey object, using a taylor-series linearization design; exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document.

click here to view these five scripts

for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage; rand's hrs homepage; the hrs wikipedia page; a running list of publications using hrs.

notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
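A hedged sketch of the database-backed Taylor-series design that the analysis scripts build (the table name and the strata, cluster, and weight variables below are placeholders; take the real names from the RAND HRS codebook):

library(survey)   # the DBI and RSQLite packages must also be installed for a database-backed design

hrs_design <-
  svydesign(
    id = ~raehsamp,         # secondary sampling unit (assumed name)
    strata = ~raestrat,     # stratum (assumed name)
    weights = ~r10wtresp,   # respondent weight for the wave of interest (assumed name)
    nest = TRUE,
    data = "hrs_rand",      # table inside the SQLite database built by the import script (assumed name)
    dbtype = "SQLite",
    dbname = "hrs.db"       # path to that database (assumed name)
  )

svymean(~r10agey_b, hrs_design, na.rm = TRUE)   # example: mean age at one wave (variable name assumed)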
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the behavioral risk factor surveillance system (brfss) with r and monetdb

experimental. the behavioral risk factor surveillance system (brfss) aggregates behavioral health data from 400,000 adults via telephone every year. it's um clears throat the largest telephone survey in the world and it's gotta lotta uses, here's a list neato. state health departments perform the actual data collection (according to a nationally-standardized protocol and a core set of questions), then forward all responses to the centers for disease control and prevention (cdc) office of surveillance, epidemiology, and laboratory services (osels) where the nationwide, annual data set gets constructed. independent administration by each state allows them to tack on their own questions that other states might not care about. that way, florida could exempt itself from all the risky frostbite behavior questions. in addition to providing the most comprehensive behavioral health data set in the united states, brfss also ekes out my worst acronym in the federal government award - onchit a close second. annual brfss data sets have grown rapidly over the decades: the 1984 data set contained only 12,258 respondents from 15 states, all states were participating by 1994, and the 2011 file has surpassed half a million interviews. if you're examining trends over time, do your homework and review the brfss technical documents for the years you're looking at (plus any years in between). what might you find? well for starters, the cdc switched to sampling cellphones in their 2011 methodology. unlike many u.s. government surveys, brfss is not conducted for each resident at a sampled household (phone number). only one respondent per phone number gets interviewed. did i miss anything? well if your next question is frequently asked, you're in luck. all brfss files are available in sas transport format so if you're sittin' pretty on 16 gb of ram, you could potentially read.xport a single year and create a taylor-series survey object using the survey package. cool. but hear me out: the download and importation script builds an ultra-fast monet database (click here for speed tests, installation instructions) on your local hard drive. after that, these scripts are shovel-ready. consider importing all brfss files my way - let it run overnight - and during your actual analyses, code will run a lot faster. the brfss generalizes to the u.s. adult (18+) (non-institutionalized) population, but if you don't have a phone, you're probably out of scope.

this new github repository contains four scripts:

1984 - 2011 download all microdata.R: create the batch (.bat) file needed to initiate the monet database in the future; download, unzip, and import each year specified by the user; create and save the taylor-series linearization complex sample designs; create a well-documented block of code to re-initiate the monetdb server in the future.

2011 single-year - analysis examples.R: run the well-documented block of code to re-initiate the monetdb server; load the r data file (.rda) containing the taylor-series linearization design for the single-year 2011 file; perform the standard repertoire of analysis examples, only this time using sqlsurvey functions.

2010 single-year - variable recode example.R: run the well-documented block of code to re-initiate the monetdb server; copy the single-year 2010 table to maintain the pristine original; add a new drinks-per-month category variable by hand; re-create then save the sqlsurvey taylor-series linearization complex sample design on this new table; close everything, then load everything back up in a fresh instance of r; replicate statistics from this table, pulled from the cdc's web-enabled analysis tool.

replicate cdc weat - 2010.R: run the well-documented block of code to re-initiate the monetdb server; load the r data file (.rda) containing the taylor-series linearization design for the single-year 2010 file; replicate statistics from this table, pulled from the cdc's web-enabled analysis tool.

click here to view these four scripts

for more detail about the behavioral risk factor surveillance system, visit: the centers for disease control and prevention behavioral risk factor surveillance system homepage; the behavioral risk factor surveillance system wikipedia entry.

notes: if you're just scroungin' around for a few statistics, the cdc's web-enabled analysis tool (weat) might be all your heart desires. in fact, on slides seven, eight, nine of my online query tools video, i demonstrate how to use this table creator. weat's more advanced than most web-based survey analysis - you can run a regression. but only seven (of eighteen) years can currently be queried online. since data types in sql are not as plentiful as they are in the r language, the definition of a monet database-backed complex design object requires a cutoff be specified between the categorical variables and the linear ones. that cut point gets...
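For the in-memory route mentioned above (read.xport a single year, then build a Taylor-series design with the survey package), a hedged sketch follows; the file name and design-variable names are assumptions to check against the BRFSS codebook:

library(foreign)
library(survey)

brfss11 <- read.xport("LLCP2011.XPT")   # single-year SAS transport file, downloaded beforehand (name assumed)

# _STSTR (stratum), _PSU, and _LLCPWT (final weight) are the usual 2011 design variables;
# read.xport may rename the leading underscore, so check names(brfss11) first
brfss_design <-
  svydesign(
    id = ~X_PSU,
    strata = ~X_STSTR,
    weights = ~X_LLCPWT,
    nest = TRUE,
    data = brfss11
  )

svymean(~GENHLTH, brfss_design, na.rm = TRUE)   # example: self-reported general health (variable name assumed)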
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the survey of income and program participation (sipp) with r

if the census bureau's budget was gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp). it's giant. it's rich with variables. it's monthly. it follows households over three, four, now five year panels. the congressional budget office uses it for their health insurance simulation. analysts read that sipp has person-month files, get scurred, and retreat to inferior options. the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon. questions swing wild and free through the jungle canopy i mean core data dictionary. legend has it that there are still species of topical module variables that scientists like you have yet to analyze. ponce de león would've loved it here. ponce. what a name. what a guy. the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, try their survey design tutorial. since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, every respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month. to repeat: core wave files contain four records per person, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool?

this new github repository contains eight, you read me, eight scripts:

1996 panel - download and create database.R, 2001 panel - download and create database.R, 2004 panel - download and create database.R, 2008 panel - download and create database.R: since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts; properly handle the parentheses seen in a few of the sas importation scripts, because the SAScii package currently does not; create an rsqlite database; initiate a variant of the read.SAScii function that imports ascii data directly into a sql database (.db); download each microdata file - weights, topical modules, everything - then read 'em into sql.

2008 panel - full year analysis examples.R: define which waves and specific variables to pull into ram, based on the year chosen; loop through each of twelve months, constructing a single-year temporary table inside the database; read that twelve-month file into working memory, then save it for faster loading later if you like; read the main and replicate weights columns into working memory too, merge everything; construct a few annualized and demographic columns using all twelve months' worth of information; construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined; reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19).

2008 panel - point-in-time analysis examples.R: define which wave(s) and specific variables to pull into ram, based on the calendar month chosen; read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory; read the topical module and replicate weights files into working memory too, merge it like you mean it; construct a few new, exciting variables using both core and topical module questions; construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half; reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is.

2008 panel - median value of household assets.R: define which wave(s) and specific variables to pull into ram, based on the topical module chosen; read the topical module and replicate weights files into working memory too, merge once again; construct a replicate-weighted complex sample design with a...
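As a hedged illustration of the replicate-weighted design with a Fay's adjustment of one-half described above (the column names are placeholders, not the actual SIPP names; check them against the data dictionary):

library(survey)

sipp_design <-
  svrepdesign(
    weights = ~wpfinwgt,           # main person weight (assumed name)
    repweights = "repwgt[0-9]+",   # replicate-weight columns matched by a name pattern (assumed)
    type = "Fay",
    rho = 0.5,                     # fay's adjustment factor of one-half
    data = sipp_merged,            # the merged core + replicate-weight data.frame (assumed to exist)
    combined.weights = TRUE
  )

svymean(~thtotinc, sipp_design, na.rm = TRUE)   # example: mean total household income (variable name assumed)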
GNU General Public License, version 2 (GPL-2.0): https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detail levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset (MEG and MRI data) was collected by the MEG Unit Lab, McConnell Brain Imaging Center, Montreal Neurological Institute, McGill University, Canada. The original purpose was to serve as a tutorial data example for the Brainstorm software project (http://neuroimage.usc.edu/brainstorm). It is presently released in the Public Domain, and is not subject to copyright in any jurisdiction.
We would appreciate it, though, if you reference this dataset in your publications: please acknowledge its authors (Elizabeth Bock, Peter Donhauser, Francois Tadel and Sylvain Baillet) and cite the Brainstorm project's seminal publication (also in open access): http://www.hindawi.com/journals/cin/2011/879716/
3 datasets:
S01_AEF_20131218_01.ds: Run #1, 360s, 200 standard + 40 deviants
S01_AEF_20131218_02.ds: Run #2, 360s, 200 standard + 40 deviants
S01_Noise_20131218_01.ds: Empty room recordings, 30s long
File name: S01 = Subject01, AEF = auditory evoked field, 20131218 = date (Dec 18, 2013), 01 = run
The .ds files are used, rather than the AUX files (the standard at the MNI), because they are easier to manipulate in FieldTrip.
The output file is copied to each .ds folder and contains the following entries:
Around 150 head points distributed on the hard parts of the head (no soft tissues)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the national health interview survey (nhis) with r

the national health interview survey (nhis) is a household survey about health status and utilization. each annual data set can be used to examine the disease burden and access to care that individuals and families are currently experiencing across the country. check out the wikipedia article (ohh hayy i wrote that) for more detail about its current and potential uses. if you're cooking up a health-related analysis that doesn't need medical expenditures or monthly health insurance coverage, look at nhis before the medical expenditure panel survey (its sample is twice as big). the centers for disease control and prevention (cdc) has been keeping nhis real since 1957, and the scripts below automate the download, importation, and analysis of every file back to 1963. what happened in 1997, you ask? scientists cloned dolly the sheep, clinton started his second term, and the national health interview survey underwent its most recent major questionnaire re-design. here's how all the moving parts work:

a person-level file (personsx) that merges onto other files using unique household (hhx), family (fmx), and person (fpx) identifiers. [note to data historians: prior to 2004, person number was (px) and unique within each household.] this file includes the complex sample survey variables needed to construct a taylor-series linearization design, and should be used if your analysis doesn't require variables from the sample adult or sample child files. this survey setup generalizes to the noninstitutional, non-active duty military population.

a family-level file that merges onto other files using unique household (hhx) and family (fmx) identifiers.

a household-level file that merges onto other files using the unique household (hhx) identifier.

a sample adult file that includes questions asked of only one adult within each household (selected at random) - a subset of the main person-level file. hhx, fmx, and fpx identifiers will merge with each of the files above, but since not every adult gets asked these questions, this file contains its own set of weights: wtfa_sa instead of wtfa. you can merge on whatever other variables you need from the three files above, but if your analysis requires any variables from the sample adult questionnaire, you can't use records in the person-level file that aren't also in the sample adult file (a big sample size cut). this survey setup generalizes to the noninstitutional, non-active duty military adult population.

a sample child file that includes questions asked of only one child within each household (if available, and also selected at random) - another subset of the main person-level file. same deal as the sample adult description, except use wtfa_sc instead of wtfa. oh yeah, and this one generalizes to the child population.

five imputed income files. if you want income and/or poverty variables incorporated into any part of your analysis, you'll need these puppies. the replication example below uses these, but if that's impenetrable, post in the comments describing where you get stuck.

some injury stuff and other miscellanea that varies by year. if anyone uses this, please share your experience.

if you use anything more than the personsx file alone, you'll need to merge some tables together. make sure you understand the difference between setting the parameter all = TRUE versus all = FALSE -- not everyone in the personsx file has a record in the samadult and samchild files.

this new github repository contains four scripts:

1963-2011 - download all microdata.R: loop through every year and download every file hosted on the cdc's nhis ftp site; import each file into r with SAScii; save each file as an r data file (.rda); download all the documentation into the year-specific directory.

2011 personsx - analyze.R: load the r data file (.rda) created by the download script (above); set up the taylor-series linearization survey design outlined on page 6 of this survey document; perform a smattering of analysis examples.

2011 personsx plus samadult with multiple imputation - analyze.R: load the personsx and samadult r data files (.rda) created by the download script (above); merge the personsx and samadult files, highlighting how to conduct analyses that need both; create tandem survey designs for both personsx-only and merged personsx-samadult files; perform just a touch of analysis examples; load and loop through the five imputed income files, tack them onto the personsx-samadult file; conduct a poverty recode or two; analyze the multiply-imputed survey design object, just like mom used to analyze.

replicate cdc tecdoc - 2000 multiple imputation.R: download and import the nhis 2000 personsx and imputed income files, using SAScii and this imputed income sas importation script (no longer hosted on the cdc's nhis ftp site); loop through each of the five imputed income files, merging each to the personsx file and performing the same set of...
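A hedged sketch of the personsx-samadult merge and the sample-adult design described above (hhx, fmx, fpx, and wtfa_sa are the identifiers and weight named in the text; the strata and psu variable names and the example variable are assumptions to verify against the survey description document):

library(survey)

# keep only persons who also appear in the sample adult file (all = FALSE drops non-matches)
adults <- merge(personsx, samadult, by = c("hhx", "fmx", "fpx"), all = FALSE)

nhis_sa_design <-
  svydesign(
    id = ~psu_p,          # assumed name
    strata = ~strat_p,    # assumed name
    weights = ~wtfa_sa,   # sample-adult weight, as noted above
    nest = TRUE,
    data = adults
  )

svymean(~I(notcov == 1), nhis_sa_design, na.rm = TRUE)   # example: share uninsured (variable name assumed)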
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data set, R code and RMarkdown document that accompanies the publication "Fast Range Expansion of Red Imported Fire Ants in Virginia and Prediction of Future Spread in the United States" by Malone M, Ivanov K, Taylor SV & Schürch R in Ecosphere. The data set only contains our newly acquired data, and some derived data that can be used to save computing time. Shape files for US state boundaries can be downloaded from https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the social security administration public use microdata files (ssapumf) with r

the social security administration (ssa) must be overflowing with quiet heroes, because their public-use microdata files are as inconspicuous as they are thorough. sure, ssa publishes enough great statistical research of their own that outside researchers rarely find ourselves wanting more and finer data than this agency can provide, but does that stop them from releasing detailed microdata as well? why no. no it does not. if you wake up one morning with a hankerin' to study the person-level lifetime cash-flows of fdr's legacy, roll up your sleeves and start right here. compared to the other data sets on asdfree.com, the social security administration public use microdata files (ssapumf) are as straightforward as it gets. you won't find complex sample survey data here, so just review the short-and-to-the-point data descriptions then calculate your statistics the way you would with other non-survey data. each of these files contains either one record per person or one record per person per year, and effortlessly generalizes to the entire population of either social security number holders (most of the country) or social security recipients (just beneficiaries). the one-percent samples should be multiplied by 100 to get accurate nationwide count statistics and the five-percent samples by 20, but ykta (my new urban dictionary entry).

this new github repository contains one script:

download all microdata.R: download each zipped file directly onto your local computer; load each file into a data.frame using a mixture of both fancery and schmantzery; reproduce the overall count statistics provided in each respective data dictionary; save each file as an R data file (.rda) for ultra-fast future use.

click here to view this lonely script

for more detail about the social security administration public use microdata files (ssapumf), visit: the social security administration home page; the social security administration open data initiative; the national archives' history of social security.

notes: i skipped importing these new beneficiary data system (nbds) files because i broadly distrust data older than i am and you probably want these easy-to-use, far more current files anyway.

confidential to sas, spss, stata, and sudaan users: no doubt they were very impressive when they originally became available. but so was the bone flute. time to transition to r. :D
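A tiny illustration of the scaling rule above (object names are placeholders for data frames loaded from the saved .rda files):

# nationwide counts from the public-use samples
total_from_1pct <- nrow(ssa_1pct) * 100   # one-percent sample: scale record counts by 100
total_from_5pct <- nrow(ssa_5pct) * 20    # five-percent sample: scale record counts by 20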
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started

This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing

Step 1: Downloading of the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts

Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words was done by sampling medicine-related publications with human intervention. Detected concatenated words were split into two words. For instance, the word ‘ConclusionHigher’ was split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

After correction, the lengths of abstracts were calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as Microsoft Word's ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
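As an illustration of Steps 4 and 5, a minimal sketch in R (the heading list and the 30-500 word bounds come from the text above; the regular expression and the word-count rule are simplifications of the actual procedure, which relied on sampling and human checks, and `abstracts` is a placeholder data frame with a `text` column):

# Step 4 (sketch): split a concatenated section heading from the word that follows it,
# e.g. "ConclusionHigher" -> "Conclusion Higher"
headings <- c("Background", "Conclusions", "Conclusion", "Results", "Methods", "Objective")
pattern  <- paste0("\\b(", paste(headings, collapse = "|"), ")(?=[A-Z][a-z])")
abstracts$text <- gsub(pattern, "\\1 ", abstracts$text, perl = TRUE)

# Step 5 (sketch): keep abstracts of 30 to 500 words
word_count <- sapply(strsplit(trimws(abstracts$text), "\\s+"), length)
abstracts  <- abstracts[word_count >= 30 & word_count <= 500, ]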
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

Publications can include a footer below the abstract text with a copyright notice, permission policy, journal name, licence, authors' rights or conference name added by conferences and journals. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

The cleaning procedure described in the previous step led to some abstracts having fewer words than our minimum length criterion (30 words). 474 texts were removed.

Step 8: Saving the Dataset into CSV Format

Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References

[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association, Washington, DC, 1983.
BCCD Dataset is a small-scale dataset for blood cells detection.
Thanks to cosmicad and akshaylamba for the original data and annotations. The original dataset has been re-organized into VOC format. BCCD Dataset is under the MIT licence.
Data preparation is important when using machine learning. In this project, the Faster R-CNN algorithm from keras-frcnn is used for object detection. From this dataset, nicolaschen1 developed two Python scripts to prepare the data (a CSV file and images) for recognition of abnormalities in blood cells on medical images.
export.py: creates the file "test.csv" with all the data needed: filename, class_name, x1, y1, x2, y2. plot.py: plots the boxes for each image and saves them in a new directory.
Image type: JPEG. Width x height: 640 x 480.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('bccd', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/bccd-1.0.0.png
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset consists of three data folders, including all related documents of the online survey conducted within the NESP 3.2.3 project (Tropical Water Quality Hub), and a survey format document representing how the survey was designed. Apart from participants’ demographic information, the survey consists of three sections: conjoint analysis, picture rating and an open question. The corresponding outcomes of these three sections were downloaded from the Qualtrics website and used for three different data analysis processes.
Data related to the first section, “conjoint analysis”, is saved in the Conjoint analysis folder, which contains two sub-folders. The first one includes a plan file in SAV format representing the design suggested by SPSS orthogonal analysis for testing beauty factors, and the 9 photoshopped pictures used in the survey. The second (i.e. Final results) contains 1 SAV file named “data1”, which holds the imported results of the conjoint analysis section in SPSS, 1 SPS file named “Syntax1” containing the code used to run the conjoint analysis, 2 SAV files that are the output of the conjoint analysis in SPSS, and 1 SPV file named “Final output” showing results of further data analysis in SPSS on the basis of the utility and importance data.
Data related to the second section, “Picture rating”, is saved in the Picture rating folder, which includes two subfolders. One subfolder contains the 2500 pictures of the Great Barrier Reef used in the rating survey section. These pictures are organised by name and stored in two folders named “Survey Part 1” and “Survey Part 2”, which correspond to the two parts of the rating survey section. The other subfolder, “Rating results”, consists of one XLSX file containing the survey results downloaded from the Qualtrics website.
Finally, data related to the open question is saved in the “Open question” folder. It contains one CSV file and one PDF file recording participants’ answers to the open question, as well as one PNG file showing a screenshot of the Leximancer analysis outcome.
Methods: This dataset resulted from the input and output of an online survey regarding how people assess the beauty of the Great Barrier Reef. This survey was designed for multiple purposes and includes three main sections: (1) conjoint analysis (ranking 9 photoshopped pictures to determine the relative importance weights of beauty attributes), (2) picture rating (2500 pictures to be rated) and (3) an open question on the factors that make a picture of the Great Barrier Reef beautiful in participants’ opinion (determining beauty factors from the tourist perspective). Pictures used in this survey were downloaded from public sources such as the websites of Tourism and Events Queensland and Tropical Tourism North Queensland as well as tourist sharing sources (i.e. Flickr). Flickr pictures were downloaded using the key words “Great Barrier Reef”. About 10,000 pictures were downloaded in August and September 2017. 2,500 pictures were then selected based on several research criteria: (1) underwater pictures of the GBR, (2) without humans, (3) viewed from 1-2 metres from objects and (4) of high resolution.
The survey was created on the Qualtrics website and launched on 4th October 2017 using the Qualtrics survey service. Each participant rated 50 pictures randomly selected from the pool of 2500 survey pictures. 772 survey completions were recorded, and 705 questionnaires were eligible for data analysis after filtering out unqualified questionnaires. Conjoint analysis data was imported to IBM SPSS in SAV format and the output was saved in SPV format. All 2500 Great Barrier Reef pictures were rated (on a 1-10 scale) by at least 10 participants, and this rating dataset was saved in an XLSX file, which is used to train and test an Artificial Intelligence (AI)-based system for recognising and assessing the beauty of natural scenes. Answers to the open question were saved in an XLSX file and a PDF file to be used for theme analysis with the Leximancer software.
Further information can be found in the following publication: Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Monitoring aesthetic value of the Great Barrier Reef by using innovative technologies and artificial intelligence, Griffith Institute for Tourism Research Report No 15.
Format: The online survey dataset includes one PDF file showing the survey format with all sections and questions. It also contains three sub-folders, each with multiple files. The Conjoint analysis sub-folder contains the 9 .jpg pictures, one .sav file for the Orthoplan subroutine outcome, and 5 output documents (i.e. 3 .sav files, 1 .sps file, 1 .spv file). The Picture rating sub-folder contains a copy of the 2,500 pictures used in the survey and 1 Excel file of rating results. The Open question sub-folder includes 1 .csv file and 1 .pdf file containing participants' answers, and one .png file showing the analysis outcome.
Data Dictionary:
Card 1: Picture design option number 1 suggested by SPSS orthogonal analysis.
Importance value: The relative importance weight of each beauty attribute, calculated by SPSS conjoint analysis.
Utility: Score reflecting the direction and strength of each beauty attribute's influence on the beauty score.
Syntax: Code used to run the conjoint analysis in SPSS.
Leximancer: Specialised software for qualitative data analysis.
Concept map: A map showing the relationships between the concepts identified.
Q1_1: Beauty score of picture Q1_1 given by the corresponding participant (survey part 1).
Q2.1_1: Beauty score of picture Q2.1_1 given by the corresponding participant (survey part 2).
Conjoint_1: Ranking of picture 1, designed for conjoint analysis, given by the corresponding participant.
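As an illustration only (not part of the repository), the rating results could be loaded and reshaped in R roughly as follows; the file name "rating_results.xlsx" and the exact column layout are assumptions based on the data dictionary above, not the names used in the actual files.
library(readxl)
library(tidyr)
library(dplyr)

ratings_wide <- read_excel("Picture rating/Rating results/rating_results.xlsx")  # hypothetical path and file name

ratings_long <- ratings_wide %>%
  mutate(participant = row_number()) %>%
  pivot_longer(
    cols = matches("^Q[0-9]"),      # picture rating columns such as Q1_1 or Q2.1_1
    names_to = "picture_id",
    values_to = "beauty_score",
    values_drop_na = TRUE           # each participant rated only 50 of the 2,500 pictures
  )

# Mean beauty score per picture, e.g. as a target for the AI rating system
picture_means <- ratings_long %>%
  group_by(picture_id) %>%
  summarise(mean_score = mean(beauty_score), n_raters = n())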
References: Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Monitoring aesthetic value of the Great Barrier Reef by using innovative technologies and artificial intelligence, Griffith Institute for Tourism Research Report No 15.
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data esp3\3.2.3_Aesthetic-value-GBR
Description of the REX3 database
This repository provides the Resolved EXIOBASE database version 3 (REX3) of the study "Biodiversity impacts of recent land-use change driven by increases in agri-food imports", published in Nature Sustainability. The REX3 database was also used in Chapter 3 of the Global Resource Outlook 2024 of the UNEP International Resource Panel (IRP), including a data visualizer that allows for downscaling.
In REX3, Exiobase version 3.8 was merged with Eora26, production data from FAOSTAT, and bilateral trade data from the BACI database to create a highly resolved MRIO database with comprehensive regionalized environmental impact assessment following the UNEP-SETAC guidelines and integrating land-use data from the LUH2 database. REX3 distinguishes 189 countries and 163 sectors, covers a time series from 1995 to 2022, and includes several environmental and socioeconomic extensions. The environmental impact assessment covers climate impacts, PM health impacts, water stress, and biodiversity impacts from land occupation, land-use change, and eutrophication.
The folders "REX3_Year" provide the database for each year. Each folder contains the following files (*.mat-files):
T_REX: the transaction matrix
Y_REX: the final demand matrix
Q_REX and Q_Y_REX: the satellite matrices of the economy and of the final demand
The folder "REX3_Labels" provides the labels of the matrices, countries, sectors and extensions.
The database is also available as text files; contact livia.cabernard@tum.de.
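For orientation, a standard Leontief footprint calculation on one REX3 year could be sketched in R as below. This is not the authors' Matlab code; the year folder name, the use of the R.matlab package, and the assumption that each .mat file holds a single matrix are illustrative assumptions based on the folder description above.
library(R.matlab)

year_dir <- "REX3_2022"                                   # assumed folder name for one year

T_REX <- readMat(file.path(year_dir, "T_REX.mat"))[[1]]   # transactions (sector x sector)
Y_REX <- readMat(file.path(year_dir, "Y_REX.mat"))[[1]]   # final demand (sector x demand category)
Q_REX <- readMat(file.path(year_dir, "Q_REX.mat"))[[1]]   # extensions of the economy (extension x sector)

x <- rowSums(T_REX) + rowSums(Y_REX)                      # total output per sector
x[x == 0] <- 1e-12                                        # avoid division by zero for empty sectors

A <- sweep(T_REX, 2, x, "/")                              # technical coefficients A = T * diag(1/x)
L <- solve(diag(nrow(A)) - A)                             # Leontief inverse (memory-heavy at full REX3 resolution)
q <- sweep(Q_REX, 2, x, "/")                              # extension intensities per unit of output

footprint <- q %*% L %*% Y_REX                            # extensions embodied in each final demand column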
While Exiobase version 3.8.2 was used for the study "Biodiversity impacts of recent land-use change driven by increases in agri-food imports" and the Global Resource Outlook 2024, the REX3 database shared in this repository is based on Exiobase version 3.8, as this is the earliest Exiobase version that can still be shared under a Creative Commons Attribution 4.0 International License. However, the Matlab code attached to this repository allows the REX3 database to be compiled with other Exiobase versions as well (e.g., version 3.8.2), as described in the section below.
Codes to compile REX3 and reproduce the results of the study “Biodiversity impacts of recent land-use change driven by increases in agri-food imports”
The folder "matlab code to compile REX3" provides the code to compile the REX3 database. This can also be done by using an earlier exiobase version (e.g., version 3.8.2). For this purpose, the data from EXIOBASE3 need to be saved into the subfolder Files/Exiobase/…, while the data from Eora26 need to be saved into the subfolder Files/Eora26/bp/…
The folder "R code for regionalized BD impact assessment based on LUH2 data and maps (Figure 1)" contains the R code to weight the land use data from the LUH2 dataset with the species loss factors from UNEP-SETAC and to create the maps shown in Figure 1 of the paper. For this purpose, the data from the LUH2 dataset (transitions.nc) need to be stored in the subfolder "LUH2 data".
The folder "matlab code to calculate MRIO results (Figure 2-5)" contains the matlab code to calculate the MRIO Results for Figure 2-5 of the study.
The folder "R code to illustrate sankeys – Figure 3–5, S10" contains the R code to visualize the sankeys.
Data visualizer to downscale the results of the IRP Global Resource Outlook 2024 based on REX3:
A data visualizer based on REX3 that allows the results of the IRP Global Resource Outlook 2024 to be downscaled to the country level can be found here.
Earlier versions of REX:
An earlier version of this database (REX1) with time series from 1995–2015 is described in Cabernard & Pfister 2021.
An earlier version including GTAP and mining-related biodiversity impacts for the year 2014 (REX2) is described in Cabernard & Pfister 2022.
Download & conversion from .mat to .zarr files for efficient data handling:
A package for downloading, extracting, and converting REX3 data from MATLAB (.mat) to .zarr format has been provided by Yanfei Shan here: https://github.com/FayeShan/REX3_handler. Once the files are converted to .zarr format, the data can be explored and processed flexibly; for example, they can be converted to CSV with pandas, or exported as Parquet, which is more efficient for handling large datasets. Please note that this package is still under development and that more functions for MRIO analysis will be added in the future.
An Rmd file with code to reproduce the analysis in R is included.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘🍷 Alcohol vs Life Expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/alcohol-vs-life-expectancye on 13 February 2022.
--- Dataset description provided by original source is as follows ---
There is a surprising relationship between alcohol consumption and life expectancy. In fact, the data suggest that life expectancy and alcohol consumption are positively correlated - 1.2 additional years for every 1 liter of alcohol consumed annually. This is, of course, a spurious finding, because the correlation of this relationship is very low - 0.28. This indicates that other factors in those countries where alcohol consumption is comparatively high or low are contributing to differences in life expectancy, and further analysis is warranted.
[Figure: LifeExpectancy_v_AlcoholConsumption_Plot.jpg (https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg)]
The original drinks.csv file in the UNCC/DSBA-6100 dataset was missing values for The Bahamas, Denmark, and Macedonia for the wine, spirits, and beer attributes, respectively. Drinks_solution.csv shows these values filled in, for which I used the Mean of the rest of the data column.
Other methods were considered and ruled out:
Filling missing values with 0 - Zeros would skew the averages of the serving columns (beer_servings, spirit_servings, and wine_servings), and upon reviewing the Bahamas, Denmark, and Macedonia more closely, it is apparent that 0 would be a poor choice for the missing values, as all three countries clearly consume alcohol.
Filling missing values with MEAN - In the case of the drinks dataset, this is the best approach. The MEAN averages for the columns happen to be very close to the actual data from where we sourced this exercise. In addition, the MEAN will not skew the data, which the prior approaches would do.
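A minimal R sketch of this mean-imputation step (assuming the missing entries are coded as NA in drinks.csv) might look like:
drinks <- read.csv("drinks.csv", stringsAsFactors = FALSE)

# Replace each missing serving value with the mean of the non-missing values
# in the same column (rounded, since servings are counts).
for (col in c("beer_servings", "spirit_servings", "wine_servings")) {
  col_mean <- mean(drinks[[col]], na.rm = TRUE)
  drinks[[col]][is.na(drinks[[col]])] <- round(col_mean)
}

write.csv(drinks, "drinks_solution.csv", row.names = FALSE)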
The original drinks.csv dataset also had an empty data column, total_litres_of_pure_alcohol. This column needed to be calculated in order to do a simple 2D plot and trendline. It would have been possible to instead run a multi-variable regression on the data and therefore skip this step, but this adds an extra layer of complication to understanding the analysis - not to mention the point of the exercise is to go through an example of calculating new attributes (or "feature engineering") using domain knowledge.
The graphic found at the Wikipedia / Standard Drink page shows the following breakdown:
The conversion factor from fl oz to L is 1 fl oz : 0.0295735 L
Therefore, the following formula was used to compute the empty column:
total_litres_of_pure_alcohol = (beer_servings * 12 fl oz per serving * 0.05 ABV + spirit_servings * 1.5 fl oz * 0.4 ABV + wine_servings * 5 fl oz * 0.12 ABV) * 0.0295735 liters per fl oz
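In R, this feature-engineering step could be sketched as follows, using the column names from the dataset and the drinks_solution.csv file produced by the imputation step above:
FL_OZ_TO_L <- 0.0295735

drinks_solution <- read.csv("drinks_solution.csv")

drinks_solution$total_litres_of_pure_alcohol <-
  (drinks_solution$beer_servings   * 12  * 0.05 +   # beer: 12 fl oz per serving at 5% ABV
   drinks_solution$spirit_servings * 1.5 * 0.40 +   # spirits: 1.5 fl oz per serving at 40% ABV
   drinks_solution$wine_servings   * 5   * 0.12) *  # wine: 5 fl oz per serving at 12% ABV
  FL_OZ_TO_L                                        # convert fluid ounces to litres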
The lifeexpectancy.csv datafile in the https://data.world/uncc-dsba/dsba-6100-fall-2016 dataset contains life expectancy data for each country. The following query will join this data to the cleaned drinks.csv data file:
# Life Expectancy vs Alcohol Consumption
PREFIX drinks: <http://data.world/databeats/alcohol-vs-life-expectancy/drinks_solution.csv/drinks_solution#>
PREFIX life: <http://data.world/uncc-dsba/dsba-6100-fall-2016/lifeexpectancy.csv/lifeexpectancy#>
PREFIX countries: <http://data.world/databeats/alcohol-vs-life-expectancy/countryTable.csv/countryTable#>
SELECT ?country ?alc ?years
WHERE {
SERVICE <https://query.data.world/sparql/databeats/alcohol-vs-life-expectancy> {
?r1 drinks:total_litres_of_pure_alcohol ?alc .
?r1 drinks:country ?country .
?r2 countries:drinksCountry ?country .
?r2 countries:leCountry ?leCountry .
}
SERVICE <https://query.data.world/sparql/uncc-dsba/dsba-6100-fall-2016> {
?r3 life:CountryDisplay ?leCountry .
?r3 life:GhoCode ?gho_code .
?r3 life:Numeric ?years .
?r3 life:YearCode ?reporting_year .
?r3 life:SexDisplay ?sex .
}
FILTER ( ?gho_code = "WHOSIS_000001" && ?reporting_year = 2013 && ?sex = "Both sexes" )
}
ORDER BY ?country
The resulting joined data can then be saved to local disk and imported into any analysis tool like Excel, Numbers, R, etc. to make a simple scatterplot. A trendline and R^2 should be added to determine the relationship between Alcohol Consumption and Life Expectancy (if any).
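For example, that last step could be sketched in R as below, assuming the query result was exported locally as joined_alcohol_life.csv with columns country, alc and years (the file name is an assumption):
library(ggplot2)

joined <- read.csv("joined_alcohol_life.csv")     # hypothetical file name for the saved query result

fit <- lm(years ~ alc, data = joined)             # linear trend of life expectancy on alcohol
summary(fit)$r.squared                            # R^2 of the trendline

ggplot(joined, aes(x = alc, y = years)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Total litres of pure alcohol (per person, per year)",
       y = "Life expectancy at birth, 2013 (years)")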
[Figure: resulting scatterplot, LifeExpectancy_v_AlcoholConsumption_Plot.jpg]
This dataset was created by Jonathan Ortiz and contains around 200 samples along with Beer Servings, Spirit Servings, technical information and other features such as Total Litres Of Pure Alcohol, Wine Servings, and more.
- Analyze Beer Servings in relation to Spirit Servings
- Study the influence of Total Litres Of Pure Alcohol on Wine Servings
If you use this dataset in your research, please credit Jonathan Ortiz
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of the paper "Beyond the Big Five Personality Traits for Music Recommendation Systems" is to investigate the influence of personality traits, characterized by the BFI (Big Five Inventory) and its significant revision called BFI-2, on music recommendation error. The BFI-2 describes the lower-order facets of the Big Five personality traits. We performed experiments with 279 participants, using an application (called Music Master) we developed for music listening and ranking, and for collecting personality profiles of the users. Additionally, 29-dimensional vectors of audio features were extracted to describe the music files.
In our paper, we used this data set to test several hypotheses about the influence of personality traits and audio features on music recommendation error. The experiments showed that every combination of Big Five personality traits produced worse results than using the lower-order personality facets. Additionally, we found a small subset of personality facets that yielded the lowest recommendation error. This finding allows the personality questionnaire to be condensed to only the most essential questions.
The Excel file contains 5,278 entries created for 279 participants. Each entry includes preferences expressed on a 5-point Likert scale: the cognitive aspect of music listening is denoted Q1, while the motivational and interpersonal aspects are denoted Q2 and Q3, respectively. The following 20 variables (columns) contain the 20-dimensional, extended Big Five personality trait values. The last 29 columns contain the values of low-level audio features, including emotions extracted from the audio files. The Excel file can be saved as CSV and imported using a suitable programming language (e.g. Python, R, Java, Matlab and others) for further processing, i.e. for creating user-item matrices for collaborative filtering and evaluating its performance with the proposed new rating types (the motivational and interpersonal ones) described in the article.
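As an illustration only, a user-item matrix for collaborative filtering could be built in R roughly as follows; the file name and the column names user_id, item_id and Q1 are assumptions, since the actual spreadsheet headers are described only informally above.
ratings <- read.csv("music_master_ratings.csv")   # hypothetical CSV export of the Excel file

# Build a user-item matrix from the Q1 (cognitive) ratings; cells are NA where
# a participant did not listen to a given track.
user_item <- with(ratings, tapply(Q1, list(user_id, item_id), mean))
dim(user_item)                                    # participants x music tracks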
The usage of the data set requires citing the paper.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the american community survey (acs) with r and monetdb experimental. think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending how many years pooled - down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course. the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]." rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster. if you're already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr). gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding. the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat. this new github repository contains four scripts: 2005-2011 - download all microdata.R create the batch (.bat) file needed to initiate the monet database in the future download, unzip, and import each file for every year and size specified by the user create and save household- and merged/person-level replicate weight complex sample designs create a well-documented block of code to re-initiate the monetdb server in the future fair warning: this full script takes a loooong time.
run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week. 2011 single-year - analysis examples.R run the well-documented block of code to re-initiate the monetdb server load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file perform the standard repertoire of analysis examples, only this time using sqlsurvey functions 2011 single-year - variable recode example.R run the well-documented block of code to re-initiate the monetdb server copy the single-year 2011 table to maintain the pristine original add a new age category variable by hand add a new age category variable systematically re-create then save the sqlsurvey replicate weight complex sample design on this new table close everything, then load everything back up in a fresh instance of r replicate a few of the census statistics. no muss, no fuss replicate census estimates - 2011.R run the well-documented block of code to re-initiate the monetdb server load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file match every nationwide statistic on the census bureau's estimates page, using sqlsurvey functions click here to view these four scripts for more detail about the american community survey (acs), visit: the us census...