Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R script used with the accompanying data frame 'plot_character' (included in the project) to calculate summary statistics and fit structural equation models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains two files, both of which contain R objects:
chr19_snpdata_hm3only.RDS: a data frame with SNP information
evd_list_chr19_hm3.RDS: a list of eigen decompositions of the SNP correlation matrix spanning chromosome 19
These data contain only SNPs present in both 1000 Genomes and HapMap3. Correlation matrices were estimated using LD Shrink. These data were built for use with the causeSims R package found here: https://github.com/jean997/causeSims
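Both files are serialized single R objects, so they load with readRDS(). A minimal sketch follows; the readRDS() calls are commented out because the files are not bundled here, and the toy eigen() call only illustrates the $values/$vectors structure each eigen-decomposition element should have:

```r
# Loading the two objects (paths assume the files sit in the working directory):
# snpdata  <- readRDS("chr19_snpdata_hm3only.RDS")   # data frame of SNP info
# evd_list <- readRDS("evd_list_chr19_hm3.RDS")      # list of eigen decompositions

# Toy stand-in for one element of evd_list: the eigen decomposition of a
# small SNP correlation matrix has the same shape ($values, $vectors).
set.seed(1)
geno <- matrix(rnorm(200), nrow = 20, ncol = 10)  # 20 samples x 10 "SNPs"
evd  <- eigen(cor(geno))
stopifnot(identical(names(evd), c("values", "vectors")))
# Eigenvalues of a correlation matrix sum to its dimension:
stopifnot(abs(sum(evd$values) - 10) < 1e-8)
```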
This data release provides comprehensive results of monotonic trend assessment for long-term U.S. Geological Survey (USGS) streamgages in or proximal to the watersheds of Mobile and Perdido Bays, south-central United States (Tatum and others, 2024). Long-term is defined as streamgages having at least five complete decades of daily streamflow data since January 1, 1950, exclusive to those streamgages also having the entire 2010s decade represented. Input data for the trend assessment are daily streamflow data retrieved on March 8, 2024 (U.S. Geological Survey, 2024) and formatted using the fill_dvenv() function in akqdecay (Crowley-Ornelas and others, 2024). Monotonic trends were assessed for each of 69 streamgages using 26 Mann-Kendall hypothesis tests for 20 hydrologic metrics understood as particularly useful in ecological studies (Henriksen and others, 2006) and another 6 metrics measuring well-known streamflow properties, such as annual harmonic mean streamflow (Asquith and Heitmuller, 2008) and annual mean streamflow with decadal flow-duration curve quantiles (10th, 50th, and 90th percentiles) (Crowley-Ornelas and others, 2023). Helsel and others (2020) provide background and description of the Mann-Kendall hypothesis test. Some of the trend analyses are based on the annual values of a hydrologic metric (calendar year is the time interval for the test) whereas others are decadal (decade is the time interval for the test). The principal result output for this data release (monotrnd_1hyp.txt) clearly distinguishes the time interval for the respective tests. This data release includes the computational workflow to conduct the hypothesis testing and the requisite data manipulations. The workflow comprises the core computation script monotrnd_script.R and an auxiliary script containing functions for 20 ecological flow metrics.
This means that the script monotrnd_script.R requires additional functions to be loaded into the R workspace and sources the file monotrnd_ecomets_include.R. This design isolates the 20 ecologically oriented hydrologic metrics (subroutines) (logic and nomenclature therein are informed by Henriksen and others, 2006) from the streamgage-looping workflow and other data-manipulation features in monotrnd_script.R. The script monotrnd_script.R is designed to use time series of daily mean streamflow stored in an R environment data object using the streamgage identification number as the key and a data frame (table) of the daily streamflows in the format defined by the dvget() function and filled by the fill_dvenv() function of the akqdecay R package (see supplemental information section; Crowley-Ornelas and others, 2024). Additionally, monotrnd_script.R tags a specific subset of streamgages within the workflow, identified by the authors as "major nodes," with a binary indicator (1 or 0) to support targeted analyses on these selected locations. The data in file monotrnd_1hyp.txt are comma-delimited results of Kendall tau or other test statistics and p-values of the Mann-Kendall hypothesis tests as part of monotonic trend assessment for 69 USGS streamgages using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics. The data include USGS streamgage identification numbers with a prepended "S" character, decimal latitudes and longitudes for the streamgage locations, the range of calendar years and decades of streamflow processed along with integer counts of the number of calendar years and decades, and the Kendall tau (or other test statistic) and associated p-value of the test statistic for the 26 streamflow metrics considered. Broadly, the "left side of the table" presents the results for the tests on metrics using calendar-year time steps, and the "right side of the table" presents the results for the tests on metrics using decade time steps.
The file does not assign or draw conclusions on statistical significance; instead, the p-values are provided so that users can apply their own significance threshold. The file monotrnd_dictionary_1hyp.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_1hyp.txt. (This dictionary and two others accompany this data release to facilitate potential reuse of the information.) The file monotrnd_1hyp.txt stems from the ending computational steps in script monotrnd_script.R. Short summaries synthesizing the information in file monotrnd_1hyp.txt are available in files monotrnd_3cnt.txt and monotrnd_2stn.txt, also accompanying this data release. The data in file monotrnd_2stn.txt are comma-delimited summaries by streamgage identification number of the monotonic trend assessments for the 26 Mann-Kendall hypothesis tests on streamflow metrics as described elsewhere in this data release. The summary data are composed of records (rows) by streamgage that include columns of (1) streamgage identification numbers with a prepended "S" character, (2) decimal latitudes and longitudes for the streamgage locations, (3) the integer count of the number of hypothesis tests, (4) the integer count of the number of tests for which the computed hypothesis-test p-values are less than the 0.05 level of statistical significance (so-called alpha = 0.05), and (5) colon-delimited strings of alphanumeric characters identifying each of the statistically significant tests for the respective streamgage. The file monotrnd_dictionary_2stn.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_2stn.txt. The file monotrnd_2stn.txt stems from the ending computational steps in script monotrnd_script.R described elsewhere in this data release through its production of monotrnd_1hyp.txt; the latter file provides the values used to assemble monotrnd_2stn.txt.
The information in file monotrnd_3cnt.txt is comma-delimited summaries of arithmetic means of Kendall tau or other test statistics, as well as integer counts of statistically significant trends, as part of the monotonic trend assessment using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics for 69 USGS streamgages as described elsewhere in this data release. The two-column summary data are composed of a first row indicating, as a character string, the integer number of streamgages (69), and subsequent rows pairing a three-decimal character-string representation of the mean Kendall tau (or the test statistic of a seasonal Mann-Kendall test) with a character string of the integer count of statistically significant tests for the respective test as it was applied to the 69 streamgages. Statistical significance is defined as p-values less than the 0.05 level of statistical significance (so-called alpha = 0.05). The file monotrnd_dictionary_3cnt.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_3cnt.txt. The file monotrnd_3cnt.txt stems from the ending computational steps in script monotrnd_script.R described elsewhere in this data release through its production of monotrnd_1hyp.txt; the latter file provides the values used to assemble monotrnd_3cnt.txt.
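The Mann-Kendall test underlying these results is, in essence, Kendall's tau computed between a metric and time. A minimal, self-contained R sketch of that core idea (synthetic data; this is not the release's monotrnd_script.R workflow):

```r
# Monotonic trend detection via Kendall's tau against time: the Mann-Kendall
# test corresponds to cor.test(..., method = "kendall") with time as one variable.
# Toy annual series with an upward trend (not data from this release):
set.seed(42)
years  <- 1950:2019
metric <- 0.05 * (years - 1950) + rnorm(length(years))
mk <- cor.test(years, metric, method = "kendall")
tau  <- unname(mk$estimate)  # trend direction and strength
pval <- mk$p.value           # compare against alpha = 0.05, as in monotrnd_2stn.txt
stopifnot(tau > 0, pval < 0.05)
```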
Understanding the evolution of traits subject to trade-offs is challenging because phenotypes can (co)vary at both the among- and within-individual levels. Among-individual covariation indicates consistent, possibly genetic, differences in how individuals resolve the trade-off, while within-individual covariation indicates trait plasticity. There is also the potential for consistent among-individual differences in behavioral plasticity, although this has rarely been investigated. We studied the sources of (co)variance in two characteristics of an acoustic advertisement signal that trade off with one another and are under sexual selection in the gray treefrog, Hyla chrysoscelis: call duration and call rate. We recorded males on multiple nights calling spontaneously and in response to playbacks simulating different competition levels. Call duration, call rate, and their product, call effort, were all repeatable both within and across social contexts. Call duration and call rate covaried...
# Data and code from: Partitioning variance in a signaling trade-off under sexual selection reveals among-individual covariance in trait allocation
Michael S. Reichert, Ivan de la Hera, Maria Moiron
Evolution 2024
Summary: Data are measurements of the characteristics of individual calls from a study of individual variation in calling in Cope's gray treefrog, Hyla chrysoscelis.
Note: There are some NA entries in the data files because these are outputs of R data frames. NA corresponds to an empty cell (i.e. no data are available for that variable for that row).
List of files: TreefrogVariance.csv - This is the main raw data file. Each row contains the data from a single call. Variables are as follows: CD - call duration, in seconds; CR - call rate, in calls/second. *Note that the intercall interval (ICI), which is analyzed in the supplement as an alternative to call rate, is not directly included in this data file but can be calculated a...
https://spdx.org/licenses/CC0-1.0.html
The COVID-19 pandemic lockdown worldwide provided a unique research opportunity for ecologists to investigate the human-wildlife relationship under abrupt changes in human mobility, also known as the Anthropause. Here we chose 15 common non-migratory bird species with different levels of synanthropy, and we aimed to compare how human mobility changes could influence the occupancy of fully synanthropic species such as the House Sparrow (Passer domesticus) versus casual to tangential synanthropic species such as the White-breasted Nuthatch (Sitta carolinensis). We extracted data from the eBird citizen science project during three study periods in the spring and summer of 2020, when human mobility changed unevenly across different counties in North Carolina. We used the COVID-19 Community Mobility Reports from Google to examine how community mobility changes towards workplaces, an indicator of overall human movements at the county level, could influence bird occupancy.
Methods
The data source we used for bird data was eBird, a global citizen science project run by the Cornell Lab of Ornithology. We used the COVID-19 Community Mobility Reports by Google to represent the pause of human activities at the county level in North Carolina. These data are publicly available and were last updated on 10/15/2022. We used forest land cover data from NC One Map, high-resolution (1-meter pixel) raster data from 2016 imagery, to represent canopy cover at each eBird checklist location. We also used the raster data of the 2019 National Land Cover Database to represent the degree of development/impervious surface at each eBird checklist location. All three measurements were used at the highest resolution available. We downloaded the eBird Basic Dataset (EBD) that contains the 15 study species from February to June 2020. We also downloaded the sampling event data that contains the checklist effort information.
First, we used the R package auk (version 0.6.0) in R (version 4.2.1) to filter data on the following conditions: (1) Date: 02/19/2020 - 03/29/2020; (2) Checklist type: stationary; (3) Complete checklist; (4) Time: 07:00 am - 06:00 pm; (5) Checklist duration: 5-20 mins; (6) Location: North Carolina. After filtering data, we used the zero-fill function from auk to create detection/non-detection data for each study species in NC. Then we used the repeat-visits filter from auk to keep eBird checklist locations where at least 2 checklists (max 10 checklists) had been submitted to the same location by the same observer, allowing us to create a hierarchical data frame where both the detection and state processes can be analyzed using occupancy modeling. This data frame was in a matrix format in which each row represents a sampling location and the columns represent the detection and non-detection of the 2-10 repeat sampling events. For the Google Community Mobility data, we chose the "Workplaces" category of mobility data to analyze the Anthropause effect because it was highly relevant to the pause of human activities in urban areas. The mobility data from Google is a percentage change compared to a baseline for each day. A baseline day represents a normal value for that day of the week over a 5-week period (01/03/2020-02/06/2020). For example, a mobility value of -30.0 for Wake County on Apr 15, 2020, means the overall mobility in Wake County on that day decreased by 30% compared to the baseline day a few months earlier. Because the eBird data we used cover ranges of dates rather than single days, we took the average value of mobility before lockdown, during lockdown, and after lockdown in each county in NC. For the environmental variables, we calculated the values in ArcGIS Pro (version 3.1.0). We created a 200 m buffer at each eligible eBird checklist location.
For the forest cover data, we used "Zonal Statistics as Table" to extract the percentage of forest cover within each checklist location's 200-meter circular buffer. For the National Land Cover Database (NLCD) data, we combined low-intensity, medium-intensity, and high-intensity development as development cover and used "Summarize Within" to extract the percentage of development cover using the polygon version of NLCD. We used a correlation matrix of the three predictors (workplace mobility, percent forest cover, and percent development cover) and found no collinearity. Thus, these three predictors plus the interaction between workplace mobility and percent development cover were the site covariates of the occupancy models. For the detection covariates, four predictors were considered: time of observation, checklist duration, number of observers, and workplace mobility. These detection covariates were also not highly correlated. We then merged all data into an unmarked data frame using the "unmarked" R package (version 1.2.5). The unmarked data frame has eBird sampling locations as sites (rows in the data frame) and repeat checklists at the same sampling locations as repeat visits (columns in the data frame).
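The site-by-visit structure described above can be illustrated with a toy detection/non-detection matrix (synthetic values; in the real workflow this matrix comes from auk's repeat-visits filtering and is analyzed with the unmarked package rather than the naive summary shown here):

```r
# Toy detection history: rows are eBird locations (sites), columns are the
# 2-10 repeat checklists at that site; NA pads sites with fewer visits.
det <- rbind(
  site1 = c(1, 0, 1, NA, NA),   # detected on visits 1 and 3
  site2 = c(0, 0, 0, 0, 0),     # never detected in 5 visits
  site3 = c(0, 1, NA, NA, NA)   # detected on visit 2 of 2
)
# Naive occupancy: share of sites with at least one detection. An occupancy
# model (e.g. occu() in unmarked) would additionally separate the detection
# process from the state (occupancy) process using the repeat visits.
naive_occ <- mean(apply(det, 1, function(v) any(v == 1, na.rm = TRUE)))
stopifnot(naive_occ == 2/3)
```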
---
title: "BellaBeat Fitbit"
author: "C Romero"
date: "`r Sys.Date()`"
output:
  html_document:
    number_sections: true
---
```{r}
# Install the packages used in this analysis (base ships with R and needs
# no installation). Run once, then comment out.
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("lubridate")  # makes it easier to work with dates and times
install.packages("dplyr")
install.packages("readr")
install.packages("tidyr")

library(lubridate) # dates and times
library(ggplot2)   # elegant data visualisations using the grammar of graphics
library(dplyr)     # a grammar of data manipulation
library(readr)     # read rectangular text data
library(tidyr)     # tidy data
```
## Reading in files

```{r}
list.files(path = "../input")
```

```{r}
# Load the activity and sleep data sets
dailyActivity <- read_csv("../input/wellness/dailyActivity_merged.csv")
sleepDay <- read_csv("../input/wellness/sleepDay_merged.csv")

# Check for duplicates and missing values
sum(duplicated(dailyActivity))
sum(duplicated(sleepDay))
sum(is.na(dailyActivity))
sum(is.na(sleepDay))

# Drop duplicate sleep records and preview both tables
sleepy <- sleepDay %>% distinct()
head(sleepy)
head(dailyActivity)

# Number of distinct members in each table
n_distinct(dailyActivity$Id)
n_distinct(sleepy$Id)

# Total steps per member, most active first
dailyActivity %>%
  group_by(Id) %>%
  summarise(freq = sum(TotalSteps)) %>%
  arrange(-freq)

# Total distance per member
Tot_dist <- dailyActivity %>%
  mutate(Id = as.character(Id)) %>%
  group_by(Id) %>%
  summarise(dizzy = sum(TotalDistance)) %>%
  arrange(-dizzy)

# Total minutes asleep and total time in bed per member
sleepy %>% group_by(Id) %>% summarise(Msleep = sum(TotalMinutesAsleep)) %>% arrange(Msleep)
sleepy %>% group_by(Id) %>% summarise(inBed = sum(TotalTimeInBed)) %>% arrange(inBed)

# Plot total distance per member (size is a constant, so it belongs
# outside aes(), not inside it)
ggplot(Tot_dist) +
  geom_count(mapping = aes(x = Id, y = dizzy, color = Id), size = 2) +
  labs(x = "member id's", y = "total distance (miles)", title = "distance miles") +
  theme(axis.text.x = element_text(angle = 90))
```
https://spdx.org/licenses/CC0-1.0.html
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. 
Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states: MA, MI, NY, PA, VT; sampling effort per sampling unit; geographic location of sampling units; and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript but also here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
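The horizontal pivot described above (one row per sampling unit, one column per species) can be sketched in base R; the manuscript's actual script is not included here, and the toy records below are purely illustrative:

```r
# Long-format records: one row per (sampling unit, species) observation,
# presence coded 1/0 as described above.
obs <- data.frame(
  unit    = c("MA-001", "MA-001", "MA-002", "MA-002"),
  species = c("Northern Flicker", "Myrtle Warbler",
              "Northern Flicker", "Myrtle Warbler"),
  presence = c(1, 0, 1, 1)
)
# Pivot wide: rows = sampling units, columns = species (the role that
# reshape() or tidyr's pivot_wider() would play in a full workflow).
wide <- xtabs(presence ~ unit + species, data = obs)
stopifnot(nrow(wide) == 2, ncol(wide) == 2)
stopifnot(wide["MA-002", "Myrtle Warbler"] == 1)
```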
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
https://spdx.org/licenses/CC0-1.0.html
Achieving a high-quality reconstruction of a phylogenetic tree with branch lengths proportional to absolute time (a chronogram) is a difficult and time-consuming task. But the increased availability of fossil and molecular data and time-efficient analytical techniques have resulted in many recent publications of large chronograms for a large number and wide diversity of organisms. Knowledge of the evolutionary time frame of organisms is key for research in the natural sciences. It also represents valuable information for education, science communication, and policy decisions. When chronograms are shared in public, open databases, this wealth of expertly curated and peer-reviewed data on evolutionary time frames is exposed in a programmatic and reusable way, as intensive and localized efforts have improved data-sharing practices and incentivized open science in biology. Here we present DateLife, a service implemented as an R package and an R Shiny website application available at www.datelife.org, that provides functionalities for efficient and easy finding, summary, reuse, and reanalysis of expert, peer-reviewed, public data on the time frame of evolution. The main DateLife workflow constructs a chronogram for any given combination of taxon names by searching a local chronogram database constructed and curated from the Open Tree of Life Phylesystem phylogenetic database, which incorporates phylogenetic data from the TreeBASE database as well. We implement and test methods for summarizing time data from multiple source chronograms using supertree and congruification algorithms, and using age data extracted from source chronograms as secondary calibration points to add branch lengths proportional to absolute time to a tree topology.
DateLife will be useful to increase awareness of the existing variation in alternative hypotheses of evolutionary time for the same organisms, and can foster exploration of the effect of alternative evolutionary timing hypotheses on the results of downstream analyses, providing a framework for a more informed interpretation of evolutionary results.
Methods
This dataset contains files, figures, and tables from the two examples shown in the manuscript (the small example and the Fringillidae example), as well as from the cross-validation analysis performed. Small example of the DateLife workflow: 1. Processed an input of 6 bird species within the Passeriformes (Pheucticus tibialis, Rhodothraupis celaeno, Emberiza citrinella, Emberiza leucocephalos, Emberiza elegans, and Platyspiza crassirostris); 2. Used the processed names to search DateLife's chronogram database; 3. Summarized results from matching chronograms. Fringillidae example: http://phylotastic.org/datelife/articles/fringiliidae.html Cross-validation: We performed a cross-validation analysis of the DateLife workflow using 19 Fringillidae chronograms found in DateLife's database. We used the individual tree topologies from each of the 19 source chronograms as inputs, treating their node ages as unknown. We then estimated dates for these topologies using node ages of chronograms belonging to the remaining 12 studies as secondary calibrations, smoothing with BLADJ.
http://opendatacommons.org/licenses/dbcl/1.0/
Updated to include all seasons up to and including season 10.
This dataset contains data from the TV series Alone collected and shared by Dan Oehm. As described in [Oehm's blog post](https://gradientdescending.com/alone-r-package-datasets-from-the-survival-tv-series/), in the survival TV series "Alone," 10 survivalists are dropped in an extremely remote area and must fend for themselves. They aim to last 100 days in the Arctic winter, living off the land through their survival skills, endurance, and mental fortitude.
This package contains four datasets:
survivalists.csv: A data frame of survivalists across all seasons detailing name and demographics, location and profession, result, days lasted, reasons for tapping out (detailed and categorised), and page URL.
loadouts.csv: The rules allow each survivalist to take 10 items with them. This dataset includes information on each survivalist's loadout. It has detailed item descriptions and a simplified version for easier aggregation and analysis.
episodes.csv: This dataset contains details of each episode including the title, number of viewers, beginning quote, and IMDb rating. New episodes are added at the end of future seasons.
seasons.csv: The season summary dataset includes location, latitude and longitude, and other season-level information. It includes the date of drop-off where the information exists.
Acknowledging the Alone dataset
Dan Oehm:
Alone data package: https://github.com/doehm/alone Alone data package blog post: https://gradientdescending.com/alone-r-package-datasets-from-the-survival-tv-series/ Examples of analyses are included in Dan Oehm's blog post.
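A hedged sketch of the kind of summary the survivalists.csv table supports; the column names season and days_lasted follow the package's description above, but the rows here are made up, so load the real CSV for actual analyses:

```r
# Toy rows mimicking survivalists.csv (made-up values; the real file ships
# with the alone package / repository linked above).
survivalists <- data.frame(
  season      = c(1, 1, 2, 2),
  name        = c("A", "B", "C", "D"),
  days_lasted = c(64, 21, 87, 10)
)
# Longest run per season, the sort of aggregation the dataset is built for:
best <- aggregate(days_lasted ~ season, data = survivalists, FUN = max)
stopifnot(best$days_lasted[best$season == 2] == 87)
```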
References
History: https://www.history.com/shows/alone/cast
Wikipedia: https://en.wikipedia.org/wiki/Alone_(TV_series)
Wikipedia (episodes): https://en.wikipedia.org/wiki/List_of_Alone_episodes#Season_1_(2015)_-_Vancouver_Island
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Copernicus Land Monitoring Service provides Surface Soil Moisture 2014-present (raster 1 km), Europe, daily – version 1. Each daily image covers only 5 to 10% of the European land mask and shows lines of scenes (obvious artifacts). This dataset contains long-term aggregates of the daily soil moisture images (0–100%), based on two types of aggregation: quarterly and annual.
The soil moisture rasters are based on Sentinel-1 and are described in detail in:
You can access and download the original data as .nc files from: https://globalland.vito.be/download/manifest/ssm_1km_v1_daily_netcdf/.
The files with pattern "soil.moisture_s1.clms.qr.*.p0.*.gf_m_1km_20140101_20241231_eu_epsg4326_v20250211.tif" are the gap-filled quarterly soil moisture estimates. For gap filling, I built a model using circa 250,000 random training points and the relationship with CHELSA bioclimatic variables, ESA CCI snow cover probability, ESA CCI forest and bare-area percent cover, and Global Water Pack long-term surface water fraction. The gap-filling model had an R-square of 0.96 and an RMSE of 6.5% soil moisture.
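The gap-filling model itself is described but not shown. As a stand-in, here is a minimal sketch of the idea, predicting soil moisture from static covariates, using a linear model on synthetic data (the actual model, training points, and covariates differ; the variable names below are invented):

```r
# Synthetic covariates standing in for CHELSA bioclim, snow-cover probability,
# land-cover fractions, etc.; 'sm' is the target soil moisture (0-100%).
set.seed(7)
n    <- 1000
covs <- data.frame(bio1 = rnorm(n), snow = runif(n), bare = runif(n))
sm   <- pmin(pmax(50 + 8*covs$bio1 - 20*covs$bare + rnorm(n, sd = 3), 0), 100)
train <- cbind(covs, sm = sm)
fit   <- lm(sm ~ ., data = train)        # the release used ~250k training points
r2    <- summary(fit)$r.squared
rmse  <- sqrt(mean(residuals(fit)^2))
# With informative covariates the fit is strong, mirroring (on synthetic data)
# the reported R-square of 0.96 and RMSE of 6.5%:
stopifnot(r2 > 0.8, rmse < 5)
```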
Aggregation was generated using the terra package in R in combination with the matrixStats::rowQuantiles function. The tiling system and land mask for pan-EU are also available.
library(terra)
library(matrixStats)
g1 = terra::vect("/mnt/inca/EU_landmask/tilling_filter/eu_ard2_final_status.gpkg")
## 1254 tiles
tile = g1[534]
nc.lst = list.files('/mnt/landmark/SM1km/ssm_1km_v1_daily_netcdf/', pattern = glob2rx("*.nc"), full.names=TRUE)
## 3726
## test it
#r = terra::rast(nc.lst[100:210])
agg_tile = function(r, tile, pv=c(0.05,0.5,0.95), out.year="2015.annual"){
bb = paste(as.vector(ext(tile)), collapse = ".")
out.tif = paste0("./eu_tmp/", out.year, "/sm1km_", pv, "_", out.year, "_", bb, ".tif")
if(any(!file.exists(out.tif))){
r.t = terra::crop(r, ext(tile))
r.t = as.data.frame(r.t, xy=TRUE, na.rm=FALSE)
sel.c = grep("^ssm", colnames(r.t))  ## select the soil-moisture layers, not x/y
t1s = cbind(data.frame(matrixStats::rowQuantiles(as.matrix(r.t[,sel.c]), probs = pv, na.rm=TRUE)), data.frame(x=r.t$x, y=r.t$y))
## write to GeoTIFFs
r.o = terra::rast(t1s[,c("x","y","X5.","X50.","X95.")], type="xyz", crs="+proj=longlat +datum=WGS84 +no_defs")
for(k in 1:length(pv)){
terra::writeRaster(r.o[[k]], filename=out.tif[k], gdal=c("COMPRESS=DEFLATE"), datatype='INT2U', NAflag=32768, overwrite=FALSE)
}
rm(r.t); gc()
tmpFiles(remove=TRUE)
}
}
## quarterly values:
library(lubridate)
lA = data.frame(filename=nc.lst)
lA$Date = ymd(sapply(lA$filename, function(i){substr(strsplit(basename(i), "_")[[1]][4], 1, 8)}))
#summary(is.na(lA$Date))
#hist(lA$Date, breaks=60)
lA$quarter = quarter(lA$Date, fiscal_start = 11)
summary(as.factor(lA$quarter))
for(qr in 1:4){
  #qr=1
  pth = paste0("A.q", qr)
  rs = terra::rast(lA$filename[lA$quarter==qr])
  x = parallel::mclapply(sample(1:length(g1)), function(i){try( agg_tile(rs, tile=g1[i], out.year=pth) )}, mc.cores=20)
  for(type in c(0.05,0.5,0.95)){
    x <- list.files(path=paste0("./eu_tmp/", pth), pattern=glob2rx(paste0("sm1km_", type, "_*.tif")), full.names=TRUE)
    out.tmp <- paste0(pth, ".", type, ".sm1km_eu.txt")
    vrt.tmp <- paste0(pth, ".", type, ".sm1km_eu.vrt")
    cat(x, sep="\n", file=out.tmp)
    system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))
    system(paste0('gdal_translate ', vrt.tmp, ' ./cogs/soil.moisture_s1.clms.qr.', qr, '.p', type, '_m_1km_20140101_20241231_eu_epsg4326_v20250206.tif -ot "Byte" -r "near" --config GDAL_CACHEMAX 9216 -co BIGTIFF=YES -co NUM_THREADS=80 -co COMPRESS=DEFLATE -of COG -projwin -32 72 45 27'))
  }
}
## per year ----
for(year in 2015:2023){
  l.lst = nc.lst[grep(year, basename(nc.lst))]
  r = terra::rast(l.lst)
  pth = paste0(year, ".annual")
  x = parallel::mclapply(sample(1:length(g1)), function(i){try( agg_tile(r, tile=g1[i], out.year=pth) )}, mc.cores=40)
  ## Mosaics:
  for(type in c(0.05,0.5,0.95)){
    x <- list.files(path=paste0("./eu_tmp/", pth), pattern=glob2rx(paste0("sm1km_", type, "_*.tif")), full.names=TRUE)
    out.tmp <- paste0(pth, ".", type, ".sm1km_eu.txt")
    vrt.tmp <- paste0(pth, ".", type, ".sm1km_eu.vrt")
    cat(x, sep="\n", file=out.tmp)
    system(paste0('gdalbuildvrt -input_file_list ', out.tmp, ' ', vrt.tmp))
    system(paste0('gdal_translate ', vrt.tmp, ' ./cogs/soil.moisture_s1.clms.annual.', type, '_m_1km_', year, '0101_', year, '1231_eu_epsg4326_v20250206.tif -ot "Byte" -r "near" --config GDAL_CACHEMAX 9216 -co BIGTIFF=YES -co NUM_THREADS=80 -co COMPRESS=DEFLATE -of COG -projwin -32 72 45 27'))
  }
}
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Major League Baseball Data from the 1986 and 1987 seasons.
Hitters
A data frame with 322 observations of major league players on the following 20 variables.
AtBat: Number of times at bat in 1986
Hits: Number of hits in 1986
HmRun: Number of home runs in 1986
Runs: Number of runs in 1986
RBI: Number of runs batted in in 1986
Walks: Number of walks in 1986
Years: Number of years in the major leagues
CAtBat: Number of times at bat during his career
CHits: Number of hits during his career
CHmRun: Number of home runs during his career
CRuns: Number of runs during his career
CRBI: Number of runs batted in during his career
CWalks: Number of walks during his career
League: A factor with levels A and N indicating player's league at the end of 1986
Division: A factor with levels E and W indicating player's division at the end of 1986
PutOuts: Number of put outs in 1986
Assists: Number of assists in 1986
Errors: Number of errors in 1986
Salary: 1987 annual salary on opening day in thousands of dollars
NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York
summary(Hitters)
Dataset imported from https://www.r-project.org.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Read me – Schimmelradar manuscript
The code in this repository was written to analyse the data and generate figures for the manuscript “Land use drives spatial structure of drug resistance in a fungal pathogen”.
This repository consists of two original .csv raw data files, two .tif files that were minimally reformatted after being downloaded from LGN.nl and www.pdok.nl/introductie/-/article/basisregistratie-gewaspercelen-brp-, and nine scripts written in the R language. The remaining files are intermediate .tif and .csv files, included to skip the more computationally demanding steps of the analysis and facilitate its reproduction.
Data files:
Schimmelradar_360_submission.csv: The raw phenotypic resistance spatial data from the air sample
Sample: an arbitrary sample code given to each of the participants
Area: A random number assigned to each of the 100 areas the Netherlands was split up into to facilitate an even spread of samples across the country during participant selection.
Logistics status: Variable used to indicate whether the sample was returned in good order, not otherwise used in the analysis.
Arrived back on: The date by which the sample arrived back at Wageningen University
Quality seals: quality of the seals upon sample return, only samples of a quality designated as good seals were processed. (also see Supplement file – section A).
Start sampling: The date on which the trap was deployed and the stickers exposed to the air, recorded by the participant
End sampling: The date on which the trap was taken down and the stickers were re-covered and no longer exposed to the air, recorded by the participant
3 back in area?: Binary indicating whether at least three samples have been returned in the respective area (see Area)
Batch: The date on which processing of the sample was started. To be more specific, the date at which Flamingo medium was poured over the seals of the sample and incubation was started.
Lab processing: Binary indicating completion of lab processing
Tot ITR: A. fumigatus CFU count in the permissive layer of the itraconazole-treated plate
RES ITR: CFU count of colonies that had breached the surface of the itraconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.
RF ITR: The itraconazole (~4 mg/L) resistance fraction = RES ITR/Tot ITR
Muccor ITR: Indication of the presence of Mucorales spp. growth on the itraconazole treatment plate
Tot VOR: A. fumigatus CFU count in the permissive layer of the voriconazole-treated plate
RES VOR: CFU count of colonies that had breached the surface of the voriconazole-treated layer after incubation and were visually (with the unaided eye) sporulating.
RF VOR: The voriconazole (~2 mg/L) resistance fraction = RES VOR/Tot VOR
Muccor VOR: Indication of the presence of Mucorales spp. growth on the voriconazole treatment plate
Tot CON: CFU count on the untreated growth control plate
Note: Note on the sample based on either information given by the participant or observations in the lab. The exclude label was given if the sample had either too few (<25) or too many (>300) CFUs on one or more of the plates (also see Supplement file – section A).
Lat: Exact latitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.
Long: Exact longitude of the address where the sample was taken. Not used in the published version of the code and hidden for privacy reasons.
Round_Lat: Rounded latitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.
Round_Long: Rounded longitude of the address where the sample was taken. Rounded down to two decimals (the equivalent of a 1 km2 area), so they could not be linked to a specific address. Used in the published version of the code.
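As a minimal illustration of how the phenotypic columns above combine (invented values, and sketched in Python rather than the project's R scripts): the resistance fraction is the resistant CFU count divided by the total CFU count per azole, and samples outside the 25-300 CFU window are excluded.

```python
import pandas as pd

# Toy rows mimicking the columns described above (all values invented).
df = pd.DataFrame({
    "Sample": ["S001", "S002", "S003"],
    "Tot ITR": [120, 85, 210],
    "RES ITR": [3, 0, 12],
    "Tot VOR": [110, 90, 198],
    "RES VOR": [1, 2, 7],
})
# Resistance fraction = resistant CFUs / total CFUs, per azole.
df["RF ITR"] = df["RES ITR"] / df["Tot ITR"]
df["RF VOR"] = df["RES VOR"] / df["Tot VOR"]
# The exclusion rule: keep samples whose plate counts fall within 25-300 CFUs.
counts = df[["Tot ITR", "Tot VOR"]]
ok = df[(counts.ge(25) & counts.le(300)).all(axis=1)]
print(ok[["Sample", "RF ITR", "RF VOR"]])
```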
Analysis_genotypic_schimmelradar_TR_types.csv: The genotype data inferred from gel electrophoresis for all resistant isolates
TR type: Indicates the length of the tandem repeats in bp, as judged from a gel. 34 bp, 46 bp, or multiples of 46.
Plate: 96-well plate on which the isolate was cultured
96-well: well in which the isolate was cultured
Azole: Azole on which the isolate was grown and resistant to. Itraconazole (ITRA) or Voriconazole (VORI).
Sample: The air sample the isolate was taken from, corresponds to “Sample” in “Schimmelradar_360_submission.csv”.
Strata: The number that equates to “Area” in “Schimmelradar_360_submission.csv”.
WT: A binary that indicates whether an isolate had a wildtype cyp51a promoter.
TR34: A binary that indicates whether an isolate had a TR34 cyp51a promoter.
TR46: A binary that indicates whether an isolate had a TR46 cyp51a promoter.
TR46_3: A binary that indicates whether an isolate had a TR46*3 cyp51a promoter.
TR46_4: A binary that indicates whether an isolate had a TR46*4 cyp51a promoter.
Script 1 - generation_100_equisized_areas_NL
NOTE: Running this code is not necessary for the other analyses; it was used primarily for sample selection. The area distribution was used during the analysis in script 2B, but each sample is already linked to an area in “Schimmelradar_360_submission.csv". This script was written to generate a spatial polygons data frame of 100 equisized areas of the Netherlands. The registrations for the citizen science project Schimmelradar were binned into these areas to facilitate a relatively even distribution of samples throughout the country, which can be seen in Figure S1. The spatial polygons data frame can be opened and displayed in open-source software such as QGIS. The package “spcosa” used to generate the areas has rJava as a dependency, so Java must be installed to run this script. The R script uses a shapefile of the Netherlands from the tmap package to generate the areas within the Netherlands. Generating a similar distribution for other countries will require different shapefiles!
Script 2 - Spatial_data_integration_fungalradar
This script produces 4 data files that describe land use in the Netherlands: The three focal.RData files with land use and resistant/colony counts, as well as the “Predictor_raster_NL.tif” land use file.
In this script, both the phenotypic and genotypic resistance spatial data from the air samples taken during the Fungal radar citizen science project are integrated with the land use and weather data used to model them. It is not recommended to run this code because the data extraction is fairly computationally demanding and it does not itself contain key statistical analyses. Rather it is used to generate the objects used for modelling and spatial predictions that are also included in this repository.
The phenotypic resistance is summarised in Table 1, which is generated in this script. Subsequently, the spatial data from the LGN22 and BRP datasets are integrated into the data. These datasets can be loaded from the "LGN2022.tif" and "Gewas22rast.tiff" raster files, respectively. Links to the webpages where these files can be downloaded can be found in the code.
Once the raster files are loaded, the code generates heatmaps and calculates the proportions of all the land use classes within both a 5 and a 10 km radius around every sample, and across the country to make spatial predictions. Only the 10 km radius data are used in the later analysis; the 5 km radius was generated during an earlier stage of the analyses to test whether that radius would be more appropriate, and was left in for completeness. For documentation of the LGN22 dataset, we refer to https://lgn.nl/documentatie, and for BRP to https://nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/44e6d4d3-8fc5-47d6-8712-33dd6d244eef; both of these online resources are in Dutch but can be readily translated. A list of the variables from these datasets that were included during model selection can be found in Table S3. Alongside the land-use data, the code extracts weather data from files that can be downloaded from https://cds.climate.copernicus.eu/datasets/sis-agrometeorological-indicators?tab=download for the Netherlands during the sampling window; dates and dimensions are listed within the code. The Weather_schimmelradar folder contains one subfolder for each weather variable that was considered during modelling: temperature, wind speed, precipitation and humidity. Each of these subfolders contains 44 .nc files, each covering the daily mean of the respective weather variable across the Netherlands for one of the 44 days of the sampling window the citizen scientists were given.
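The radius-based land-use proportions can be illustrated with a minimal sketch (Python with a toy categorical raster, not the project's own R code): for each sample location, count the raster cells of each class within a circular window and normalise.

```python
import numpy as np

def class_proportions(classes: np.ndarray, row: int, col: int, radius_px: int):
    """Proportion of each land-use class within a circular window of
    radius_px pixels around (row, col) of a categorical raster."""
    rr, cc = np.ogrid[:classes.shape[0], :classes.shape[1]]
    disk = (rr - row) ** 2 + (cc - col) ** 2 <= radius_px ** 2
    values, counts = np.unique(classes[disk], return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Toy 100x100 categorical raster with invented classes 1, 2 and 3;
# a real run would read the LGN22/BRP rasters and use a 10 km radius
# expressed in pixels.
rng = np.random.default_rng(1)
raster = rng.choice([1, 2, 3], size=(100, 100), p=[0.5, 0.3, 0.2])
props = class_proportions(raster, row=50, col=50, radius_px=10)
print(props)
```

The proportions always sum to one, which makes them directly usable as predictors in the downstream models.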
All spatial objects (weather and land use) are eventually merged into one predictor raster "Predictor_raster_NL.tif". The land use fractions and weather data are subsequently integrated with the air sample data into a single spatial data frame along with the resistance data and saved into an R object "Schimmelradar360spat_focal.RData". The script concludes by merging the cyp51a haplotype data with this object as well, to create two different objects: "Schimmelradar360spat_focal_TR_VORI.RData" for the haplotype data of the voriconazole-resistant isolates and "Schimmelradar360spat_focal_TR_ITRA.RData" for the haplotype data of the itraconazole-resistant isolates. These two datasets are modelled separately in scripts 5 and 9 and scripts 6 and 8, respectively. This final section of the script also generates summary table S2, which summarises the frequency of the cyp51a haplotypes per azole treatment.
If the relevant objects are loaded
The 1996 Zambia Demographic and Health Survey (ZDHS) is a nationally representative survey conducted by the Central Statistical Office at the request of the Ministry of Health, with the aim of gathering reliable information on fertility, childhood and maternal mortality rates, maternal and child health indicators, contraceptive knowledge and use, and knowledge and prevalence of sexually transmitted diseases (STDs), including AIDS. The survey is a follow-up to the Zambia DHS survey carried out in 1992.
The primary objectives of the ZDHS are: - To collect up-to-date information on fertility, infant and child mortality and family planning; - To collect information on health-related matters such as breastfeeding, antenatal care, children's immunisations and childhood diseases; - To assess the nutritional status of mothers and children; - To support dissemination and utilisation of the results in planning, managing and improving family planning and health services in the country; and - To enhance the survey capabilities of the institutions involved in order to facilitate the implementation of surveys of this type in the future.
SUMMARY OF FINDINGS
FERTILITY
FAMILY PLANNING
MATERNAL AND CHILD HEALTH
The 1996 Zambia Demographic and Health Survey (ZDHS) is a nationally representative survey. The sample was designed to produce reliable estimates for the country as a whole, for the urban and the rural areas separately, and for each of the nine provinces in the country.
The survey covered all de jure household members (usual residents), all women of reproductive age (15-49 years) in the total sample of households, men aged 15-59, and children under age 5 resident in the household.
Sample survey data
The 1996 ZDHS covered the population residing in private households in the country. The design for the ZDHS called for a representative probability sample of approximately 8,000 completed individual interviews with women between the ages of 15 and 49. It is designed principally to produce reliable estimates for the country as a whole, for the urban and the rural areas separately, and for each of the nine provinces in the country. In addition to the sample of women, a sub-sample of about 2,000 men between the ages of 15 and 59 was also designed and selected to allow for the study of AIDS knowledge and other topics.
SAMPLING FRAME
Zambia is divided administratively into nine provinces and 57 districts. For the Census of Population, Housing and Agriculture of 1990, the whole country was demarcated into census supervisory areas (CSAs). Each CSA was in turn divided into standard enumeration areas (SEAs) of approximately equal size. For the 1992 ZDHS, this frame of about 4,200 CSAs and their corresponding SEAs served as the sampling frame. The measure of size was the number of households obtained during a quick count operation carried out in 1987. These same CSAs and SEAs were later updated with new measures of size which are the actual numbers of households and population figures obtained in the census. The sample for the 1996 ZDHS was selected from this updated CSA and SEA frame.
CHARACTERISTICS OF THE SAMPLE
The sample for ZDHS was selected in three stages. At the first stage, 312 primary sampling units corresponding to the CSAs were selected from the frame of CSAs with probability proportional to size, the size being the number of households obtained from the 1990 census. At the second stage, one SEA was selected, again with probability proportional to size, within each selected CSA. An updating of the maps as well as a complete listing of the households in the selected SEAs was carried out. The list of households obtained was used as the frame for the third-stage sampling in which households were selected for interview. Women between the ages of 15 and 49 were identified in these households and interviewed. Men between the ages of 15 and 59 were also interviewed, but only in one-fourth of the households selected for the women's survey.
SAMPLE ALLOCATION
The provinces, stratified by urban and rural areas, were the sampling strata. There were thus 18 strata. The proportional allocation would result in a completely self-weighting sample but would not allow for reliable estimates for at least three of the nine provinces, namely Luapula, North-Western and Western. Results of other demographic and health surveys show that a minimum sample of 800-1,000 women is required in order to obtain estimates of fertility and childhood mortality rates at an acceptable level of sampling errors. It was decided to allocate a sample of 1,000 women to each of the three largest provinces, and a sample of 800 women to the two smallest provinces. The remaining provinces got samples of 850 women. Within each province, the sample was distributed approximately proportionally to the urban and rural areas.
STRATIFICATION AND SYSTEMATIC SELECTION OF CLUSTERS
A cluster is the ultimate area unit retained in the survey. In the 1992 ZDHS and the 1996 ZDHS, the cluster corresponds exactly to an SEA selected from the CSA that contains it. In order to decrease sampling errors of comparisons over time between 1992 and 1996--it was decided that as many as possible of the 1992 clusters be retained. After carefully examining the 262 CSAs that were included in the 1992 ZDHS, locating them in the updated frame and verifying their SEA composition, it was decided to retain 213 CSAs (and their corresponding SEAs). This amounted to almost 70 percent of the new sample. Only 99 new CSAs and their corresponding SEAs were selected.
As in the 1992 ZDHS, stratification of the CSAs was only geographic. In each stratum, the CSAs were listed by districts ordered geographically. The procedure for selecting CSAs in each stratum consisted of: (1) calculating the sampling interval I for the stratum; (2) calculating the cumulated size of each CSA; (3) calculating the series of sampling numbers R, R+I, R+2I, ..., R+(a-1)I, where R is a random number between 1 and I and a is the number of CSAs to be selected in the stratum; (4) comparing each sampling number with the cumulated sizes.
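The systematic probability-proportional-to-size selection described in steps (1)-(4) can be sketched as follows (illustrative Python with invented household counts, not the survey's actual frame):

```python
import random

def systematic_pps(sizes, a, seed=0):
    """Select a units with probability proportional to size:
    cumulate the sizes, compute the interval I = total / a, draw a
    random start R in [1, I], then walk the series R, R+I, ..., R+(a-1)I
    and pick each unit whose cumulated size first reaches a sampling number."""
    total = sum(sizes)
    interval = total / a
    random.seed(seed)
    r = random.uniform(1, interval)
    targets = [r + i * interval for i in range(a)]
    selected, cum, k = [], 0, 0
    for idx, s in enumerate(sizes):
        cum += s
        while k < a and targets[k] <= cum:
            selected.append(idx)
            k += 1
    return selected

# Toy CSA household counts; a real run would use the 1990 census counts
# per stratum as the measure of size.
households = [120, 340, 90, 410, 260, 150, 380, 200]
picked = systematic_pps(households, a=3)
print(picked)
```

Larger CSAs span a wider slice of the cumulated-size axis, so they are proportionally more likely to contain a sampling number.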
The reasons for not
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data supplement the article Schomaker, J., Walper, D., Wittmann, B.C., & Einhäuser, W. (2017). Attention in natural scenes: Affective-motivational factors guide gaze independently of visual salience. Vision Research, 133, 161-175.
Use is free for academic purposes, provided the aforementioned article is appropriately cited.
The directory contains the following files
stimuli.tar.gz - stimuli used in this study; note that this is based on the MONS database, but some deviations from the final version of the database do exist.
ratings.mat contains the variables:
arousal - mean arousal rating
valence - mean valence rating
valence2 - squared mean valence rating (after subtracting midpoint)
motivationalValue - mean motivation rating
motivationalValue2 - squared mean motivation rating (after subtracting midpoint)
All variables are 104x3, where the first dimension is the stimulus number, and the second dimension the motivation ground truth (aversive, neutral, appetitive)
Experiment 1
fixationsExperiment1.mat contains the variables fixationX, fixationY, fixationDuration, fixationOnset, fixationInitial, which contain, for each fixation, the horizontal and vertical coordinate, the duration, the time of onset relative to trial onset, and whether it is the initial fixation. All variables have dimensions 16x104x3x50, where the first dimension is the observer, the second the scene, the third the condition and the fourth a counter of fixations. Whenever there are fewer than 50 fixations, the remainder are filled with NaN.
boundingBoxesExperiment1.mat contains for each critical object the bounding box coordinates x,y of upper left corner and width and height as variables boundingBoxX, boundingBoxY, boundingBoxW, boundingBoxH respectively. Note that this is relative to the eyetracker coordinates of experiment 1 (full display 1024x768, presentation in the center) and will therefore not match the coordinates of the images in the archive or the bounding box coordinates of experiment 2. Dimensions are 104x3, the dimensions representing scene number and condition, respectively.
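A typical use of these two files is testing whether a fixation landed inside the critical object's bounding box. A minimal sketch (Python with invented coordinates; the real .mat arrays would be read with scipy.io.loadmat and carry the dimensions described above):

```python
import numpy as np

def in_box(fx, fy, bx, by, bw, bh):
    """True where a fixation (fx, fy) falls inside the bounding box with
    upper-left corner (bx, by), width bw and height bh."""
    return (fx >= bx) & (fx <= bx + bw) & (fy >= by) & (fy <= by + bh)

# Toy values in the experiment-1 coordinate frame (1024x768 display);
# box and fixation coordinates are invented for illustration.
fixationX = np.array([500.0, 100.0, 650.0])
fixationY = np.array([400.0, 700.0, 380.0])
hits = in_box(fixationX, fixationY, bx=480, by=360, bw=200, bh=100)
print(hits)  # [ True False  True]
```

Note the caveat above: experiment-1 bounding boxes are in eyetracker coordinates, so fixations and boxes must be in the same frame before this test.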
figure2.m computes figure 2 of the article from these data.
dataForExperiment1.Rdata contains the data frame data, which contains for each fixation the values of the predictors used in the model of table 1. This is computed from the matlab data listed above in addition to the peak values of the AWS salience in the object.
table1.R computes and prints the models for table 1
Experiment 2
fixationsExperiment2.mat contains fixation data for experiment 2. Variable names as in experiment 1. Dimensions are 18x99x3x3x50, where the first dimension is the observer, the second the image number, the third the visual condition, the fourth the motivational condition and the fifth the fixation count. Since only one visual condition was shown to each observer per motivational condition, there is an additional variable 'hasData', which is 1 if the image was presented to the observer in this condition and 0 otherwise. Since fixations can fall outside the image and are therefore excluded, there is also an additional variable fixationNumber to keep a correct count of the fixation number within the trial.
boundingBoxesExperiment2.mat contains bounding box data for experiment 2 in image (and fixation) coordinates. Notation as for experiment 1, but coordinates refer to image and eyetracking coordinates used for experiment 2 and therefore can differ occasionally.
figure3and4.m generates figures 3 and 4 of the article from these data files.
dataForExperiment2.Rdata contains the data frame data, which contains for each fixation the values of the predictors used in the model of tables 2 and 3. This is computed from the matlab data listed above in addition to the peak values of the AWS salience in the object. The fields imgMot and imgVis contain the motivational ground truth and the salience manipulation, respectively.
table2.R uses the Rdata file to compute the models for table 2 of the article and print summary results
table3.R uses the Rdata file to compute the models for table 3 of the article and print summary results. Note that the computation can take substantial time; results might deviate slightly depending on the exact version of R and its libraries used.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary
The ITOP dataset (Invariant Top View) contains 100K depth images from side and top views of a person in a scene. For each image, the locations of 15 human body parts are labeled with 3-dimensional (x,y,z) coordinates, relative to the sensor's position. Read the full paper for more context.
Getting Started
Download then decompress the h5.gz file.
gunzip ITOP_side_test_depth_map.h5.gz
Using Python and h5py (pip install h5py or conda install h5py), we can load the contents:
import h5py
import numpy as np

f = h5py.File('ITOP_side_test_depth_map.h5', 'r')
data, ids = f.get('data'), f.get('id')
data, ids = np.asarray(data), np.asarray(ids)
print(data.shape, ids.shape)
Note: For any of the *_images.h5.gz files, the underlying file is a tar file and not a h5 file. Please rename the file extension from h5.gz to tar.gz before opening. The following commands will work:
mv ITOP_side_test_images.h5.gz ITOP_side_test_images.tar.gz
tar xf ITOP_side_test_images.tar.gz
Metadata
File sizes for images, depth maps, point clouds, and labels refer to the uncompressed size.
+-------+--------+---------+---------+----------+------------+--------------+---------+
| View  | Split  | Frames  | People  | Images   | Depth Map  | Point Cloud  | Labels  |
+-------+--------+---------+---------+----------+------------+--------------+---------+
| Side  | Train  | 39,795  | 16      | 1.1 GiB  | 5.7 GiB    | 18 GiB       | 2.9 GiB |
| Side  | Test   | 10,501  | 4       | 276 MiB  | 1.6 GiB    | 4.6 GiB      | 771 MiB |
| Top   | Train  | 39,795  | 16      | 974 MiB  | 5.7 GiB    | 18 GiB       | 2.9 GiB |
| Top   | Test   | 10,501  | 4       | 261 MiB  | 1.6 GiB    | 4.6 GiB      | 771 MiB |
+-------+--------+---------+---------+----------+------------+--------------+---------+
Data Schema
Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. The key refers to the (HDF5) dataset name. Let (n) denote the number of images.
Transformation
To convert from point clouds to a (240 \times 320) image, the following transformations were used. Let (x_{\textrm{img}}) and (y_{\textrm{img}}) denote the ((x,y)) coordinate in the image plane. Using the raw point cloud ((x,y,z)) real world coordinates, we compute the depth map as follows: (x_{\textrm{img}} = \frac{x}{Cz} + 160) and (y_{\textrm{img}} = -\frac{y}{Cz} + 120) where (C \approx 3.50 \times 10^{-3} = 0.0035) is the intrinsic camera calibration parameter. This results in the depth map: ((x_{\textrm{img}}, y_{\textrm{img}}, z)).
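The transformation above translates directly into code; this is an illustrative sketch, not a utility shipped with the dataset:

```python
import numpy as np

C = 3.50e-3  # intrinsic camera calibration parameter from the description above

def world_to_image(points: np.ndarray) -> np.ndarray:
    """Project real-world (x, y, z) points (meters) onto the 320x240
    image plane: x_img = x / (C z) + 160, y_img = -y / (C z) + 120."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    x_img = x / (C * z) + 160
    y_img = -y / (C * z) + 120
    return np.stack([x_img, y_img], axis=1)

# A point on the optical axis (x = y = 0) should land at the image center.
center = world_to_image(np.array([[0.0, 0.0, 2.0]]))
print(center)  # [[160. 120.]]
```

Note the y-axis sign flip: world y points up while image rows grow downward.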
Joint ID (Index) Mapping
joint_id_to_name = {
  0: 'Head',        8: 'Torso',
  1: 'Neck',        9: 'R Hip',
  2: 'R Shoulder', 10: 'L Hip',
  3: 'L Shoulder', 11: 'R Knee',
  4: 'R Elbow',    12: 'L Knee',
  5: 'L Elbow',    13: 'R Foot',
  6: 'R Hand',     14: 'L Foot',
  7: 'L Hand',
}
Depth Maps
Key: id
Dimensions: ((n,))
Data Type: uint8
Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
Key: data
Dimensions: ((n,240,320))
Data Type: float16
Description: Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real world meters (m).
Point Clouds
Key: id
Dimensions: ((n,))
Data Type: uint8
Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
Key: data
Dimensions: ((n,76800,3))
Data Type: float16
Description: Point cloud containing 76,800 points (240x320). Each point is represented by a 3D tuple measured in real world meters (m).
Labels
Key: id
Dimensions: ((n,))
Data Type: uint8
Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
Key: is_valid
Dimensions: ((n,))
Data Type: uint8
Description: Flag corresponding to the result of the human labeling effort. This is a boolean value (represented by an integer) where a one (1) denotes clean, human-approved data. A zero (0) denotes noisy human body part labels. If is_valid is equal to zero, you should not use any of the provided human joint locations for the particular frame.
Key: visible_joints
Dimensions: ((n,15))
Data Type: int16
Description: Binary mask indicating if each human joint is visible or occluded. This is denoted by (\alpha) in the paper. If (\alpha_j=1) then the (j^{th}) joint is visible (i.e. not occluded). Otherwise, if (\alpha_j = 0) then the (j^{th}) joint is occluded.
Key: image_coordinates
Dimensions: ((n,15,2))
Data Type: int16
Description: Two-dimensional ((x,y)) points corresponding to the location of each joint in the depth image or depth map.
Key: real_world_coordinates
Dimensions: ((n,15,3))
Data Type: float16
Description: Three-dimensional ((x,y,z)) points corresponding to the location of each joint in real world meters (m).
Key: segmentation
Dimensions: ((n,240,320))
Data Type: int8
Description: Pixel-wise assignment of body part labels. The background class (i.e. no body part) is denoted by −1.
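A typical way to combine is_valid and visible_joints when loading the labels is to drop unapproved frames and mask occluded joints. A sketch with synthetic stand-in arrays shaped like the datasets above (a real run would read them from the labels .h5 file via h5py, as in Getting Started):

```python
import numpy as np

# Synthetic stand-ins with the shapes and dtypes documented above.
n = 4
is_valid = np.array([1, 0, 1, 1], dtype=np.uint8)
visible_joints = np.ones((n, 15), dtype=np.int16)
visible_joints[2, 5] = 0  # joint 5 occluded in frame 2
real_world_coordinates = np.random.rand(n, 15, 3).astype(np.float16)

# Keep only human-approved frames (is_valid == 1), then mark occluded
# joints as NaN so they are skipped by downstream metrics.
valid = is_valid == 1
coords = real_world_coordinates[valid].astype(np.float32)
mask = visible_joints[valid] == 0
coords[mask] = np.nan

print(coords.shape)  # (3, 15, 3)
```

Per the documentation above, frames with is_valid equal to zero should not contribute any joint locations at all, which the boolean filter enforces.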
Citation
If you would like to cite our work, please use the following.
Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L. (2016). Towards Viewpoint Invariant 3D Human Pose Estimation. European Conference on Computer Vision. Amsterdam, Netherlands. Springer.
@inproceedings{haque2016viewpoint, title={Towards Viewpoint Invariant 3D Human Pose Estimation}, author={Haque, Albert and Peng, Boya and Luo, Zelun and Alahi, Alexandre and Yeung, Serena and Fei-Fei, Li}, booktitle = {European Conference on Computer Vision}, month = {October}, year = {2016} }
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The human body is an outstandingly complex machine, including around 1000 muscles and joints acting synergistically. Yet the coordination of the enormous number of degrees of freedom needed for movement is mastered by a single brain and spinal cord. The idea that some synergistic neural components of movement exist was already suggested at the beginning of the XX century. Since then, it has been widely accepted that the central nervous system might simplify the production of movement by avoiding the control of each muscle individually. Instead, it might control muscles in common patterns that have been called muscle synergies. Only with the advent of modern computational methods and hardware has it become possible to numerically extract synergies from electromyography (EMG) signals. However, typical experimental setups do not include a large number of individuals, with common sample sizes of five to 20 participants. With this study, we make publicly available a set of EMG activities recorded during treadmill running from the right lower limb of 135 healthy young adults (78 males, 57 females). Moreover, we include in this open-access data set the code used to extract synergies from EMG data using non-negative matrix factorization and the relative outcomes. Muscle synergies, containing the time-invariant muscle weightings (motor modules) and the time-dependent activation coefficients (motor primitives), were extracted from 13 ipsilateral EMG activities using non-negative matrix factorization. Four synergies were enough to describe as many gait cycle phases during running: weight acceptance, propulsion, early swing and late swing. We foresee many possible applications of our data, which we can summarize in three key points. First, it can be a prime source for broadening the representation of human motor control due to the large sample size.
Second, it could serve as a benchmark for scientists from multiple disciplines such as musculoskeletal modelling, robotics, clinical neuroscience, sport science, etc. Third, the data set could be used both to train students or to support established scientists in the perfection of current muscle synergies extraction methods.
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus.
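The factorization described above (EMG envelopes approximated by motor modules times motor primitives) can be sketched with scikit-learn's NMF on synthetic data; this illustrates the decomposition, not the authors' exact pipeline (which uses the R package musclesyneRgies):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic EMG envelope matrix (13 muscles x 200 time points) built from
# 4 ground-truth synergies; real inputs would be the filtered, time-normalized
# envelopes derived from RAW_DATA.RData.
t = np.linspace(0, 1, 200)
primitives = np.stack([np.exp(-((t - c) ** 2) / 0.01) for c in (0.1, 0.35, 0.6, 0.85)])
modules = rng.uniform(0, 1, size=(13, 4))
emg = modules @ primitives + rng.uniform(0, 0.01, size=(13, 200))

# Factorize EMG ~ W H: W holds the motor modules (time-invariant muscle
# weightings), H the motor primitives (time-dependent activations).
nmf = NMF(n_components=4, init="nndsvd", max_iter=1000, random_state=0)
W = nmf.fit_transform(emg)  # (13, 4) motor modules
H = nmf.components_         # (4, 200) motor primitives

reconstruction_r2 = 1 - np.sum((emg - W @ H) ** 2) / np.sum((emg - emg.mean()) ** 2)
print(f"{reconstruction_r2:.3f}")
```

In practice the number of synergies is chosen by such a reconstruction-quality criterion; here four components recover the four gait-phase-like activations by construction.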
The file "dataset.rar" contains data in an older format, not compatible with the R package musclesyneRgies.