CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These data sets are a subset of the pre-processed versions of the multi-omics datasets originally used in the empirical comparison study presented in Hornung & Wright (2019). In the latter paper, survival was considered as the outcome variable, given by the data.frame "targetvar" (first column "status": status indicator, second column "time": survival times). The remaining objects in each Rda file are "clin", "cnv", "mirna", "mutation", and "rna", which contain clinical data, copy number variation data, miRNA data, mutation data, and RNA data, respectively. Hornung et al. (2023) used the subset of the data sets presented here, where the outcome variable was the presence vs. absence of the TP53 mutation ("1" vs. "0"). This information is provided by the "TP53" column of the mutation data. Note that while predicting this outcome is not contextually meaningful, TP53 mutations have been found to be associated with poor clinical outcomes in cancer patients (Wang & Sun, 2017). In this context, TP53 can be used as a surrogate for a phenotypic outcome. These data sets are intended to test machine learning or statistical methods and may not be useful for biological analysis.
References:
R. Hornung, F. Ludwigs, J. Hagenberg, and A.-L. Boulesteix. Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study. arXiv, arXiv:2302.03991, 2023.
R. Hornung and M. N. Wright. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics, 20:358, 2019.
X. Wang and Q. Sun. TP53 mutations, expression and interaction networks in human cancers. Oncotarget, 8(1):624-643, 2017.
Which countries have the most social contacts in the world? In particular, do countries with more social contacts among the elderly report more deaths during a pandemic caused by a respiratory virus?
With the emergence of the COVID-19 pandemic, reports have shown that the elderly are at a higher risk of dying than any other age group. 8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older. Countries have also begun to enforce 2 m social distancing to contain the pandemic.
To this end, I wanted to explore the relationship between social contacts among the elderly and the number of COVID-19 deaths across countries.
This dataset includes a subset of the projected social contact matrices for 152 countries from Prem et al. (2020). The projections were based on the POLYMOD study, in which information on social contacts was obtained using cross-sectional surveys in Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Italy (IT), Luxembourg (LU), The Netherlands (NL), and Poland (PL) between May 2005 and September 2006.
This dataset includes contact rates from study participants ages 65+ for all countries from all sources of contact (work, home, school and others).
I used this R code to extract this data:
load('../input/contacts.Rdata') # https://github.com/kieshaprem/covid19-agestructureSEIR-wuhan-social-distancing/blob/master/data/contacts.Rdata
View(contacts)
contacts[["ALB"]][["home"]]
contacts[["ITA"]][["all"]]
rowSums(contacts[["ALB"]][["all"]])
# Rows 16, 15 and 14 of each "all" matrix are the 75_79, 70_74 and 65_69 age groups
out1 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[16,]; out1 <- rbind(out1, data.frame(x)) }
out2 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[15,]; out2 <- rbind(out2, data.frame(x)) }
out3 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[14,]; out3 <- rbind(out3, data.frame(x)) }
m1 = data.frame(t(matrix(unlist(out1), nrow=16)))
m2 = data.frame(t(matrix(unlist(out2), nrow=16)))
m3 = data.frame(t(matrix(unlist(out3), nrow=16)))
rownames(m1) = names(contacts)
colnames(m1) = c("00_04", "05_09", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", "75_79")
rownames(m2) = rownames(m1)
rownames(m3) = rownames(m1)
colnames(m2) = colnames(m1)
colnames(m3) = colnames(m1)
write.csv(zapsmall(m1),"contacts_75_79.csv", row.names = TRUE)
write.csv(zapsmall(m2),"contacts_70_74.csv", row.names = TRUE)
write.csv(zapsmall(m3),"contacts_65_69.csv", row.names = TRUE)
Row names correspond to the 3-letter ISO country code, e.g. ITA represents Italy. Column names are the age groups of the individuals contacted, in 5-year intervals from 0 to 80 years old. Cell values are the projected mean social contact rates.
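As a minimal sketch of how one of the exported files might be read, the following Python snippet parses a CSV with the layout described above and sums each row to get the total projected contact rate of one elderly age group per country. The inline sample values are made up for illustration and are not taken from the dataset, only three age-group columns are shown, and the helper name `total_contacts_by_country` is hypothetical.

```python
import csv
import io

# Made-up sample in the layout produced by write.csv(..., row.names = TRUE):
# an empty first header cell, then one column per contacted age group.
SAMPLE_CSV = '''"","00_04","05_09","10_14"
"ALB",0.10,0.20,0.30
"ITA",0.05,0.15,0.25
'''

def total_contacts_by_country(csv_text):
    """Return {ISO code: sum of contact rates across all age-group columns}."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (row-name cell plus age-group labels)
    totals = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        country = row[0]
        rates = [float(value) for value in row[1:]]
        totals[country] = sum(rates)
    return totals

totals = total_contacts_by_country(SAMPLE_CSV)
for country, total in sorted(totals.items()):
    print(country, round(total, 2))
```

To apply this to a real file, the text of, for example, contacts_75_79.csv could be passed in place of the sample string.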
[Figure: contact distribution (contact_distribution.png)]
Thanks go to Dr. Kiesha Prem for her correspondence and to her team for publishing their work on social contact matrices.
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Overview
The zip file contains two netCDF files with a subset of data from the "No Change 2" (NC2) experiment conducted by Savi et al. (2020) and published in Earth Surface Dynamics (https://doi.org/10.5194/esurf-8-303-2020). The original data are available via the Sediment Experimentalists Network Project Space SEAD Internal Repository (https://doi.org/10.26009/s0ZOQ0S6). Topographic scan data were re-formatted into the netCDF file "T_NC2_scans.nc". Overhead imagery was extracted from the video of the experiment approximately once every minute of experimental time, and the RGB band data are provided in the netCDF file "T_NC2_images.nc". These data were formatted into netCDF files for easy loading into the "deltametrics" analysis toolbox.
Additional Details
Re-packaging the scan data from the .tif files was straightforward. From the metadata spreadsheet, we know the times at which the scans were taken (and can eliminate the redundant scan). From the paper itself we know the resolution of the topographic scans is 1 mm in the horizontal and vertical. We also know the input discharges, both water and sediment, through both the main channel and tributary, from the paper. We provide these values as metadata in the netCDF files. The scans form the 'eta' field representing the topography in the file. The packaged up netCDF file is called 'T_NC2_scans.nc'.
Overhead imagery from the T_NC2_Complete21fps.wmv video file was extracted using the following command:
ffmpeg -i T_NC2_Complete21fps.wmv -r 21 T_NC2_frames/%04d.png
This command uses the ffmpeg tool to extract frames at a rate of 21 frames per second (-r 21), since the file name implies that this is the rate at which the overhead photos were combined into a video. The NC designation indicates that this experiment was performed with no change in the input conditions in either the main or tributary channels.
The experiment ran for a total of 480 minutes. A total of 1466 images were obtained from the ffmpeg extraction. This translates to an image approximately every 20 seconds of real time (480 minutes / 1466 frames * 60 seconds/minute = 19.6453 seconds / frame). We sample every 3rd frame, which gives us images roughly once a minute (489 frames in all), to create the subset of data re-packaged as a netCDF file for deltametrics. Dimensions for the pixels were approximated based on our knowledge of the topographic scan resolution. Assuming the extents of the scans and overhead images are the same (although they are not), we calculate the number of millimeters per pixel in the x and y directions for the overhead images. We assume the pixels are more likely to be square than rectangular, so we average these values and assign this as the distance per pixel in both the x and y dimensions for these data.
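The arithmetic above can be reproduced with a short Python sketch. The run time, frame count, and sampling step come from the description; starting the subsample at the first frame is an assumption, and the extents and image size passed to the hypothetical helper `square_pixel_size_mm` are placeholder values, since the actual ones are not stated here.

```python
EXPERIMENT_MINUTES = 480   # total run time, from the description
EXTRACTED_FRAMES = 1466    # frames produced by the ffmpeg extraction
STEP = 3                   # keep every 3rd frame

# Real time represented by one extracted frame.
seconds_per_frame = EXPERIMENT_MINUTES * 60 / EXTRACTED_FRAMES

# Assumption: sampling starts at the first frame (index 0).
kept_indices = list(range(0, EXTRACTED_FRAMES, STEP))
seconds_between_kept = seconds_per_frame * STEP

def square_pixel_size_mm(extent_x_mm, extent_y_mm, n_px_x, n_px_y):
    """Average the x and y mm-per-pixel estimates into one square pixel size."""
    return 0.5 * (extent_x_mm / n_px_x + extent_y_mm / n_px_y)

# Hypothetical extents (mm) and image size (pixels), for illustration only.
pixel_mm = square_pixel_size_mm(2000.0, 1000.0, 1000, 400)

print(f"{seconds_per_frame:.4f} s per extracted frame")
print(f"{len(kept_indices)} frames kept, about {seconds_between_kept:.1f} s apart")
print(f"{pixel_mm} mm per pixel (hypothetical inputs)")
```

Averaging the two per-axis estimates is simply the square-pixel assumption stated above; it does not correct for the mismatch in extents between the scans and the overhead images.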
References
Savi, Sara, et al. "Interactions between main channels and tributary alluvial fans: channel adjustments and sediment-signal propagation." Earth Surface Dynamics 8.2 (2020): 303-322.
S. Savi, Tofelde, A. Wickert, A. Bufe, T. Schildgen, and M. Strecker. "Physical experiments on interactions between main-channels and tributary alluvial fans." https://doi.org/10.26009/s0ZOQ0S6