3 datasets found

Data sets used in the empirical comparison study presented in "Prediction...
figshare.com
zip
Updated Apr 5, 2023
Roman Hornung (2023). Data sets used in the empirical comparison study presented in "Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study" [Dataset]. http://doi.org/10.6084/m9.figshare.22304050.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.22304050.v2
Dataset updated
Apr 5, 2023
Dataset provided by
Figshare: http://figshare.com/
Authors
Roman Hornung
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
These data sets are a subset of the pre-processed versions of the multi-omics datasets originally used in the empirical comparison study presented in Hornung & Wright (2019). In the latter paper, survival was considered as the outcome variable, given by the data.frame "targetvar" (first column "status": status indicator, second column "time": survival times). The remaining objects in each Rda file are "clin", "cnv", "mirna", "mutation", and "rna", which contain clinical data, copy number variation data, miRNA data, mutation data, and RNA data, respectively. Hornung et al. (2023) used the subset of the data sets presented here, where the outcome variable was the presence vs. absence of the TP53 mutation ("1" vs. "0"). This information is provided by the "TP53" column of the mutation data. Note that while predicting this outcome is not contextually meaningful, TP53 mutations have been found to be associated with poor clinical outcomes in cancer patients (Wang & Sun, 2017). In this context, TP53 can be used as a surrogate for a phenotypic outcome. These data sets are intended to test machine learning or statistical methods and may not be useful for biological analysis.
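As a minimal sketch of the outcome definition described above, the following Python snippet separates a binary "TP53" column from the remaining mutation covariates. The toy table, the gene names other than TP53, and the helper `split_outcome` are hypothetical; removing TP53 from the covariates is an assumption made here for illustration, not a documented step of the original study, and the actual data ship as R objects in Rda files.

```python
# Toy stand-in for the "mutation" object: one dict per patient, gene -> 0/1.
# All values and the gene names other than TP53 are made up for illustration.
mutation = [
    {"TP53": 1, "GENE_A": 0, "GENE_B": 1},
    {"TP53": 0, "GENE_A": 1, "GENE_B": 0},
    {"TP53": 1, "GENE_A": 0, "GENE_B": 0},
]


def split_outcome(rows, outcome_col="TP53"):
    """Return (y, X): the outcome column and the remaining covariates.

    Assumption: the outcome column is dropped from the covariates so that
    it cannot be used to predict itself.
    """
    y = [row[outcome_col] for row in rows]
    X = [{k: v for k, v in row.items() if k != outcome_col} for row in rows]
    return y, X


y, X = split_outcome(mutation)
```

Here `y` holds the "1" vs. "0" labels and `X` the mutation covariates without the outcome column.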

References:

R. Hornung, F. Ludwigs, J. Hagenberg, and A.-L. Boulesteix. Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study. arXiv, arXiv:2302.03991, 2023.

R. Hornung and M. N. Wright. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics, 20:358, 2019.

X. Wang and Q. Sun. TP53 mutations, expression and interaction networks in human cancers. Oncotarget, 8(1):624-643, 2017.
Social Contacts
kaggle.com
zip
Updated Apr 29, 2020
Patrick (2020). Social Contacts [Dataset]. https://www.kaggle.com/bitsnpieces/social-contacts
Available download formats: zip (33056 bytes)
Dataset updated
Apr 29, 2020
Authors
Patrick
Description
Inspiration

Which countries have the most social contacts in the world? In particular, do countries with more social contacts among the elderly report more deaths during a pandemic caused by a respiratory virus?

Context

With the emergence of the COVID-19 pandemic, reports have shown that the elderly are at a higher risk of dying than any other age group. 8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older. Countries have also begun to enforce 2 m social distancing to contain the pandemic.

To this end, I wanted to explore the relationship between social contacts among the elderly and the number of COVID-19 deaths across countries.

Content

This dataset includes a subset of the projected social contact matrices for 152 countries from Prem et al. 2020. The projections were based on the POLYMOD study, in which information on social contacts was obtained using cross-sectional surveys in Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Italy (IT), Luxembourg (LU), The Netherlands (NL), and Poland (PL) between May 2005 and September 2006.

This dataset includes contact rates for study participants aged 65+ for all countries, from all sources of contact (work, home, school and others).

I used this R code to extract this data:

# Source: https://github.com/kieshaprem/covid19-agestructureSEIR-wuhan-social-distancing/blob/master/data/contacts.Rdata
load('../input/contacts.Rdata')

# Inspect a few matrices
View(contacts)
contacts[["ALB"]][["home"]]
contacts[["ITA"]][["all"]]
rowSums(contacts[["ALB"]][["all"]])

# Rows 16, 15 and 14 of each "all" matrix are written out below as the
# 75_79, 70_74 and 65_69 age groups; collect that row for every country
out1 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[16,]; out1 <- rbind(out1, data.frame(x)) }
out2 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[15,]; out2 <- rbind(out2, data.frame(x)) }
out3 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[14,]; out3 <- rbind(out3, data.frame(x)) }

# Reshape to one row per country and one column per contacted age group
m1 = data.frame(t(matrix(unlist(out1), nrow=16)))
m2 = data.frame(t(matrix(unlist(out2), nrow=16)))
m3 = data.frame(t(matrix(unlist(out3), nrow=16)))

rownames(m1) = names(contacts)
colnames(m1) = c("00_04", "05_09", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", "75_79")
rownames(m2) = rownames(m1)
rownames(m3) = rownames(m1)
colnames(m2) = colnames(m1)
colnames(m3) = colnames(m1)

write.csv(zapsmall(m1),"contacts_75_79.csv", row.names = TRUE)
write.csv(zapsmall(m2),"contacts_70_74.csv", row.names = TRUE)
write.csv(zapsmall(m3),"contacts_65_69.csv", row.names = TRUE)

Row names correspond to the 3-letter country ISO code, e.g. ITA represents Italy. Column names are the age groups of the individuals contacted, in 5-year intervals from 0 to 80 years old. Cell values are the projected mean social contact rate.
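The layout described above (one row per country, one column per contacted age group) can be sketched in Python as follows. This is only an illustration of the row-extraction idea, not the code used to build the dataset: the nested-dict input `toy`, its made-up values, and the helper `extract_row` are hypothetical stand-ins for the R `contacts` list.

```python
# Column labels: 16 five-year age groups, "00_04" through "75_79".
AGE_GROUPS = [f"{a:02d}_{a + 4:02d}" for a in range(0, 80, 5)]


def extract_row(contacts, row_index):
    """For each country, take one row of its 'all' contact matrix
    and label the entries with the contacted age groups."""
    table = {}
    for country, settings in contacts.items():
        row = settings["all"][row_index]
        table[country] = dict(zip(AGE_GROUPS, row))
    return table


# Hypothetical toy input: 16x16 matrices filled with made-up numbers.
toy = {
    "ITA": {"all": [[0.1 * (i + 1)] * 16 for i in range(16)]},
    "ALB": {"all": [[0.2 * (i + 1)] * 16 for i in range(16)]},
}

# R row 16 (the 75_79 group) is index 15 in zero-based Python.
contacts_75_79 = extract_row(toy, 15)
```

Indexing `contacts_75_79["ITA"]["00_04"]` then gives the contact rate between the 75_79 group in Italy and contacts aged 0 to 4 in the toy data.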

[Image: contact_distribution.png]

Acknowledgements

Thanks go to Dr. Kiesha Prem for her correspondence and to her team for publishing their work on social contact matrices.

References

The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study

Projecting social contact matrices in 152 countries using contact surveys and demographic data

Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases (POLYMOD study)

Related resources

My starter notebook

http://www.socialcontactdata.org/

https://www.kaggle.com/tsubasatwi/close-contact-status-of-corona-in-japan

Facebook Data for Good Mobility Dashboard
Savi et al., 2020 -- tributary-main-channel interaction experiments --...
zenodo.org
zip
Updated Sep 14, 2022
Jayaram Hariharan (2022). Savi et al., 2020 -- tributary-main-channel interaction experiments -- Experiment No Change 2 subset as netCDF files [Dataset]. http://doi.org/10.5281/zenodo.7047109
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7047109
Dataset updated
Sep 14, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Jayaram Hariharan
License
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
Overview

The zip file contains two netCDF files with a subset of data from the "No Change 2" (NC2) experiment conducted by Savi et al., 2020 and published in Earth Surface Dynamics (https://doi.org/10.5194/esurf-8-303-2020); the original data are available via the Sediment Experimentalists Network Project Space SEAD Internal Repository (https://doi.org/10.26009/s0ZOQ0S6). Topographic scan data were re-formatted into the netCDF file "T_NC2_scans.nc". Overhead imagery was extracted from the video of the experiment approximately once every minute of experimental time, and its RGB band data are provided in the netCDF file "T_NC2_images.nc". These data were formatted into netCDF files for easy loading into the "deltametrics" analysis toolbox.

Additional Details

Re-packaging the scan data from the .tif files was straightforward. From the metadata spreadsheet, we know the times at which the scans were taken (and can eliminate the redundant scan). From the paper itself we know the resolution of the topographic scans is 1 mm in the horizontal and vertical. We also know the input discharges, both water and sediment, through both the main channel and tributary, from the paper. We provide these values as metadata in the netCDF files. The scans form the 'eta' field representing the topography in the file. The packaged up netCDF file is called 'T_NC2_scans.nc'.

Overhead imagery from the T_NC2_Complete21fps.wmv video file was extracted using the following command:

ffmpeg -i T_NC2_Complete21fps.wmv -r 21 T_NC2_frames/%04d.png

This command uses the ffmpeg tool to extract frames at a rate of 21 frames per second (-r 21), because the file name implies that this is the rate at which the overhead photos were combined into the video. The NC designation indicates that this experiment was performed with no change in the input conditions in either the main or tributary channels.

The experiment ran for a total of 480 minutes. A total of 1466 images were obtained from the ffmpeg extraction. This translates to an image approximately every 20 seconds of real time (480 minutes / 1466 frames * 60 seconds/minute = 19.6453 seconds / frame). We sample every 3rd frame, which gives us images roughly once a minute (489 frames in all), to create the subset of data re-packaged as a netCDF file for deltametrics. Dimensions for the pixels were approximated based on our knowledge of the topographic scan resolution. Assuming the extents of the scans and overhead images are the same (although they are not), we calculate the number of millimeters per pixel in the x and y directions for the overhead images. We assume the pixels are more likely to be square than rectangular, so we average these values and assign this as the distance per pixel in both the x and y dimensions for these data.
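The arithmetic in the paragraph above can be checked with a short Python sketch. The duration, frame count, and sampling step come from the description; the helper `mm_per_pixel` and any extents or image sizes passed to it are hypothetical, since the actual scan extents and image dimensions are not stated here.

```python
import math

total_minutes = 480   # experiment duration (from the description)
n_frames = 1466       # frames produced by the ffmpeg extraction
step = 3              # keep every 3rd frame

# Real time represented by one extracted frame, in seconds.
seconds_per_frame = total_minutes * 60 / n_frames

# Keeping frames 0, 3, 6, ... yields ceil(n_frames / step) frames.
n_kept = math.ceil(n_frames / step)

# Real time between two kept frames, in seconds (roughly one minute).
seconds_per_kept = seconds_per_frame * step


def mm_per_pixel(extent_x_mm, extent_y_mm, width_px, height_px):
    """Average the x and y scales, treating pixels as square.

    Assumption: the image covers the same extent as the scans, which
    the description notes is only approximately true.
    """
    return (extent_x_mm / width_px + extent_y_mm / height_px) / 2
```

With these numbers, `seconds_per_frame` is about 19.65 seconds, `n_kept` is 489, and consecutive kept frames are about 59 seconds apart, which matches the "roughly once a minute" spacing.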

References

Savi, Sara, et al. "Interactions between main channels and tributary alluvial fans: channel adjustments and sediment-signal propagation." Earth Surface Dynamics 8.2 (2020): 303-322.

Physical experiments on interactions between main-channels and tributary alluvial fans
S. Savi, Tofelde, A. Wickert, A. Bufe, T. Schildgen, and M. Strecker
https://doi.org/10.26009/s0ZOQ0S6