3 datasets found
  1. Data sets used in the empirical comparison study presented in "Prediction...

    • figshare.com
    zip
    Updated Apr 5, 2023
    Cite
    Roman Hornung (2023). Data sets used in the empirical comparison study presented in "Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study" [Dataset]. http://doi.org/10.6084/m9.figshare.22304050.v2
    Available download formats: zip
    Dataset updated: Apr 5, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Roman Hornung
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    These data sets are a subset of the pre-processed versions of the multi-omics datasets originally used in the empirical comparison study presented in Hornung & Wright (2019). In the latter paper, survival was considered as the outcome variable, given by the data.frame "targetvar" (first column "status": status indicator, second column "time": survival times). The remaining objects in each Rda file are "clin", "cnv", "mirna", "mutation", and "rna", which contain clinical data, copy number variation data, miRNA data, mutation data, and RNA data, respectively. Hornung et al. (2023) used the subset of the data sets presented here, where the outcome variable was the presence vs. absence of the TP53 mutation ("1" vs. "0"). This information is provided by the "TP53" column of the mutation data. Note that while predicting this outcome is not contextually meaningful, TP53 mutations have been found to be associated with poor clinical outcomes in cancer patients (Wang & Sun, 2017). In this context, TP53 can be used as a surrogate for a phenotypic outcome. These data sets are intended to test machine learning or statistical methods and may not be useful for biological analysis.

    References:

    R. Hornung, F. Ludwigs, J. Hagenberg, and A.-L. Boulesteix. Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study. arXiv, arXiv:2302.03991, 2023.

    R. Hornung, and M. N. Wright. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics, 20:358, 2019.

    X. Wang, and Q. Sun. TP53 mutations, expression and interaction networks in human cancers. Oncotarget, 8(1):624-643, 2017.

  2. Social Contacts

    • kaggle.com
    zip
    Updated Apr 29, 2020
    Cite
    Patrick (2020). Social Contacts [Dataset]. https://www.kaggle.com/bitsnpieces/social-contacts
    Available download formats: zip (33056 bytes)
    Dataset updated: Apr 29, 2020
    Authors: Patrick
    Description

    Inspiration

    Which countries have the most social contacts in the world? In particular, do countries with more social contacts among the elderly report more deaths during a pandemic caused by a respiratory virus?

    Context

    With the emergence of the COVID-19 pandemic, reports have shown that the elderly are at a higher risk of dying than any other age group. 8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older. Countries have also begun to enforce 2km social distancing to contain the pandemic.

    To this end, I wanted to explore the relationship between social contacts among the elderly and the number of COVID-19 deaths across countries.

    Content

    This dataset includes a subset of the projected social contact matrices for 152 countries from Prem et al. 2020. The projections were based on the POLYMOD study, in which information on social contacts was obtained using cross-sectional surveys in Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Italy (IT), Luxembourg (LU), The Netherlands (NL), and Poland (PL) between May 2005 and September 2006.

    This dataset includes contact rates from study participants ages 65+ for all countries from all sources of contact (work, home, school and others).

    I used this R code to extract this data:

    # Source: https://github.com/kieshaprem/covid19-agestructureSEIR-wuhan-social-distancing/blob/master/data/contacts.Rdata
    load('../input/contacts.Rdata')
    # Inspect the data: one list entry per country, with one contact matrix per setting
    View(contacts)
    contacts[["ALB"]][["home"]]
    contacts[["ITA"]][["all"]]
    rowSums(contacts[["ALB"]][["all"]])
    # For every country, collect one row of the "all" matrix (rows 16, 15, 14 = ages 75-79, 70-74, 65-69).
    # Each loop must append to its own accumulator (out1, out2, out3).
    out1 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[16,]; out1 <- rbind(out1, data.frame(x)) }
    out2 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[15,]; out2 <- rbind(out2, data.frame(x)) }
    out3 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[14,]; out3 <- rbind(out3, data.frame(x)) }
    # Reshape the stacked values into one row per country and one column per contacted age group
    m1 = data.frame(t(matrix(unlist(out1), nrow=16)))
    m2 = data.frame(t(matrix(unlist(out2), nrow=16)))
    m3 = data.frame(t(matrix(unlist(out3), nrow=16)))
    # Label rows with country codes and columns with age groups
    rownames(m1) = names(contacts)
    colnames(m1) = c("00_04", "05_09", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", "75_79")
    rownames(m2) = rownames(m1)
    rownames(m3) = rownames(m1)
    colnames(m2) = colnames(m1)
    colnames(m3) = colnames(m1)
    # Write one CSV per participant age group, rounding away negligible values
    write.csv(zapsmall(m1),"contacts_75_79.csv", row.names = TRUE)
    write.csv(zapsmall(m2),"contacts_70_74.csv", row.names = TRUE)
    write.csv(zapsmall(m3),"contacts_65_69.csv", row.names = TRUE)
    

    Row names correspond to the three-letter ISO country code, e.g. ITA represents Italy. Column names are the age groups of the individuals contacted, in 5-year intervals from 0 to 80 years old. Cell values are the projected mean social contact rate.
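As a rough illustration of the file layout described above, the following Python sketch reads a CSV in that shape and sums the contact rates per country. The sample text is hypothetical: it uses only three of the sixteen age-group columns and made-up values, and the function name `total_contacts_by_country` is an assumption introduced here, not part of the dataset.

```python
import csv
import io

# Hypothetical excerpt in the layout produced by write.csv(..., row.names = TRUE):
# an unnamed first column holding the ISO code, then one column per contacted age group.
# The numbers are made up for illustration and are NOT taken from the dataset.
SAMPLE = '''"","00_04","05_09","10_14"
"ALB",0.5,0.25,0.25
"ITA",0.125,0.125,0.25
'''

def total_contacts_by_country(csv_text):
    """Return {ISO code: sum of contact rates across all age-group columns}."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row of age-group labels
    totals = {}
    for row in reader:
        country = row[0]
        rates = [float(value) for value in row[1:]]
        totals[country] = sum(rates)
    return totals

totals = total_contacts_by_country(SAMPLE)
print(totals)  # {'ALB': 1.0, 'ITA': 0.5}
```

For the real files, the same function could be applied to the text of contacts_65_69.csv, contacts_70_74.csv, or contacts_75_79.csv.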

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1139998%2Ffa3ddc065ea46009e345f24ab0d905d2%2Fcontact_distribution.png?generation=1588258740223812&alt=media

    Acknowledgements

    Thanks go to Dr. Kiesha Prem for her correspondence and to her team for publishing their work on social contact matrices.


  3. Savi et al., 2020 -- tributary-main-channel interaction experiments --...

    • zenodo.org
    zip
    Updated Sep 14, 2022
    Cite
    Jayaram Hariharan (2022). Savi et al., 2020 -- tributary-main-channel interaction experiments -- Experiment No Change 2 subset as netCDF files [Dataset]. http://doi.org/10.5281/zenodo.7047109
    Available download formats: zip
    Dataset updated: Sep 14, 2022
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Jayaram Hariharan
    License: Attribution 1.0 (CC BY 1.0) (https://creativecommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    Overview

    Zip file contains two netCDF files with a subset of data from the "No Change 2" (NC2) experiment conducted by Savi et al., 2020 and published in Earth Surface Dynamics (https://doi.org/10.5194/esurf-8-303-2020) with the original data available via the Sediment Experimentalists Network Project Space SEAD Internal Repository (https://doi.org/10.26009/s0ZOQ0S6). Topographic scan data were re-formatted into the netCDF file "T_NC2_scans.nc", and overhead imagery was extracted from the video of the experiment approximately once every minute of experimental time and RGB band data is provided in the formatted netCDF file "T_NC2_images.nc". These data were formatted into netCDF files for easy loading into the "deltametrics" analysis toolbox.

    Additional Details

    Re-packaging the scan data from the .tif files was straightforward. From the metadata spreadsheet, we know the times at which the scans were taken (and can eliminate the redundant scan). From the paper itself we know the resolution of the topographic scans is 1 mm in the horizontal and vertical. We also know the input discharges, both water and sediment, through both the main channel and tributary, from the paper. We provide these values as metadata in the netCDF files. The scans form the 'eta' field representing the topography in the file. The packaged up netCDF file is called 'T_NC2_scans.nc'.

    Overhead imagery from the T_NC2_Complete21fps.wmv video file was extracted using the following command:

    ffmpeg -i T_NC2_Complete21fps.wmv -r 21 T_NC2_frames/%04d.png

    This command uses the ffmpeg tool to extract the frames at a rate of 21 frames per second (-r 21), since the file name implies that this is the rate at which the overhead photos were combined into a video. The NC designation indicates that this experiment was performed with no change in the input conditions in either the main or tributary channels.

    The experiment ran for a total of 480 minutes. A total of 1466 images were obtained from the ffmpeg extraction. This translates to an image approximately every 20 seconds of real time (480 minutes / 1466 frames * 60 seconds/minute = 19.6453 seconds / frame). We sample every 3rd frame, which gives us images roughly once a minute (489 frames in all), to create the subset of data re-packaged as a netCDF file for deltametrics. Dimensions for the pixels were approximated based on our knowledge of the topographic scan resolution. Assuming the extents of the scans and overhead images are the same (although they are not), we calculate the number of millimeters per pixel in the x and y directions for the overhead images. We assume the pixels are more likely to be square than rectangular, so we average these values and assign this as the distance per pixel in both the x and y dimensions for these data.
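The frame-timing arithmetic and the pixel-size averaging described above can be sketched in Python as follows. This is only an illustration under stated assumptions: sampling is assumed to start at the first extracted frame (index 0), and the function `square_pixel_size` and its arguments are hypothetical names, since the actual scan extents and image dimensions are not given in the description.

```python
# Values stated in the description.
experiment_minutes = 480
total_frames = 1466
step = 3  # keep every 3rd extracted frame

# Real time represented by one extracted frame.
seconds_per_frame = experiment_minutes * 60 / total_frames

# Assumption: the subset starts at frame index 0.
kept_indices = list(range(0, total_frames, step))
seconds_per_kept_frame = seconds_per_frame * step

def square_pixel_size(extent_x_mm, extent_y_mm, width_px, height_px):
    """Average the x and y millimetres-per-pixel into one square pixel size.

    Hypothetical helper: the extents and pixel counts must be supplied by the user.
    """
    mm_per_px_x = extent_x_mm / width_px
    mm_per_px_y = extent_y_mm / height_px
    return (mm_per_px_x + mm_per_px_y) / 2

print(round(seconds_per_frame, 4))       # 19.6453
print(len(kept_indices))                 # 489
print(round(seconds_per_kept_frame, 1))  # 58.9
```

Keeping every third frame therefore gives one image roughly every 59 seconds, which matches the "roughly once a minute" figure and the 489 frames quoted above.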

    References

    Savi, Sara, et al. "Interactions between main channels and tributary alluvial fans: channel adjustments and sediment-signal propagation." Earth Surface Dynamics 8.2 (2020): 303-322.

    Physical experiments on interactions between main-channels and tributary alluvial fans
    S. Savi, Tofelde, A. Wickert, A. Bufe, T. Schildgen, and M. Strecker
    https://doi.org/10.26009/s0ZOQ0S6
