100+ datasets found

Data from: Optimal Transport based Cross-Domain Integration for...
tandf.figshare.com
pdf
Updated Aug 5, 2025
Cite
Yubai Yuan; Yijiao Zhang; Babak Shahbaba; Norbert Fortin; Keiland Cooper; Qing Nie; Annie Qu (2025). Optimal Transport based Cross-Domain Integration for Heterogeneous Data [Dataset]. http://doi.org/10.6084/m9.figshare.29828924.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.29828924.v1
Dataset updated
Aug 5, 2025
Dataset provided by
Taylor & Francis
Authors
Yubai Yuan; Yijiao Zhang; Babak Shahbaba; Norbert Fortin; Keiland Cooper; Qing Nie; Annie Qu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Detecting dynamic patterns shared across heterogeneous datasets is a critical yet challenging task in many scientific domains, particularly within the biomedical sciences. Systematic heterogeneity inherent in diverse data sources can significantly hinder the effectiveness of existing machine learning methods in uncovering shared underlying dynamics. Additionally, practical and technical constraints in real-world experimental designs often limit data collection to only a small number of subjects, even when rich, time-dependent measurements are available for each individual. These limited sample sizes further diminish the power to detect common dynamic patterns across subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in the conditional mean dynamics of target responses. The key advantage of the proposed method is its ability to enhance discriminative power by reducing heterogeneity unrelated to the signal. This is achieved through the alignment of extracted domain-shared temporal information across multiple datasets from different domains. Our approach is effective regardless of the number of datasets and does not require auxiliary matching information for alignment. Specifically, the method aligns longitudinal data from heterogeneous datasets within a common latent space, capturing shared dynamic patterns while leveraging temporal dependencies within subjects. Theoretically, we establish generalization error bounds for the proposed data integration approach in supervised learning tasks, highlighting a novel trade-off between data alignment and pattern learning. Additionally, we derive convergence rates for the barycentric projection under Gromov-Wasserstein and fused Gromov-Wasserstein distances. 
Numerical studies on both simulated data and neuroscience applications demonstrate that the proposed data integration framework substantially improves prediction accuracy by effectively aggregating information across diverse data sources and subjects.
Data from: Modeling site heterogeneity with posterior mean site frequency...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Jul 2, 2025
Cite
Huai-Chun Wang; Bui Quang Minh; Edward Susko; Andrew J. Roger (2025). Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation [Dataset]. http://doi.org/10.5061/dryad.gv1q5
Unique identifier
https://doi.org/10.5061/dryad.gv1q5
Dataset updated
Jul 2, 2025
Dataset provided by
Dryad Digital Repository
Authors
Huai-Chun Wang; Bui Quang Minh; Edward Susko; Andrew J. Roger
Time period covered
Jan 1, 2017
Description
Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limits their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns a conditional mean amino acid frequency profile to each site calculated based on a mixture model fitted to the data usin...
AN ITERATIVE ESTIMATOR FOR PREDICTING THE HETEROGENEOUS ATTRIBUTE DATA SETS
figshare.com
pdf
Updated Jun 4, 2023
Cite
P. Saravanan (2023). AN ITERATIVE ESTIMATOR FOR PREDICTING THE HETEROGENEOUS ATTRIBUTE DATA SETS [Dataset]. http://doi.org/10.6084/m9.figshare.1030366.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.1030366.v1
Dataset updated
Jun 4, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
P. Saravanan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The quality of the patterns produced by data mining depends on the quality of the data supplied to it. Most real-time databases that serve as sources for data mining are deficient in completeness, correctness and consistency. Improving the quality of data in terms of completeness is a challenging task. Many methods have been proposed for imputing missing values for homogeneous attributes. This paper proposes a mixed kernel function, which imputes the missing values for mixed attributes (the independent attributes are heterogeneous). The mixed kernel function is an integrated unit that applies the appropriate imputation method to each attribute type. For a categorical attribute, the kernel function first assigns the mode value and iterates until the most probable value converges; for a discrete attribute, the mean value is assigned and the iteration continues until the most probable value is reached. The mixed kernel function is tested with a sample database; the results indicate that it performs well in terms of accuracy and number of iterations compared to a linear kernel function.
Raw data of the heterogeneous Hegselmann-Krause model on network ensembles
zenodo.org
bin, mp4, tar
Updated Dec 19, 2022
Cite
Rémi Perrier; Hendrik Schawe; Laura Hernández (2022). Raw data of the heterogeneous Hegselmann-Krause model on network ensembles [Dataset]. http://doi.org/10.5281/zenodo.7455641
Explore at:
Available download formats: mp4, tar, bin
Unique identifier
https://doi.org/10.5281/zenodo.7455641
Dataset updated
Dec 19, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rémi Perrier; Hendrik Schawe; Laura Hernández
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
# Raw data of the heterogeneous Hegselmann-Krause model on network ensembles
This is the raw data underlying the results of the article *«On the effects of over-compromising: heterogeneity and network effects on a bounded confidence opinion dynamics model.»*.

For each measured combination of the parameters, there is one gzipped file. The parameters are:

- Lower and upper bounds of the confidence interval, [ε_l, ε_u].
- Topology: the different types of networks and the average degree with which the networks are generated.
- System size
- Number of realizations for the parameter combination

The single files follow a naming scheme of `data_HK_uni[{eps_l},{eps_u}]_topo={topology}_N={N}_trajrecord=0_{m}real.dat.gz`, where:

- `{eps_l},{eps_u}` are the values of the lower and upper bounds of the confidence interval.
- `{topology}` contains the type of network and the average degree. The possibilities are `BA_k=10`, `ER_c=10`, `sl1`, `sl2`, and `sl3`.
- `N` is the system size. The sizes are powers of two.
- `trajrecord=0` signals that the file contains only the final state.
- `{m}` is the number of realizations.
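
The naming scheme above can be unpacked with a regular expression. The following is a minimal sketch, not part of the dataset: the function name `parse_name` and the example file names are hypothetical, and it assumes that the bounds are written as plain decimal numbers.

```python
import re

# Pattern mirroring the naming scheme quoted above. The topology part may
# itself contain "_" and "=" (e.g. "BA_k=10"), so it is matched lazily up
# to the "_N=" marker.
NAME_PATTERN = re.compile(
    r"data_HK_uni\[(?P<eps_l>[^,\]]+),(?P<eps_u>[^\]]+)\]"
    r"_topo=(?P<topology>.+?)"
    r"_N=(?P<N>\d+)"
    r"_trajrecord=0"
    r"_(?P<m>\d+)real\.dat\.gz"
)


def parse_name(name: str) -> dict:
    """Split a data file name into its parameters (hypothetical helper)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"file name does not follow the scheme: {name!r}")
    return {
        "eps_l": float(match["eps_l"]),  # assumes plain decimal notation
        "eps_u": float(match["eps_u"]),
        "topology": match["topology"],
        "N": int(match["N"]),
        "m": int(match["m"]),
    }
```

For instance, `parse_name("data_HK_uni[0.05,0.3]_topo=BA_k=10_N=16384_trajrecord=0_1000real.dat.gz")` would yield the topology `BA_k=10`, `N=16384` and `m=1000`, provided the real file names format the bounds this way.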

# Data format
Each file contains the final state of each realization back to back. Each final state is encoded as three lines:

- The convergence time is a single integer with a line prefix `# iterations:`
- The positions of all clusters in opinion space with a line prefix `# ` (unsorted)
- The number of agents in each of the clusters without a line prefix
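
The three-line encoding described above could be read along the following lines. This is a sketch under assumptions, not the authors' example script: it assumes that values within a line are separated by whitespace, and the names `FinalState`, `read_states` and `read_file` are hypothetical.

```python
import gzip
from typing import Iterator, NamedTuple, TextIO


class FinalState(NamedTuple):
    iterations: int         # convergence time
    positions: list[float]  # cluster positions in opinion space (unsorted)
    sizes: list[int]        # number of agents in each cluster


def read_states(handle: TextIO) -> Iterator[FinalState]:
    """Yield one FinalState per three-line record (Python 3.9+)."""
    # Skip blank lines so that trailing newlines do not start a record.
    lines = (line.strip() for line in handle if line.strip())
    for header in lines:
        # int() raises ValueError if the expected prefix is missing.
        iterations = int(header.removeprefix("# iterations:"))
        # Assumption: values are whitespace-separated.
        positions = [float(x) for x in next(lines).removeprefix("#").split()]
        sizes = [int(x) for x in next(lines).split()]
        yield FinalState(iterations, positions, sizes)


def read_file(path: str) -> list[FinalState]:
    """Read all realizations from one gzipped data file."""
    with gzip.open(path, "rt") as handle:
        return list(read_states(handle))
```

Since the realizations are stored back to back, the reader consumes records until the file ends; a file truncated in the middle of a record would raise an error rather than return a partial state.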

# Folders structure
The files are organized as follows:

- **`phase_plots.tar`**: contains the data for the different phase plots (full exploration of the [ε_l, ε_u] space) with `N=16384` and `m=100` realizations.
  - **`ER`** contains the data for Erdos Renyi with mean degree of 10 (c=10)
  - **`BA`** contains the data for Barabasi Albert with a mean degree of 10 (k=10)
  - **`SL`** contains the data for Square lattice with first, second and third nearest neighbors (k=4, 8, 12)
- **`swipes.tar`** contains the data for the finite size effects study at fixed ε_l with `m=1000` realizations.
  - **`ER`** contains the data for Erdos Renyi with mean degree of 10 (c=10) with ε_l = 0.05
  - **`BA`** contains the data for Barabasi Albert with a mean degree of 10 (k=10) with ε_l = 0.05
  - **`SL`** contains the data for Square lattice with third nearest neighbors (k=12) with ε_l = 0.03
- the different videos referenced in the main text and the SM follow various naming schemes:
  - **`scatter3D_el_eu_Smax_uni_{topology}_N=16384.mp4`**: 360° rotation of the 3D visualisation of the data leading to the average phase plots.
  - **`scatter2D_el={eps_l}_eu_Smax_extremism_{topology}_SizeEffect.mp4`**: evolution of the scatter plot leading to the finite size study as a function of N.
  - **`scatter3D_el={eps_l}_eu_Smax_extremism_{topology}_SizeEffect.mp4`**: same as before, but in 3D where the Z-axis is the extremism.
  - **`scatter_x0_xt_{topology}_N={N}_{realization_type}.mp4`**: time evolution of the scatter plot of the opinion at time `t` versus initial opinion, color-coded with the extremism. `{realization_type}` can be mild, skewed or U-turn.
  - **`traj_2D_SL_k=12_N=16384_{realization_type}.mp4`**: because of the spatial embedding, the time evolution of those realizations on the Square Lattice can be visualized in 2D.

# Python example for reading the format
An example script, which visualizes
Data from: Multiple Kernel Learning for Heterogeneous Anomaly Detection:...
catalog.data.gov
datasets.ai
+4 more
Updated Apr 10, 2025
Cite
Dashlink (2025). Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study [Dataset]. https://catalog.data.gov/dataset/multiple-kernel-learning-for-heterogeneous-anomaly-detection-algorithm-and-aviation-safety
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large data bases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also discuss results on real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art methods.
Appendix A. Population growth rate bias and mean-square error in the...
wiley.figshare.com
datasetcatalog.nlm.nih.gov
html
Updated Jun 1, 2023
Cite
Lucile Marescot; Roger Pradel; Christophe Duchamp; Sarah Cubaynes; Eric Marboutin; Rémi Choquet; Christian Miquel; Olivier Gimenez (2023). Appendix A. Population growth rate bias and mean-square error in the heterogeneous model and the homogeneous model fitted to simulated data. [Dataset]. http://doi.org/10.6084/m9.figshare.3516713.v1
Explore at:
Available download formats: html
Unique identifier
https://doi.org/10.6084/m9.figshare.3516713.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Lucile Marescot; Roger Pradel; Christophe Duchamp; Sarah Cubaynes; Eric Marboutin; Rémi Choquet; Christian Miquel; Olivier Gimenez
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Population growth rate bias and mean-square error in the heterogeneous model and the homogeneous model fitted to simulated data.
OAG Dataset for H2GB
kaggle.com
Updated Jun 11, 2024
Cite
Junhong Lin (2024). OAG Dataset for H2GB [Dataset]. https://www.kaggle.com/datasets/junhonglin/oag-dataset-for-h2gb/versions/1
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Jun 11, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Junhong Lin
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
oag-cs, oag-eng, and oag-chem are new heterogeneous networks composed of subsets of the Open Academic Graph (OAG). Each dataset contains papers from one of three subject domains: computer science, engineering, and chemistry. The datasets also contain four types of entities: papers, authors, institutions, and fields of study. Each paper is associated with a 768-dimensional feature vector generated by applying a pre-trained XLNet to the paper title. The representations of the words in the title are weighted by each word's attention to obtain the title representation for each paper. Each paper node is labeled with its published venue (paper or conference). Papers published up to 2016 form the training set, papers published in 2017 form the validation set, and papers published in 2018 and 2019 form the test set. The publication year of each paper is also included in these datasets, so the datasets can also be converted to use the publication year as class labels.
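The year-based split described above can be sketched in a few lines. This is an illustration only, not the dataset's own loader: the function name `split_by_year` and the plain list of publication years are assumptions, and papers dated after 2019 are simply left out of all three sets.

```python
def split_by_year(years):
    """Return (train, val, test) index lists from per-paper publication years."""
    train, val, test = [], [], []
    for idx, year in enumerate(years):
        if year <= 2016:
            train.append(idx)   # published up to 2016
        elif year == 2017:
            val.append(idx)     # published in 2017
        elif year in (2018, 2019):
            test.append(idx)    # published in 2018 or 2019
    return train, val, test
```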
Data Description Exchange Services for Heterogeneous Vehicle and Spaceport...
data.amerigeoss.org
html
Updated Jan 29, 2020
Cite
United States (2020). Data Description Exchange Services for Heterogeneous Vehicle and Spaceport Control and Monitor Systems, Phase I [Dataset]. https://data.amerigeoss.org/pt_PT/dataset/data-description-exchange-services-for-heterogeneous-vehicle-and-spaceport-control-and-mon-b8ca
Explore at:
Available download formats: html
Dataset updated
Jan 29, 2020
Dataset provided by
United States
Description
CCT proposes an advanced data description exchange approach for space/spaceport systems that will provide a generic platform independent software capability for exchange of semantic control and monitoring information. This new strategy will reduce development, operations, and support costs for legacy and future systems that are part of ground and space based distributed control systems. It will also establish a space systems information exchange model that can support future highly interoperable and mobile software systems. The concept seeks to provide a solution that will ease the adoption of a common data definition and exchange standard for legacy and future systems by minimizing or eliminating the need for custom software modifications. Phase 1 of the research will determine the viability of creating common access services for the space ground systems domain based on use of emerging exchange standards for telemetry, and drive out architecture strategies for cross platform generation of monitoring (e.g. health and status) service middleware. Phase 2 will seek to expand the scope of the target domain to also include control services and create a complete usable suite of services for a broader range of heterogeneous systems.
Data from: Does environmental heterogeneity drive functional trait...
search.dataone.org
datadryad.org
Updated Jun 17, 2025
Cite
Jordan Stark; Rebecca Lehman; Lake Crawford; Brian J. Enquist; Benjamin Blonder (2025). Does environmental heterogeneity drive functional trait variation? A test in montane and alpine meadows [Dataset]. http://doi.org/10.5061/dryad.772h7
Unique identifier
https://doi.org/10.5061/dryad.772h7
Dataset updated
Jun 17, 2025
Dataset provided by
Dryad Digital Repository
Authors
Jordan Stark; Rebecca Lehman; Lake Crawford; Brian J. Enquist; Benjamin Blonder
Time period covered
Jul 13, 2020
Description
While community-weighted means of plant traits have been linked to mean environmental conditions at large scales, the drivers of trait variation within communities are not well understood. Local environmental heterogeneity (such as microclimate variability), in addition to mean environmental conditions, may decrease the strength of environmental filtering and explain why communities support different amounts of trait variation. Here, we assess two hypotheses: first, that more heterogeneous local environments and second, that less extreme environments, should support a broader range of plant strategies and thus higher trait variation. We quantified drivers of trait variation across a range of environmental conditions and spatial scales ranging from sub-meter to tens of kilometers in montane and alpine plant communities. We found that, within communities, both environmental heterogeneity and environmental means are drivers of trait variation. However, the importance of each environmental ...
Poisson MSN instructions
figshare.com
application/x-rar
Updated Jul 3, 2025
Cite
FB Shen (2025). Poisson MSN instructions [Dataset]. http://doi.org/10.6084/m9.figshare.28052177.v9
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.28052177.v9
Dataset updated
Jul 3, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
FB Shen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the data and code related to the paper "Enhancing Spatial Count Data Modeling: A new method for Poisson Means of Stratified Nonhomogeneity".
Abstract: Spatial count data is a prevalent data type in natural and social sciences. As the data present complicated spatial autocorrelation and heterogeneity inherent in geographical analysis, current methods lack a theoretical approach to model and predict the count data, especially with limited spatial samples. To address the gap, this study develops a new method named Poisson Means of Stratified Nonhomogeneity (PoiMSN). It theoretically considers both autocorrelation and heterogeneity and, without any covariate, incorporates local samples and out-stratum neighbors that traditional methods neglect, to accurately model and predict the latent process for Poisson-distributed data. PoiMSN was validated by simulation against Poisson geostatistics and traditional MSN. It demonstrated superior performance, achieving the lowest mean absolute error and root-mean-squared error, with at least 5% improvement in accuracy for autocorrelated and stratified Poisson data. The application to hand, foot, and mouth disease data showed that PoiMSN could precisely map the disease risks with lower uncertainty. PoiMSN can accommodate autocorrelated and heterogeneous statistical populations and leverage extensive sample information, substantiating its theoretical and empirical superiority for spatially non-stationary count data.
Streambed temperature data for the manuscript: Heat as a hydrologic tracer...
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Streambed temperature data for the manuscript: Heat as a hydrologic tracer in shallow and deep heterogeneous media: analytical solution, spreadsheet tool, and field applications: U.S. Geological Survey data release [Dataset]. https://catalog.data.gov/dataset/streambed-temperature-data-for-the-manuscript-heat-as-a-hydrologic-tracer-in-shallow-and-d
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This Data Release includes temperature measurements collected using a wrapped fiber-optic tool in a Cape Cod, MA streambed on 06/06/2016 to demonstrate the application of the manuscript: Kurylyk, B.L., Irvine, D.J, Carey, S., Briggs, M.A., Werkema, D., and Bonham, M., 2017, Heat as a hydrologic tracer in shallow and deep heterogeneous media: analytical solution, spreadsheet tool, and field applications, Hydrological Processes. The directory RAW_DATA contains the measured temperature time series at varied depth in the streambed along the vertical fiber-optic HRTS tool as described in the local read.me file. The OUTPUT directory contains simple statistical analysis (min/max, mean, stdev) of the raw temperature data as described in the local read.me file.
Data from: Integration and harmonization of trait data from plant...
live.european-language-grid.eu
zenodo.org
+1 more
csv
Updated Dec 13, 2023
Cite
(2023). Data from: Integration and harmonization of trait data from plant individuals across heterogeneous sources [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7662
Explore at:
Available download formats: csv
Dataset updated
Dec 13, 2023
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Trait data represent the basis for ecological and evolutionary research and have relevance for biodiversity conservation, ecosystem management and earth system modelling. The collection and mobilization of trait data has strongly increased over the last decade, but many trait databases still provide only species-level, aggregated trait values (e.g. ranges, means) and lack the direct observations on which those data are based. Thus, the vast majority of trait data measured directly from individuals remains hidden and highly heterogeneous, impeding their discoverability, semantic interoperability, digital accessibility and (re-)use. Here, we integrate quantitative measurements of verbatim trait information from plant individuals (e.g. lengths, widths, counts and angles of stems, leaves, fruits and inflorescence parts) from multiple sources such as field observations and herbarium collections. We develop a workflow to harmonize heterogeneous trait measurements (e.g. trait names and their values and units) as well as additional information related to taxonomy, measurement or fact and occurrence. This data integration and harmonization builds on vocabularies and terminology from existing metadata standards and ontologies such as the Ecological Trait-data Standard (ETS), the Darwin Core (DwC), the Thesaurus Of Plant characteristics (TOP) and the Plant Trait Ontology (TO). A metadata form filled out by data providers enables the automated integration of trait information from heterogeneous datasets. We illustrate our tools with data from palms (family Arecaceae), a globally distributed (pantropical), diverse plant family that is considered a good model system for understanding the ecology and evolution of tropical rainforests. We mobilize nearly 140,000 individual palm trait measurements in an interoperable format, identify semantic gaps in existing plant trait terminology and provide suggestions for the future development of a thesaurus of plant characteristics. 
Our work thereby promotes the semantic integration of plant trait data in a machine-readable way and shows how large amounts of small trait data sets and their metadata can be integrated into standardized data products.
Data from: Habitat selection and the value of information in heterogenous...
search.dataone.org
datadryad.org
Updated Apr 2, 2025
Cite
Kenneth A. Schmidt; Francois Massol (2025). Habitat selection and the value of information in heterogenous landscapes [Dataset]. http://doi.org/10.5061/dryad.1d73qk4
Unique identifier
https://doi.org/10.5061/dryad.1d73qk4
Dataset updated
Apr 2, 2025
Dataset provided by
Dryad Digital Repository
Authors
Kenneth A. Schmidt; Francois Massol
Time period covered
Jan 1, 2018
Description
Despite the wide usage of the term information in evolutionary ecology, there is no general treatise between fitness (i.e., density-dependent population growth) and selection of the environment sensu lato. Here we (1) initiate the building of a quantitative framework with which to examine the relationship between information use in spatially heterogeneous landscapes and density-dependent population growth, and (2) illustrate its utility by applying the framework to an existing model of breeding habitat selection. We begin by linking information, as a process of narrowing choice, to population growth/fitness. Second, we define a measure of a population's penalty of ignorance based on the Kullback-Leibler index that combines the contributions of resource selection (i.e., biased use of breeding sites) and density-dependent depletion. Third, we quantify the extent to which environmental heterogeneity (i.e., mean and variance within a landscape) constrains sustainable population growth of un...
Data from: Wasserstein-Kaplan-Meier Survival Regression
tandf.figshare.com
pdf
Updated Jan 29, 2025
Cite
Yidong Zhou; Hans-Georg Müller (2025). Wasserstein-Kaplan-Meier Survival Regression [Dataset]. http://doi.org/10.6084/m9.figshare.27045992.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.27045992.v1
Dataset updated
Jan 29, 2025
Dataset provided by
Taylor & Francis
Authors
Yidong Zhou; Hans-Georg Müller
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Survival analysis plays a pivotal role in medical research, offering valuable insights into the timing of events such as survival time. One common challenge in survival analysis is the necessity to adjust the survival function to account for additional factors, such as age, gender, and ethnicity. We propose an innovative regression model for right-censored survival data across heterogeneous populations, leveraging the Wasserstein space of probability measures. Our approach models the probability measure of survival time and the corresponding nonparametric Kaplan-Meier estimator for each subgroup as elements of the Wasserstein space. The Wasserstein space provides a flexible framework for modeling heterogeneous populations, allowing us to capture complex relationships between covariates and survival times. We address an underexplored aspect by deriving the non-asymptotic convergence rate of the Kaplan-Meier estimator to the underlying probability measure in terms of the Wasserstein metric. The proposed model is supported with a solid theoretical foundation including pointwise and uniform convergence rates, along with an efficient algorithm for model fitting. The proposed model effectively accommodates random variation that may exist in the probability measures across different subgroups, demonstrating superior performance in both simulations and two case studies compared to the Cox proportional hazards model and other alternative models. Supplementary materials for this article are available online.
qPortal: A platform for data-driven biomedical research
plos.figshare.com
datasetcatalog.nlm.nih.gov
docx
Updated Jun 1, 2023
Cite
Christopher Mohr; Andreas Friedrich; David Wojnar; Erhan Kenar; Aydin Can Polatkan; Marius Cosmin Codrea; Stefan Czemmel; Oliver Kohlbacher; Sven Nahnsen (2023). qPortal: A platform for data-driven biomedical research [Dataset]. http://doi.org/10.1371/journal.pone.0191603
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0191603
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Christopher Mohr; Andreas Friedrich; David Wojnar; Erhan Kenar; Aydin Can Polatkan; Marius Cosmin Codrea; Stefan Czemmel; Oliver Kohlbacher; Sven Nahnsen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Modern biomedical research aims at drawing biological conclusions from large, highly complex biological datasets. It has become common practice to make extensive use of high-throughput technologies that produce large amounts of heterogeneous data. In addition to the ever-improving accuracy, methods are getting faster and cheaper, resulting in a steadily increasing need for scalable data management and easily accessible means of analysis. We present qPortal, a platform providing users with an intuitive way to manage and analyze quantitative biological data. The backend leverages a variety of concepts and technologies, such as relational databases, data stores, data models and means of data transfer, as well as front-end solutions to give users access to data management and easy-to-use analysis options. Users are empowered to conduct their experiments from the experimental design to the visualization of their results through the platform. Here, we illustrate the feature-rich portal by simulating a biomedical study based on publicly available data. We demonstrate the software’s strength in supporting the entire project life cycle. The software supports the project design and registration, empowers users to do all-digital project management and finally provides means to perform analysis. We compare our approach to Galaxy, one of the most widely used scientific workflow and analysis platforms in computational biology. Application of both systems to a small case study shows the differences between a data-driven approach (qPortal) and a workflow-driven approach (Galaxy). qPortal, a one-stop-shop solution for biomedical projects, offers up-to-date analysis pipelines, quality control workflows, and visualization tools. Through intensive user interactions, appropriate data models have been developed.
These models build the foundation of our biological data management system and provide possibilities to annotate data, query metadata for statistics and future re-analysis on high-performance computing systems via coupling of workflow management systems. Integration of project and data management as well as workflow resources in one place presents clear advantages over existing solutions.
Modeling Species Distributions from Heterogeneous Data for the Biogeographic...
plos.figshare.com
doc
Updated May 31, 2023
Cite
Rubén G. Mateo; Alain Vanderpoorten; Jesús Muñoz; Benjamin Laenen; Aurélie Désamoré (2023). Modeling Species Distributions from Heterogeneous Data for the Biogeographic Regionalization of the European Bryophyte Flora [Dataset]. http://doi.org/10.1371/journal.pone.0055648
Available download formats: doc
Unique identifier
https://doi.org/10.1371/journal.pone.0055648
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Rubén G. Mateo; Alain Vanderpoorten; Jesús Muñoz; Benjamin Laenen; Aurélie Désamoré
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The definition of biogeographic regions provides a fundamental framework for a range of basic and applied questions in biogeography, evolutionary biology, systematics and conservation. Previous research suggested that environmental forcing results in highly congruent regionalization patterns across taxa, but that the size and number of regions depend on the dispersal ability of the taxa considered. We produced a biogeographic regionalization of European bryophytes and hypothesized that (1) regions defined for bryophytes would differ from those defined for other taxa due to the highly specific eco-physiology of the group and (2) their high dispersal ability would result in the resolution of few, large regions. Species distributions were recorded using 10,000 km² MGRS pixels. Because of the lack of data across large portions of the area, species distribution models employing macroclimatic variables as predictors were used to determine the potential composition of empty pixels. K-means clustering analyses of the pixels based on their potential species composition were employed to define biogeographic regions. The optimal number of regions was determined by v-fold cross-validation and Moran’s I statistic. The spatial congruence of the regions identified from their potential bryophyte assemblages with large-scale vegetation patterns is at odds with our primary hypothesis. This reinforces the notion that post-glacial migration patterns might have been much more similar in bryophytes and vascular plants than previously thought. The substantially lower optimal number of clusters and the absence of nested patterns within the main biogeographic regions, as compared to identical analyses in vascular plants, support our second hypothesis.
The modelling approach implemented here is, however, based on many assumptions that are discussed but can only be tested when additional data on species distributions become available, highlighting the substantial importance of developing integrated mapping projects for all taxa in key biogeographic areas of Europe, and the Mediterranean peninsulas in particular.
Data from: Mean flow direction modulates non-Fickian transport in a...
datadryad.org
zip
Updated Sep 10, 2020
Cite
Rich Pauloo (2020). Mean flow direction modulates non-Fickian transport in a heterogeneous alluvial aquifer-aquitard system [Dataset]. http://doi.org/10.25338/B8H920
Available download formats: zip
Unique identifier
https://doi.org/10.25338/B8H920
Dataset updated
Sep 10, 2020
Dataset provided by
Dryad
Authors
Rich Pauloo
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
2020
Description
This research uses three models, which are detailed in the methods of the manuscript. A brief description of the models and the data they rely on is provided below:

hydraulic conductivity field: a T-PROGS (transition probability geostatistics) heterogeneous hydrofacies model of the Kings River Alluvial Fan. Model dimensions are 15 km x 12.6 km x 100.5 m. The model was generated by a former study (Weissmann et al., 1999) and used data from well completion reports, borehole logs, and other geophysical logs.

groundwater flow model: a MODFLOW-2000 groundwater flow model.

particle transport model: an RW3D model that solves the advection-dispersion equation.

The initial and boundary conditions of the models are specified in input files within the provided datasets, and detailed in the manuscript.
Data from: Multi-scale heterogeneity in vegetation and soil carbon in...
data.niaid.nih.gov
datadryad.org
zip
Updated Feb 26, 2016
Cite
William S. Currie; Sarah Kiger; Joan I. Nassauer; Meghan Hutchins; Lauren L. Marshall; Daniel G. Brown; Rick L. Riolo; Derek T. Robinson; Stephanie K. Hart (2016). Multi-scale heterogeneity in vegetation and soil carbon in exurban residential land of southeastern Michigan, USA [Dataset]. http://doi.org/10.5061/dryad.7g6v3
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7g6v3
Dataset updated
Feb 26, 2016
Dataset provided by
University of Waterloo
University of Michigan
Authors
William S. Currie; Sarah Kiger; Joan I. Nassauer; Meghan Hutchins; Lauren L. Marshall; Daniel G. Brown; Rick L. Riolo; Derek T. Robinson; Stephanie K. Hart
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Michigan
Description
Exurban residential land (one housing unit per 0.2–16.2 ha) is growing in importance as a human-dominated land use. Carbon storage in the soils and vegetation of exurban land is poorly known, as are the effects on C storage of choices made by developers and residents. We studied C storage in exurban yards in southeastern Michigan, USA, across a range of parcel sizes and different types of neighborhoods. We divided each residential parcel into ecological zones (EZ) characterized by vegetation, soil, and human behavior such as mowing, irrigation, and raking. We found a heterogeneous mixture of trees and shrubs, turfgrasses, mulched gardens, old-field vegetation, and impervious surfaces. The most extensive zone type was turfgrass with sparse woody vegetation (mean 26% of parcel area), followed by dense woody vegetation (mean 21% of parcel area). Areas of turfgrass with sparse woody vegetation had trees in larger size classes (> 50 cm dbh) than did areas of dense woody vegetation. Using aerial photointerpretation, we scaled up C storage to neighborhoods. Varying C storage by neighborhood type resulted from differences in impervious area (8–26% of parcel area) and area of dense woody vegetation (11–28%). Averaged and multiplied across areas in differing neighborhood types, exurban residential land contained 5240 ± 865 g C/m² in vegetation, highly sensitive to large trees, and 13 800 ± 1290 g C/m² in soils (based on a combined sampling and modeling approach). These contents are greater than for agricultural land in the region, but lower than for mature forest stands. Compared with mature forests, exurban land contained more shrubs and less downed woody debris and it had similar tree size-class distributions up to 40 cm dbh but far fewer trees in larger size classes. If the trees continue to grow, exurban residential land could sequester additional C for decades.
Patterns and processes of C storage in exurban residential land were driven by land management practices that affect soil and vegetation, reflecting the choices of designers, developers, and residents. This study provides an example of human-mediated C storage in a coupled human–natural system.
Data from: Additive interaction between heterogeneous environmental quality...
datasets.ai
s.cnmilf.com
Updated Aug 8, 2024
Cite
U.S. Environmental Protection Agency (2024). Additive interaction between heterogeneous environmental quality domains (air, water, land, sociodemographic and built environment) on preterm birth [Dataset]. https://datasets.ai/datasets/additive-interaction-between-heterogeneous-environmental-quality-domains-air-water-land-so
Dataset updated
Aug 8, 2024
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Authors
U.S. Environmental Protection Agency
Description
The study population included live births from the National Center for Health Statistics (NCHS) for the entire United States for the years 2000–2005 for all 3141 counties. Domain-specific EQIs were used to represent environmental exposure at the county-level for the entire U.S. over the 2000–2005 time period. The EQI includes variables representing five environmental domains: air, water, land, built, and sociodemographic (2). The domain-specific indices include both beneficial and detrimental environmental factors. The air domain includes 87 variables representing criteria and hazardous air pollutants. The water domain includes 80 variables representing overall water quality, general water contamination, recreational water quality, drinking water quality, atmospheric deposition, drought, and chemical contamination. The land domain includes 26 variables representing agriculture, pesticides, contaminants, facilities, and radon. The built domain includes 14 variables representing roads, highway/road safety, public transit behavior, business environment, and subsidized housing environment. The sociodemographic environment includes 12 variables representing socioeconomics and crime. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files.

This dataset is associated with the following publication: Grabich, S., K. Rappazzo, C. Gray, J. Jagai, Y. Jian, L. Messer, and D. Lobdell. Additive interaction between heterogeneous environmental quality domains (air, water, land, sociodemographic and built environment) on preterm birth. Frontiers in Public Health. Frontiers, Lausanne, SWITZERLAND, 4: 232, (2016).
Data from: Contribution of Particulate Nitrate Photolysis to Heterogeneous...
catalog.data.gov
datasets.ai
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Contribution of Particulate Nitrate Photolysis to Heterogeneous Sulfate Formation for Winter Haze in China [Dataset]. https://catalog.data.gov/dataset/contribution-of-particulate-nitrate-photolysis-to-heterogeneous-sulfate-formation-for-wint
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Area covered
China
Description
Nitrate and sulfate are two key components of airborne particulate matter (PM). While multiple formation mechanisms have been proposed for sulfate, current air quality models commonly underestimate its concentrations and mass fractions during northern China winter haze events. On the other hand, current models usually overestimate the mass fractions of nitrate. Very recently, laboratory studies have proposed that nitrous acid (N(III)) produced by particulate nitrate photolysis can oxidize sulfur dioxide to produce sulfate. Here, for the first time, we parameterize this heterogeneous mechanism into the state-of-the-art Community Multi-scale Air Quality (CMAQ) model and quantify its contributions to sulfate formation. We find that the significance of this mechanism mainly depends on the enhancement effects (by 1–3 orders of magnitude as suggested by the available experimental studies) of nitrate photolysis rate constant (J(NO₃⁻)) in aerosol liquid water compared to that in the gas phase. Comparisons between model simulations and in-situ observations in Beijing suggest that this pathway can explain about 15% (assuming an enhancement factor (EF) of 10) to 65% (assuming EF = 100) of the model–observation gaps in sulfate concentrations during winter haze. Our study strongly calls for future research on reducing the uncertainty in EF. This dataset is not publicly accessible because: Data sets used in the analysis presented in the manuscript “Contribution of particulate nitrate photolysis to heterogeneous sulfate formation for winter haze in China” Model simulations were conducted by Tsinghua University in China and results are available from the Tsinghua University. Contact: Haotian Zheng, School of Environment, State Key Joint Laboratory of Environment Simulation and Pollution Control, Tsinghua University, Beijing 100084, China. Email: hzheng@g.harvard.edu.
It can be accessed through the following means: Data sets used in the analysis presented in the manuscript “Contribution of particulate nitrate photolysis to heterogeneous sulfate formation for winter haze in China” Model simulations were conducted by Tsinghua University in China and results are available from the Tsinghua University. Contact: Haotian Zheng, School of Environment, State Key Joint Laboratory of Environment Simulation and Pollution Control, Tsinghua University, Beijing 100084, China. Email: hzheng@g.harvard.edu. Format: Data sets used in the analysis presented in the manuscript “Contribution of particulate nitrate photolysis to heterogeneous sulfate formation for winter haze in China” Model simulations were conducted by Tsinghua University in China and results are available from the Tsinghua University. Contact: Haotian Zheng, School of Environment, State Key Joint Laboratory of Environment Simulation and Pollution Control, Tsinghua University, Beijing 100084, China. Email: hzheng@g.harvard.edu. This dataset is associated with the following publication: Sarwar, G., H. Zheng, S. Song, M. Gen, S. Wang, D. Ding, X. Chang, J. Xing, Y. Sun, D. Ji, C. Chan, J. Gao, and M. McElroy. Contribution of Particulate Nitrate Photolysis to Heterogeneous Sulfate Formation for Winter Haze in China. Environmental Science & Technology Letters. American Chemical Society, Washington, DC, USA, 7(9): 632-638, (2020).

Cite

Yubai Yuan; Yijiao Zhang; Babak Shahbaba; Norbert Fortin; Keiland Cooper; Qing Nie; Annie Qu (2025). Optimal Transport based Cross-Domain Integration for Heterogeneous Data [Dataset]. http://doi.org/10.6084/m9.figshare.29828924.v1

Data from: Optimal Transport based Cross-Domain Integration for Heterogeneous Data

Available download formats: pdf

Unique identifier

https://doi.org/10.6084/m9.figshare.29828924.v1

Dataset updated

Aug 5, 2025

Dataset provided by

Taylor & Francis

Authors

Yubai Yuan; Yijiao Zhang; Babak Shahbaba; Norbert Fortin; Keiland Cooper; Qing Nie; Annie Qu

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Detecting dynamic patterns shared across heterogeneous datasets is a critical yet challenging task in many scientific domains, particularly within the biomedical sciences. Systematic heterogeneity inherent in diverse data sources can significantly hinder the effectiveness of existing machine learning methods in uncovering shared underlying dynamics. Additionally, practical and technical constraints in real-world experimental designs often limit data collection to only a small number of subjects, even when rich, time-dependent measurements are available for each individual. These limited sample sizes further diminish the power to detect common dynamic patterns across subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in the conditional mean dynamics of target responses. The key advantage of the proposed method is its ability to enhance discriminative power by reducing heterogeneity unrelated to the signal. This is achieved through the alignment of extracted domain-shared temporal information across multiple datasets from different domains. Our approach is effective regardless of the number of datasets and does not require auxiliary matching information for alignment. Specifically, the method aligns longitudinal data from heterogeneous datasets within a common latent space, capturing shared dynamic patterns while leveraging temporal dependencies within subjects. Theoretically, we establish generalization error bounds for the proposed data integration approach in supervised learning tasks, highlighting a novel trade-off between data alignment and pattern learning. Additionally, we derive convergence rates for the barycentric projection under Gromov-Wasserstein and fused Gromov-Wasserstein distances. 
Numerical studies on both simulated data and neuroscience applications demonstrate that the proposed data integration framework substantially improves prediction accuracy by effectively aggregating information across diverse data sources and subjects.

Data from: Optimal Transport based Cross-Domain Integration for...

Data from: Modeling site heterogeneity with posterior mean site frequency...

AN ITERATIVE ESTIMATOR FOR PREDICTING THE HETEROGENEOUS ATTRIBUTE DATA SETS

Raw data of the heterogeneous Hegselmann-Krause model on network ensembles

Data from: Multiple Kernel Learning for Heterogeneous Anomaly Detection:...

Appendix A. Population growth rate bias and mean-square error in the...

OAG Dataset for H2GB

Data Description Exchange Services for Heterogeneous Vehicle and Spaceport...

Data from: Does environmental heterogeneity drive functional trait...

Poisson MSN instructions

Streambed temperature data for the manuscript: Heat as a hydrologic tracer...

Data from: Integration and harmonization of trait data from plant...

Data from: Habitat selection and the value of information in heterogenous...

Data from: Wasserstein-Kaplan-Meier Survival Regression

qPortal: A platform for data-driven biomedical research

Modeling Species Distributions from Heterogeneous Data for the Biogeographic...

Data from: Mean flow direction modulates non-Fickian transport in a...

Data from: Multi-scale heterogeneity in vegetation and soil carbon in...

Data from: Additive interaction between heterogeneous environmental quality...

Data from: Contribution of Particulate Nitrate Photolysis to Heterogeneous...

Data from: Optimal Transport based Cross-Domain Integration for Heterogeneous Data