CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The random Lorentz gas (RLG) is a minimal model for transport in disordered media. Despite the broad relevance of the model, theoretical grasp over its properties remains weak. For instance, the scaling with dimension $d$ of its localization transition at the void percolation threshold is not well controlled either analytically or computationally. A recent study [Biroli et al., Phys. Rev. E 103, L030104 (2021)] of the caging behavior of the RLG, motivated by the mean-field theory of glasses, has uncovered physical inconsistencies in that scaling that heighten the need for guidance. Here, we first extend analytical expectations for asymptotic high-$d$ bounds on the void percolation threshold, and then computationally evaluate both the threshold and its criticality in various $d$. In high-$d$ systems, we observe that the standard percolation physics is complemented by a dynamical slowdown of the tracer dynamics reminiscent of mean-field caging. A simple modification of the RLG is found to bring the interplay between percolation and mean-field-like caging down to $d=3$.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a density-aided clustering method called Skeleton Clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios. Supplementary materials for this article are available online.
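As a rough illustration of the prototype-then-merge idea described above, the hedged sketch below overfits k-means to obtain "knots", scores each pair of knots with a crude density surrogate (the number of points near the segment midpoint), and then merges knots by hierarchical clustering. The surrogate and all parameter choices are illustrative stand-ins, not the paper's proposed surrogate density measures.

```python
# Minimal sketch of a prototype-then-merge clustering scheme (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def skeleton_cluster(X, n_knots=30, n_clusters=3, radius=None):
    # 1) overfit k-means to get a concise skeleton of "knots"
    km = KMeans(n_clusters=n_knots, n_init=10, random_state=0).fit(X)
    knots, labels = km.cluster_centers_, km.labels_
    if radius is None:
        radius = np.median(np.linalg.norm(X - knots[labels], axis=1))
    # 2) crude surrogate density between knots: points near the segment midpoint
    sim = np.zeros((n_knots, n_knots))
    for i in range(n_knots):
        for j in range(i + 1, n_knots):
            mid = 0.5 * (knots[i] + knots[j])
            sim[i, j] = sim[j, i] = np.sum(np.linalg.norm(X - mid, axis=1) < radius)
    # 3) hierarchical clustering of knots on a dissimilarity derived from the surrogate
    dissim = 1.0 / (1.0 + sim)
    np.fill_diagonal(dissim, 0.0)
    knot_groups = fcluster(linkage(squareform(dissim), method="single"),
                           t=n_clusters, criterion="maxclust")
    # 4) each point inherits the group of its knot
    return knot_groups[labels]
```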
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data (by "stretching") and by projecting. Moreover, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering under misspecified models. Several simulation illustrations are provided highlighting weaknesses of the clustering $R^2$, especially in high-dimensional settings. A functional data example shows that the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
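A toy numeric illustration of the inflation point: the clustering $R^2$ is the proportion of total variance explained by the cluster means, and linearly stretching one coordinate of structureless data already increases it. The construction below is a sketch for intuition, not the note's exact setting.

```python
# Clustering R^2 = 1 - (within-cluster SS) / (total SS), and how "stretching" inflates it.
import numpy as np
from sklearn.cluster import KMeans

def clustering_r2(X, labels):
    grand_mean = X.mean(axis=0)
    total_ss = np.sum((X - grand_mean) ** 2)
    within_ss = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                    for k in np.unique(labels))
    return 1.0 - within_ss / total_ss

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # no real cluster structure
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clustering_r2(X, labels))                     # moderate R^2 even without clusters

X_stretched = X * np.array([10.0, 1.0])             # linear "stretching" of one axis
labels_s = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_stretched)
print(clustering_r2(X_stretched, labels_s))         # noticeably larger R^2
```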
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For conditional time-varying factor models with high dimensional assets, this article proposes a high dimensional alpha (HDA) test to assess whether there exist abnormal returns on securities (or portfolios) over the theoretical expected returns. To employ this test effectively, a constant coefficient test is also introduced. It examines the validity of constant alphas and factor loadings. Simulation studies and an empirical example are presented to illustrate the finite sample performance and the usefulness of the proposed tests. Using the HDA test, the empirical example demonstrates that the FF three-factor model (Fama and French, 1993) is better than CAPM (Sharpe, 1964) in explaining the mean-variance efficiency of both the Chinese and US stock markets. Furthermore, our results suggest that the US stock market is more efficient in terms of mean-variance efficiency than the Chinese stock market.
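For intuition, the sketch below shows only the basic ingredient of such a test: per-asset time-series regressions of excess returns on the factors, whose intercepts are the alphas. How the HDA test aggregates the resulting high-dimensional vector of alphas into a valid test statistic is the paper's contribution and is not reproduced here; the function and its arguments are illustrative.

```python
# Per-asset alpha estimation via OLS of excess returns on factor returns.
import numpy as np

def estimate_alphas(excess_returns, factors):
    """excess_returns: (T, p) array of asset excess returns;
    factors: (T, K) array of factor returns (e.g., MKT, SMB, HML)."""
    T = factors.shape[0]
    X = np.column_stack([np.ones(T), factors])        # intercept + factors
    coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    return coef[0]                                     # first row = alphas, shape (p,)
```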
https://www.law.cornell.edu/uscode/text/17/106
Recent advances in single-cell and multi-omics technologies have enabled high-resolution profiling of cellular states, but also introduced new computational challenges. This dissertation presents machine learning methods to improve data quality and extract insights from high-dimensional, multimodal single-cell datasets.
First, we propose Decaf K-means, a clustering algorithm that accounts for cluster-specific confounding effects, such as batch variation, directly during clustering. This approach improves clustering accuracy in both synthetic and real data.
Second, we develop scPDA, a denoising method for droplet-based single-cell protein data that eliminates the need for empty droplets or null controls. scPDA models protein-protein relationships to enhance denoising accuracy and significantly improves cell-type identification.
Third, we introduce Scouter, a model that predicts transcriptional outcomes of unseen gene perturbations. Scouter combines neural networks with large language models to generalize across perturbations, reducing prediction error by over 50% compared to existing methods.
Finally, we extend this to TranScouter, which predicts transcriptional responses under new biological conditions without direct perturbation data. Using a tailored encoder-decoder architecture, TranScouter achieves accurate cross-condition predictions, paving the way for more generalizable models in perturbation biology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. In addition, the unknown underlying distribution of the variables and complicated dependencies among them, such as heteroscedasticity, increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). Further difficulties arise when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point-based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R code is provided online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. Based on our project, the performance did not improve much after using clustering prior to classification. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: this approach differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and in high dimensions Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster labels is not always a good idea, since you may lose almost all the information.

From the perspective of creating new features: cluster analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the quality of the clustering and, in turn, the performance of classification. If the subset of features we cluster on is well suited to it, it might increase the overall classification performance; for example, if the features we run k-means on are numerical and the dimension is small, the overall classification performance may be better.

We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, the ramification we saw was that our results were not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
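A sketch of the pipeline discussed above, assuming scikit-learn style tooling: k-means labels are appended as a feature before classification, and the clustering is rerun without a fixed random_state to gauge stability via the adjusted Rand index between runs.

```python
# Cluster-then-classify sketch plus a simple stability check across k-means runs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score

def cluster_then_classify(X, y, n_clusters=8):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    X_aug = np.column_stack([X, labels])              # cluster id as an extra feature
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X_aug, y, cv=5).mean()

def clustering_stability(X, n_clusters=8, n_runs=5):
    runs = [KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X) for _ in range(n_runs)]
    # pairwise adjusted Rand index between runs: values near 1 indicate stable clusters
    scores = [adjusted_rand_score(runs[i], runs[j])
              for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean(scores))
```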
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
These are simulation results for methods that improve the effectiveness of high-dimensional data tests (such as mean testing, linear model testing, and independence testing), including images and tables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Recent years have seen increased interest in phylogenetic comparative analyses of multivariate datasets, but to date the varied proposed approaches have not been extensively examined. Here we review the mathematical properties required of any multivariate method, and specifically evaluate existing multivariate phylogenetic comparative methods in this context. Phylogenetic comparative methods based on the full multivariate likelihood are robust to levels of covariation among trait dimensions and are insensitive to the orientation of the dataset, but display increasing model misspecification as the number of trait dimensions increases. This is because the expected evolutionary covariance matrix (V) used in the likelihood calculations becomes more ill-conditioned as trait dimensionality increases, and as evolutionary models become more complex. Thus, these approaches are only appropriate for datasets with few traits and many species. Methods that summarize patterns across trait dimensions treated separately (e.g., SURFACE) incorrectly assume independence among trait dimensions, resulting in nearly a 100% model misspecification rate. Methods using pairwise composite likelihood are highly sensitive to levels of trait covariation, the orientation of the dataset, and the number of trait dimensions. The consequence of these debilitating deficiencies is that a user can arrive at differing statistical conclusions, and therefore biological inferences, simply from a data-space rotation, like principal component analysis. By contrast, algebraic generalizations of the standard phylogenetic comparative toolkit that use the trace of covariance matrices are insensitive to levels of trait covariation, the number of trait dimensions, and the orientation of the dataset. Further, when appropriate permutation tests are used, these approaches display acceptable Type I error and statistical power. We conclude that methods summarizing information across trait dimensions, as well as pairwise composite likelihood methods, should be avoided, while algebraic generalizations of the phylogenetic comparative toolkit provide a useful means of assessing macroevolutionary patterns in multivariate data. Finally, we discuss areas in which multivariate phylogenetic comparative methods are still in need of future development; namely highly multivariate Ornstein-Uhlenbeck models and approaches for multivariate evolutionary model comparisons.
The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded at one-second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. Here, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also briefly discuss results on synthetic and real-world data sets. Our algorithm uncovers operationally significant events in high-dimensional data streams in the aviation industry which are not detectable using state-of-the-art methods.
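The sketch below shows a generic multiple-kernel one-class anomaly detector in the same spirit: one kernel over the discrete event sequences, one over the continuous summaries, combined with a fixed weight and passed to a one-class SVM. The specific kernels, weights, and preprocessing here are placeholder assumptions, not the authors' algorithm.

```python
# Generic multiple-kernel one-class SVM sketch for mixed discrete/continuous streams.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

def discrete_kernel(seqs):
    """Fraction of positions on which two equal-length (padded) event sequences agree."""
    S = np.asarray(seqs)
    n = len(S)
    K = np.zeros((n, n))
    for i in range(n):
        K[i] = (S == S[i]).mean(axis=1)
    return K

def combined_kernel(seqs, X_cont, w=0.5, gamma=None):
    # convex combination of a discrete-sequence kernel and an RBF kernel stays PSD
    return w * discrete_kernel(seqs) + (1 - w) * rbf_kernel(X_cont, gamma=gamma)

def fit_anomaly_detector(seqs, X_cont, nu=0.05):
    # fit on (mostly) nominal flights; lower decision scores flag anomalous flights
    K = combined_kernel(seqs, X_cont)
    model = OneClassSVM(kernel="precomputed", nu=nu).fit(K)
    return model, model.decision_function(K)
```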
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Many questions in evolutionary biology require the quantification and comparison of rates of phenotypic evolution. Recently, phylogenetic comparative methods have been developed for comparing evolutionary rates on a phylogeny for single, univariate traits (σ²), and evolutionary rate matrices (R) for sets of traits treated simultaneously. However, high-dimensional traits like shape remain under-examined with this framework, because methods suited for such data have not been fully developed. In this article, I describe a method to quantify phylogenetic evolutionary rates for high-dimensional multivariate data (σ²mult), found from the equivalency between statistical methods based on covariance matrices and those based on distance matrices (R-mode and Q-mode methods). I then use simulations to evaluate the statistical performance of hypothesis testing procedures that compare σ²mult for two or more groups of species on a phylogeny. Under both isotropic and non-isotropic conditions, and for differing numbers of trait dimensions, the proposed method displays appropriate Type I error and high statistical power for detecting known differences in σ²mult among groups. By contrast, the Type I error rate of likelihood tests based on the evolutionary rate matrix (R) increases as the number of trait dimensions (p) increases, and becomes unacceptably large when only a few trait dimensions are considered. Further, likelihood tests based on R cannot be computed when the number of trait dimensions equals or exceeds the number of taxa in the phylogeny (i.e., when p ≥ N). These results demonstrate that tests based on σ²mult provide a useful means of comparing evolutionary rates for high-dimensional data that are otherwise not analytically accessible to methods based on the evolutionary rate matrix. This advance thus expands the phylogenetic comparative toolkit for high-dimensional phenotypic traits like shape. Finally, I illustrate the utility of the new approach by evaluating rates of head shape evolution in a lineage of Plethodon salamanders.
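As a hedged sketch of one trace-based reading of σ²mult (the average over trait dimensions of the univariate generalized least squares rate estimates under Brownian motion), the snippet below computes it from a species-by-trait matrix and the phylogenetic covariance matrix C; this is an interpretation for illustration, not necessarily the exact estimator of the paper.

```python
# Trace-based multivariate Brownian-motion rate: average of univariate GLS rates.
import numpy as np

def sigma2_mult(Y, C):
    """Y: (N, p) trait matrix (rows = species, ordered as in C);
    C: (N, N) phylogenetic covariance matrix implied by the tree."""
    N, p = Y.shape
    Cinv = np.linalg.inv(C)
    ones = np.ones((N, 1))
    # GLS estimate of the ancestral (root) state for each trait
    a_hat = np.linalg.solve(ones.T @ Cinv @ ones, ones.T @ Cinv @ Y)   # shape (1, p)
    R = Y - ones @ a_hat                                               # GLS residuals
    return np.trace(R.T @ Cinv @ R) / (N * p)
```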
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task to be solved in many science and engineering fields, but still remains an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample size extension is developed using the refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible for evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest can then be reconstructed. By introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, a flexible mixture distribution model is proposed, whose fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained straightforwardly from the recovered EVD. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
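The sketch below illustrates the fractional-moment workflow in miniature: estimate a few low-order fractional moments of the extreme response from samples, fit a parametric model by matching them, and read off exceedance probabilities. A single lognormal and plain Monte Carlo are used as stand-ins for the paper's mixture model and RLSS scheme.

```python
# Fit a lognormal EVD surrogate by matching sample fractional moments E[X^alpha].
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import norm

def lognormal_fractional_moment(alpha, mu, sigma):
    # closed form for X ~ LogNormal(mu, sigma): E[X^alpha] = exp(alpha*mu + 0.5*(alpha*sigma)^2)
    return np.exp(alpha * mu + 0.5 * (alpha * sigma) ** 2)

def fit_evd_from_samples(extreme_samples, alphas=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """extreme_samples: positive samples of the extreme (max) response."""
    alphas = np.asarray(alphas)
    m_hat = np.array([np.mean(extreme_samples ** a) for a in alphas])  # sample fractional moments
    def resid(theta):
        mu, log_sigma = theta
        return np.log(lognormal_fractional_moment(alphas, mu, np.exp(log_sigma))) - np.log(m_hat)
    mu, log_sigma = least_squares(resid, x0=[np.log(np.mean(extreme_samples)), 0.0]).x
    return mu, np.exp(log_sigma)

def first_passage_probability(threshold, mu, sigma):
    # P(extreme response > threshold) under the fitted lognormal EVD surrogate
    return 1.0 - norm.cdf((np.log(threshold) - mu) / sigma)
```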
RIL RTPCR and CHC data: qRT-PCR data for three genes (AmyPres, CL481res, NinaDres) and cuticular hydrocarbons (lc1, lc2, lc3, lc5, lc6, lc7, lc8, lc9) for 430 individuals from 41 RILs and 2 parental lines (EUNGC and FORS4). Canonical variates for genes (V1, V2, V3) and CHCs (W1, W2, W3) are also given.
RIL CHC and gene factor means: RIL line means for 12 genetic factors and 9 cuticular hydrocarbons for the line-mean analysis presented in Table 2.
PlethHead-SVL-Data: Head shape and body size data (species means) for 42 species of Plethodon salamanders.
Plethodon Phylogeny: Plethodon phylogeny (from Wiens et al. 2006); file plethred.tre.
Type I Error Simulation Script: R script for simulating data to evaluate Type I error rates of two multivariate phylogenetic comparative method approaches; file PICRand-TypeIError.r.
Simulation support scripts: Support scripts for Type I error simulations; file procD.pic.r.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this paper, we establish a high-dimensional CLT for the sample mean of p-dimensional spatial data observed over irregularly spaced sampling sites in R^d, allowing the dimension p to be much larger than the sample size n. We adopt a stochastic sampling scheme that can generate irregularly spaced sampling sites in a flexible manner and includes both pure increasing domain and mixed increasing domain frameworks. To facilitate statistical inference, we develop the spatially dependent wild bootstrap (SDWB) and justify its asymptotic validity in high dimensions by deriving error bounds that hold almost surely conditionally on the stochastic sampling sites. Our dependence conditions on the underlying random field cover a wide class of random fields such as Gaussian random fields and continuous autoregressive moving average random fields. Through numerical simulations and a real data analysis, we demonstrate the usefulness of our bootstrap-based inference in several applications, including joint confidence interval construction for high-dimensional spatial data and change-point detection for spatio-temporal data.
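A minimal sketch of the spatially dependent wild bootstrap idea, under placeholder choices: Gaussian multipliers are drawn from a random field whose correlation decays with the distance between sampling sites, so nearby observations are perturbed coherently. The exponential kernel and bandwidth below are illustrative, not the paper's construction.

```python
# Spatially dependent wild bootstrap sketch for the sample mean of spatial data.
import numpy as np
from scipy.spatial.distance import cdist

def sdwb_means(X, sites, bandwidth, n_boot=500, rng=None):
    """X: (n, p) observations; sites: (n, d) sampling locations in R^d."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Xbar = X.mean(axis=0)
    centered = X - Xbar
    Sigma_w = np.exp(-cdist(sites, sites) / bandwidth)       # multiplier covariance over sites
    L = np.linalg.cholesky(Sigma_w + 1e-10 * np.eye(n))
    W = L @ rng.standard_normal((n, n_boot))                  # (n, n_boot) dependent multipliers
    boot_means = Xbar + (centered.T @ W).T / n                # (n_boot, p) bootstrap means
    return boot_means
```

Simultaneous confidence intervals for the p coordinates of the mean can then be formed from bootstrap quantiles of the maximum coordinate-wise deviation.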
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Benchmarks are an essential driver of progress in scientific disciplines. Ideal benchmarks mimic real-world tasks as closely as possible, since insufficient difficulty or applicability can stunt growth in the field. Benchmarks should also have sufficiently low computational overhead to promote accessibility and repeatability. The goal is then to win a "Turing test" of sorts by creating a surrogate model that is indistinguishable from the ground truth observation (at least within the dataset bounds that were explored), which necessitates a large amount of data. In materials science and chemistry, industry-relevant optimization tasks are often hierarchical, noisy, multi-fidelity, multi-objective, high-dimensional, and non-linearly correlated, while exhibiting mixed numerical and categorical variables subject to linear and non-linear constraints. To complicate matters, unexpected failed simulation or experimental regions may be present in the search space. In this study, 173,219 quasi-random hyperparameter combinations were generated across 23 hyperparameters and used to train CrabNet on the Matbench experimental band gap dataset. The results were logged to a free-tier shared MongoDB Atlas database. This study resulted in a regression dataset mapping hyperparameter combinations (including repeats) to MAE, RMSE, computational runtime, and model size for a CrabNet model trained on the Matbench experimental band gap benchmark task. This dataset is used to create a surrogate model as close as possible to running the actual simulations by incorporating heteroskedastic noise. Failure cases for bad hyperparameter combinations were excluded via careful construction of the hyperparameter search space, and so were not considered, as was done in prior work. For the regression dataset, percentile ranks were computed within each of the groups of identical parameter sets to enable capturing heteroskedastic noise. This contrasts with a more traditional approach that imposes a priori assumptions such as Gaussian noise, e.g., by providing a mean and standard deviation. A similar approach can be applied to other benchmark datasets to bridge the gap between optimization benchmarks with low computational overhead and realistically complex, real-world optimization scenarios.
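The percentile-rank construction can be expressed in a few lines of pandas, assuming the runs are stored as a flat table; the column names below are illustrative, not the dataset's actual schema.

```python
# Rank each run's MAE within its group of identical hyperparameter combinations;
# the within-group rank distribution preserves heteroskedastic, non-Gaussian noise
# without imposing a mean/standard-deviation summary.
import pandas as pd

def add_percentile_ranks(df, hyperparam_cols, metric_col="mae"):
    out = df.copy()
    out[metric_col + "_pct_rank"] = out.groupby(hyperparam_cols)[metric_col].rank(pct=True)
    return out

# usage (illustrative file and column names):
# df = pd.read_csv("crabnet_hyperparameter_runs.csv")
# df = add_percentile_ranks(df, hyperparam_cols=[c for c in df.columns if c.startswith("hp_")])
```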
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The empirical quantiles of independent data provide a good summary of the underlying distribution of the observations. For high-dimensional time series defined in two dimensions, such as in space and time, one can define empirical quantiles of all observations at a given time point, but such time-wise quantiles can only reflect properties of the data at that time point. They often fail to capture the dynamic dependence of the data. In this article, we propose a new definition of empirical dynamic quantiles (EDQ) for high-dimensional time series that mitigates this limitation by imposing that the quantile must be one of the observed time series. The word dynamic emphasizes the fact that these newly defined quantiles capture the time evolution of the data. We prove that the EDQ converge to the time-wise quantiles under some weak conditions as the dimension increases. A fast algorithm to compute the dynamic quantiles is presented and the resulting quantiles are used to produce summary plots for a collection of many time series. We illustrate with two real datasets that the time-wise and dynamic quantiles convey different and complementary information. We also briefly compare the visualization provided by EDQ with that obtained by functional depth. The R code and a vignette for computing and plotting EDQ are available at https://github.com/dpena157/HDts/.
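One plausible reading of the definition, shown below as a brute-force sketch: among the observed series, the EDQ at level τ is the series minimizing the total check (quantile) loss to all series across time, so the reported quantile is itself an observed trajectory. This is for intuition only and is not the fast algorithm referenced above.

```python
# Brute-force O(p^2 T) sketch of selecting an observed series as the dynamic quantile.
import numpy as np

def check_loss(u, tau):
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def empirical_dynamic_quantile(X, tau):
    """X: (T, p) panel of time series; returns the column index of the selected series."""
    T, p = X.shape
    totals = np.array([check_loss(X - X[:, [i]], tau).sum() for i in range(p)])
    return int(np.argmin(totals))
```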
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content:
This dataset contains time-resolved in situ data acquired during the formation of blade-coated perovskite thin films, which were subsequently processed into functional perovskite solar cells. The time series data capture the vacuum quenching process - a critical step in perovskite layer formation - using photoluminescence (PL) and diffuse reflection imaging. The dataset is intended to support deep learning applications for predicting both material-level properties (e.g., precursor composition) and device-level performance metrics (e.g., power conversion efficiency, PCE).
Unlike the previous dataset, this dataset includes perovskite solar cells fabricated under varied process conditions. Specifically, the quenching duration, precursor solution molarity, and molar ratio were systematically changed to enhance the diversity of the data.
To monitor the vacuum quenching process, a PL imaging setup captured four channels of time series image data (2D+t), including one diffuse reflection channel and three PL spectrum channels filtered for different wavelengths. All images were cropped into 65x56 pixel patches, isolating the active area of individual solar cells. However, currently, the dataset provides only the time transients of these four channels, where the spatial mean intensity was calculated for each time step. This dimensionality reduction transforms the high-dimensional video data into compact temporal transients, highlighting the critical dynamics of thin-film formation.
The dataset consists of two parts:
Samples finalized into functional solar cells:
Samples not finalized into functional solar cells:
Further information on the experimental procedure and data processing is detailed in the corresponding paper: Deep learning for augmented process monitoring of scalable perovskite thin-film fabrication. Please cite this paper when using the dataset.
Columns in the data.h5 file:
date, expID, patchID (sample identifiers)
ND, LP725, LP780, SP775 (signal transients from vacuum quenching)
ratio, molarity (precursor solution properties)
evac_duration (vacuum quenching duration)
PCE_forward, PCE_backward, VOC_forward, VOC_backward, JSC_forward, JSC_backward, FF_forward, FF_backward
plqyWL, lumFluxDens (PL spectra after vacuum quenching)
RSHUNT_forward, RSHUNT_backward, RS_forward, RS_backward (shunt and series resistances from jV curves)
PLQY, iVOC, jscPLQY, egPLQY (calculated from PLQY measurements)
Usage:
The dataset is structured for machine learning applications aimed at improving understanding of complex perovskite thin-film formation from solution. The corresponding paper tackles these challenges:
Using ND, LP725, LP780, and SP775 as inputs to predict ratio and molarity.
Using ND, LP725, LP780, and SP775 together with a variable process parameter (evac_duration) as inputs to predict PCE_backward.
Forecasting the signal transients (ND, LP725, LP780, SP775) as a function of a variable process parameter (evac_duration) and predicting the corresponding device performance metric PCE_backward.
Scripts for generating the same train-test splits and cross-validation folds as in the corresponding paper are provided in the GitHub repository: 00a_generate_Material_train_test_folds.ipynb and 00b_generate_PCE_train_test_folds.ipynb. Additionally, random forest models used for forecasting are included in forecasting_models.zip.
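A hedged sketch of the first task above (predicting ratio and molarity from the four transients), assuming the transients can be loaded as arrays of shape (cells, timesteps); the loading step and field names are assumptions about data.h5, and the random forest is a generic baseline, not the models from the paper.

```python
# Baseline regression from spatially averaged transients to precursor properties.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def predict_precursor_properties(nd, lp725, lp780, sp775, targets):
    """Each transient array has shape (n_cells, n_timesteps);
    targets is (n_cells, 2) holding ratio and molarity."""
    X = np.concatenate([nd, lp725, lp780, sp775], axis=1)   # one feature vector per cell
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    return cross_val_score(model, X, targets, cv=5, scoring="r2")

# Loading is dataset-specific; field names below are assumptions, check the file structure:
# import h5py
# with h5py.File("data.h5", "r") as f:
#     nd = f["ND"][...]
```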
The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded at one-second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also discuss results on real-world data sets. Our algorithm uncovers operationally significant events in high-dimensional data streams in the aviation industry which are not detectable using state-of-the-art methods.
MIT License: https://spdx.org/licenses/MIT.html
Software package for data-driven identification of latent port-Hamiltonian systems.
Abstract: Conventional physics-based modeling techniques involve high effort, e.g. time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system-theoretic properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as the input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional manifold. In the resulting latent space, a pH system is identified by considering the unknown matrix entries as weights of a neural network. The matrices satisfy the pH matrix properties by construction through Cholesky factorizations. In a joint optimization process over the loss term, the pH matrices are adjusted to match the dynamics observed in the data, while defining a linear pH system in the latent space by construction. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper, a nonlinear pendulum, and the high-dimensional model of a disc brake with linear thermoelastic behavior.
Features: This package implements neural networks that identify linear port-Hamiltonian systems from (potentially high-dimensional) data [1]: autoencoders (AEs) for dimensionality reduction; a pH layer to identify system matrices that fulfill the definition of a linear pH system; pHIN, to identify a (parametric) low-dimensional port-Hamiltonian system directly; and ApHIN, to identify a (parametric) low-dimensional latent port-Hamiltonian system based on coordinate representations found using an autoencoder.
Examples for the identification of linear pH systems from data: one-dimensional mass-spring-damper chain, pendulum, and disc-brake model. See the documentation for more details.
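The structural guarantee mentioned above (pH matrix properties enforced by construction via Cholesky-type factors) can be sketched in a few lines: J is made skew-symmetric and R, Q positive semi-definite, so the latent dynamics dissipate the Hamiltonian for zero input. In the package these raw parameters are trainable network weights; here they are plain arrays for illustration.

```python
# Enforcing linear port-Hamiltonian structure by construction:
# dz/dt = (J - R) Q z + B u with H(z) = 0.5 * z^T Q z, J skew-symmetric, R, Q PSD.
import numpy as np

def ph_matrices(raw_J, raw_R, raw_Q):
    J = raw_J - raw_J.T                      # skew-symmetric by construction
    L_R, L_Q = np.tril(raw_R), np.tril(raw_Q)
    R = L_R @ L_R.T                          # symmetric positive semi-definite
    Q = L_Q @ L_Q.T
    return J, R, Q

def ph_rhs(z, u, J, R, Q, B):
    return (J - R) @ Q @ z + B @ u

# quick check that the Hamiltonian is non-increasing for zero input:
rng = np.random.default_rng(0)
n = 4
J, R, Q = ph_matrices(rng.normal(size=(n, n)), rng.normal(size=(n, n)), rng.normal(size=(n, n)))
z = rng.normal(size=n)
dH = z @ Q @ ph_rhs(z, np.zeros(1), J, R, Q, np.zeros((n, 1)))
assert dH <= 1e-12   # dH/dt = (Qz)^T (J - R) (Qz) = -(Qz)^T R (Qz) <= 0
```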