The dataset used in the paper is a bivariate Gaussian likelihood example with uncorrelated priors.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Regression to the mean (RTM) can occur whenever an extreme observation is selected from a population and a later observation is closer to the population mean. A consequence of this phenomenon is that natural variability can be mistaken for real change. Simple expressions are available to quantify RTM when the underlying distribution is bivariate normal. However, there are many real-world situations which are better approximated as a Poisson process. Examples include the number of hard disk failures during a year, the number of cargo ships damaged by waves, daily homicide counts in California, and the number of deaths per quarter attributable to AIDS in Australia. In this paper, we derive expressions for quantifying RTM effects for the bivariate Poisson distribution for both the homogeneous and inhomogeneous cases. Statistical properties of our derivations have been evaluated through a simulation study, and the asymptotic distributions of the RTM estimators have been derived. The RTM effect for the number of people killed in road accidents in different regions of New South Wales (Australia) is estimated using maximum likelihood.
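To make the phenomenon concrete, here is a minimal Monte Carlo sketch of RTM for a bivariate Poisson pair built with the standard common-shock construction (X = Z0 + Z1, Y = Z0 + Z2, with independent Poisson shocks). It only illustrates the effect; the rates and cutoff are arbitrary, and this is not the paper's estimator.

```python
# Minimal RTM illustration for a bivariate Poisson pair (common-shock
# construction). Rates and cutoff are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
lam0, lam1, lam2 = 4.0, 2.0, 2.0      # shared and unique rates
z0 = rng.poisson(lam0, n)             # common shock induces correlation
x = z0 + rng.poisson(lam1, n)         # baseline count, marginal Poisson(6)
y = z0 + rng.poisson(lam2, n)         # follow-up count, same marginal

cutoff = 10                           # select extreme baseline observations
sel = x >= cutoff
print(f"selected fraction: {sel.mean():.3f}")
print(f"apparent 'improvement' (RTM effect): {(x[sel] - y[sel]).mean():.2f}")
```

Because x and y share the same marginal distribution, the mean drop among the selected extreme observations is pure regression to the mean, not real change.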
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Dataset for: Leipold, B. & Loepthien, T. (2021). Attentive and emotional listening to music: The role of positive and negative affect. Jahrbuch Musikpsychologie, 30. https://doi.org/10.5964/jbdgm.78 In a cross-sectional study, associations of global affect with two ways of listening to music, attentive-analytical listening (AL) and emotional listening (EL), were examined. More specifically, the degrees to which AL and EL are differentially correlated with positive and negative affect were examined. In Study 1, a sample of 1,291 individuals responded to questionnaires on listening to music, positive affect (PA), and negative affect (NA). We used the PANAS, which measures PA and NA as high-arousal dimensions. AL was positively correlated with PA, EL with NA. Moderation analyses showed stronger associations between PA and AL when NA was low. Study 2 (499 participants) differentiated between three facets of affect and focused, in addition to PA and NA, on the role of relaxation. Similar to the findings of Study 1, AL was correlated with PA, EL with NA and PA. Moderation analyses indicated that the degree to which PA is associated with an individual's tendency to listen to music attentively depends on their degree of relaxation. In addition, the correlation between pleasant activation and EL was stronger for individuals who were more relaxed; for individuals who were less relaxed, the correlation between unpleasant activation and EL was stronger. In sum, the results demonstrate not only simple bivariate correlations, but also that the expected associations vary depending on the different affective states. We argue that the results reflect a dual function of listening to music, which includes emotional regulation and information processing. Dataset for Study 2.
An example of combining ANOVA terms for bivariate principal component data to create the ANODIS F-statistic, where N is the total number of samples drawn and K the number of assemblages compared.
License: Open Data Commons Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The story behind this dataset is how to apply an LSTM architecture that uses multiple variables together to improve forecasting accuracy.
Air Pollution Forecasting: the Air Quality dataset.
This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.
The data include the date-time, the pollution level (PM2.5 concentration), and weather information including dew point, temperature, pressure, wind direction, wind speed, and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:
- No: row number
- year: year of data in this row
- month: month of data in this row
- day: day of data in this row
- hour: hour of data in this row
- pm2.5: PM2.5 concentration
- DEWP: dew point
- TEMP: temperature
- PRES: pressure
- cbwd: combined wind direction
- Iws: cumulated wind speed
- Is: cumulated hours of snow
- Ir: cumulated hours of rain

We can use this data to frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour (see the sketch below).
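As a starting point, the hourly records can be framed as supervised learning samples by sliding a lag window over the series; the resulting (samples, timesteps, features) array is the input shape an LSTM layer expects. This is a minimal sketch: the file name, the column subset, and the 24-hour window are assumptions, not part of the dataset description, and the raw data must already be cleaned of missing pm2.5 rows.

```python
# Frame the hourly series as windows of n_lag hours predicting the
# next hour's pm2.5. File name and column selection are illustrative.
import numpy as np
import pandas as pd

def make_windows(values, n_lag=24):
    """Return X with shape (samples, n_lag, n_features) and y with the
    pm2.5 value one hour ahead, assuming pm2.5 is in column 0."""
    X, y = [], []
    for i in range(len(values) - n_lag):
        X.append(values[i:i + n_lag])   # n_lag hours of all features
        y.append(values[i + n_lag, 0])  # next hour's pm2.5
    return np.asarray(X), np.asarray(y)

df = pd.read_csv("pollution.csv")  # hypothetical cleaned file
cols = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
X, y = make_windows(df[cols].to_numpy(dtype=float))
print(X.shape, y.shape)  # e.g. (N, 24, 7) and (N,)
```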
Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm, which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both tests show that our algorithms have very high prune rates (>95%), requiring actual disk access for less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1].
[1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779–1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
License: European Commission legal notice, https://ec.europa.eu/info/legal-notice_en
The dataset consists of data in five categories: walking, running, biking, skiing, and roller skiing. Sport activities have been recorded by an individual active (non-competitive) athlete. The data are pre-processed, standardized, and split into four parts (each dimension in its own file):
* HR-DATA_std_1140x69 (heart rate signals)
* SPD-DATA_std_1140x69 (speed signals)
* ALT-DATA_std_1140x69 (altitude signals)
* META-DATA_1140x4 (labels and details)
NOTE: Signal order between the separate files must not be confused when processing the data. Signal order is critical: the first index in each file comes from the same activity, whose label is the first index in the target data file, and so on. The data should therefore be combined into the same table while reading the files, ideally using a nested data structure, something like in the picture below (see also the code sketch after it):
You may check the related TSC projects on GitHub:
- Sport Activity Classification Using Classical Machine Learning and Time Series Methods (https://github.com/JABE22/MasterProject)
- Symbolic Representation of Multivariate Time Series Signals in Sport Activity Classification (Kaggle project)
[Image: Nested data structure for multivariate time series classifiers (https://mediauploads.data.world/e1ccd4d36522e04c0061d12d05a87407bec80716f6fe7301991eaaccd577baa8_mts_data.png)]
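A minimal sketch of assembling that nested structure in Python, assuming the four parts are stored as plain CSV files named after the parts listed above (the file extension, header layout, and label column position are assumptions):

```python
# Combine the three per-dimension signal files into one (1140, 69, 3)
# array, preserving row order so signals stay aligned with their labels.
import numpy as np
import pandas as pd

hr = pd.read_csv("HR-DATA_std_1140x69.csv", header=None).to_numpy()
spd = pd.read_csv("SPD-DATA_std_1140x69.csv", header=None).to_numpy()
alt = pd.read_csv("ALT-DATA_std_1140x69.csv", header=None).to_numpy()
meta = pd.read_csv("META-DATA_1140x4.csv")

X = np.stack([hr, spd, alt], axis=-1)  # shape (1140, 69, 3)
y = meta.iloc[:, 0].to_numpy()         # label column assumed to be first
print(X.shape, y.shape)
```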
The following picture shows five signal samples for each dimension (heart rate, speed, altitude) in standardized feature-value format. Each figure therefore contains signals from five different random activities (of the same or different categories); however, the signals with index 1 in each of the three figures are from the same activity. The figures merely illustrate what kinds of signals the dataset consists of; they do not have any particular meaning.
[Image: Signals from sport activities (Heart Rate, Speed, and Altitude) (https://mediauploads.data.world/162b7086448d8dbd202d282014bcf12bd95bd3174b41c770aa1044bab22ad655_signal_samples.png)]
The original number of sport activities is 228. From each of them, five consecutive 69-second segments have been taken, starting from index 100 (seconds), as expressed by the formula below:
[Image: Data segmentation and augmentation formula (https://mediauploads.data.world/68ce83092ec65f6fbaee90e5de6e12df40498e08fa6725c111f1205835c1a842_segment_equation.png)]
where D = original activities data, N = number of activities, s = segment start index, l = segment length, and n = the number of segments from a single original sequence D_i, resulting in the new set of equal-length segments D_seg. In this particular case, N = 228, s = 100, l = 69, and n = 5, giving 228 × 5 = 1140 segments:
[Image: Data segmentation and augmentation formula with values (https://mediauploads.data.world/63dd87bf3d0010923ad05a8286224526e241b17bbbce790133030d8e73f3d3a7_data_segmentation_formula.png)]
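In code, the segmentation reduces to slicing each activity signal five times; a minimal sketch (synthetic random signals stand in for the real recordings):

```python
# Slice n consecutive segments of length l from each activity, starting
# at index s, as in the formula above. Signals here are synthetic.
import numpy as np

def segment(signal, s=100, l=69, n=5):
    return [signal[s + j * l : s + (j + 1) * l] for j in range(n)]

rng = np.random.default_rng(0)
activities = [rng.standard_normal(500) for _ in range(228)]  # >= 445 samples each
segments = np.array([seg for a in activities for seg in segment(a)])
print(segments.shape)  # (1140, 69)
```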
Thus, the dataset has dimensions of 1140 × 69 × 3.
The data have been recorded without knowing they would be used in research; they therefore represent a real-world application of the data source well and provide an excellent testbed for algorithms on real data.
Recording devices
Data have been recorded using two types of Garmin devices: a Forerunner 920XT and a Vivosport. The Vivosport is an activity tracker that measures heart rate at the wrist using an optical sensor, whereas the 920XT requires an external sensor belt (heart rate + inertial) worn under the chest during exercise. Otherwise the devices are not essentially different: both use GPS location to measure speed and an inertial barometer to measure elevation changes.
Device manuals:
- Garmin FR-920XT
- Garmin Vivosport
Person profile
Age: 30-31, Weight: 82 kg, Height: 181 cm, Active athlete (non-competitive)
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
It is often the case that researchers wish to simultaneously explore the behavior of multiple diseases while accounting for potential spatial and/or temporal correlation. In this paper, we propose a flexible class of multivariate spatio-temporal mixture models to fill this role. Further, these models offer flexibility with the potential for model selection, as well as the ability to accommodate lifestyle, socio-economic, and physical environmental variables with spatial, temporal, or both structures. Here, we explore the capability of this approach via a large-scale simulation study and examine a real data example. The results, which focus on four model variants, suggest that all models can recover the simulation ground truth and display improved model fit over two baseline Knorr-Held spatio-temporal interaction model variants in a real data application.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset supports a meta-analytic structural equation modelling (MASEM) study investigating the factors influencing students' behavioural intention to use educational AI (EAI) technologies. The research integrates constructs from the Technology Acceptance Model (TAM), Theory of Planned Behaviour (TPB), and Artificial Intelligence Literacy (AIL), aiming to resolve inconsistencies in previous studies and improve theoretical understanding of EAI technology adoption.
Research Hypotheses. The study hypothesized that: (1) students' behavioural intention (INT) to use EAI technologies is influenced by perceived usefulness (PU), perceived ease of use (PEU), attitude (ATT), subjective norm (SN), and perceived behavioural control (PBC), as described in TAM and TPB; (2) AI literacy (AIL) directly and indirectly predicts PU, PEU, ATT, and INT; and (3) these relationships are moderated by contextual factors such as academic level (K-12 vs. higher education) and regional economic development (developed vs. developing countries).
What the Data Shows. The meta-analytic dataset comprises 166 empirical studies involving over 69,000 participants. It includes pairwise Pearson correlations among seven constructs (PU, PEU, ATT, SN, PBC, INT, AIL) and is used to compute a pooled correlation matrix. This matrix was then used to test three models via MASEM: a baseline TAM-TPB model; an internally extended model with additional TPB internal paths; and an AIL-integrated extended model. The AIL-integrated model achieved the best fit (CFI = 0.997, RMSEA = 0.053) and explained 62.3% of the variance in behavioural intention.
Notable Findings. AI literacy (AIL) is the strongest predictor of intention to use EAI technologies (total effect = 0.408). PU, ATT, and SN also significantly influence intention. The effect of PEU on intention is fully mediated by PU and ATT. Moderation analysis showed that the relationships differ between developed and developing countries, and between K-12 and higher education populations.
How the Data Can Be Interpreted and Used. The dataset includes bivariate correlations between variables, publication metadata, sample sizes, coding information, and reliability values (e.g., CR scores). It is suitable for replication of MASEM procedures, moderation analysis, and meta-regression. Researchers may use it to test additional theoretical models or assess the influence of new moderators (e.g., AI tool type). Educators and policymakers can leverage insights from the meta-analytic results to inform AI literacy training and technology adoption strategies.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
SPHERE is a students' performance in physics education research dataset. It is presented as a multi-domain learning dataset of students' performance in physics, collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved, from three large public high schools and one small one located in a suburban district of a highly populated province in Indonesia. Some variables related to demographics, accessibility of literature resources, and students' physics identity are also investigated. The RBAs utilized in these data were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students' scientific abilities and learning attitude through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. The conceptual assessments continued in the second semester, measured through the Rotational and Rolling Motion Conceptual Survey (RRMCS), the Fluid Mechanics Concept Inventory (FMCI), the Mechanical Waves Conceptual Survey (MWCS), the Thermal Concept Evaluation (TCE), and the Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, there is a need to advance research on using machine learning and data mining techniques in PER, which can face challenges due to the unavailability of datasets built for the specific purposes of PER studies. SPHERE can be reused as a students' performance dataset in physics, specifically dedicated to PER scholars who wish to implement machine learning techniques in physics education.
License: Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
Abstract: This contribution provides MATLAB scripts to assist users in factor analysis, constrained least squares regression, and total inversion techniques. These scripts respond to the increased availability of large datasets generated by modern instrumentation, for example, the SedDB database. The download (.zip) includes one descriptive paper (.pdf) and one file of the scripts and example output (.doc). Other description: Pisias, N. G., R. W. Murray, and R. P. Scudder (2013), Multivariate statistical analysis and partitioning of sedimentary geochemical data sets: General principles and specific MATLAB scripts, Geochem. Geophys. Geosyst., 14, 4015–4020, doi:10.1002/ggge.20247.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Example of data.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset provides 30-year averaged climate data for both historical and future periods, with a spatial resolution of 0.01° × 0.01°. Historical data (1991–2020) are based on the China Surface Climate Standard Dataset and were interpolated using ANUSPLIN software. Future climate data are derived from CMIP6 simulations, bias-corrected using the Delta downscaling method. The dataset includes 10 models (9 Global Climate Models, namely GCMs, and 1 ensemble model), 3 scenarios (SSP1-2.6, SSP2-4.5, and SSP5-8.5), and 3 future periods (2021–2040, 2041–2070, 2071–2100). For each period (or scenario), 28 climate variables are provided, including 5 monthly basic climate variables (mean temperature, maximum temperature, minimum temperature, precipitation, and percentage of sunshine) and 23 bioclimatic variables based on the basic variables (for details, see the dataset documentation file). Data quality was strictly evaluated: the historical interpolations generated by the ANUSPLIN software showed a strong correlation with observations (all correlation coefficients above 0.91), and the bias correction improved the accuracy of most original GCM simulations, reducing the bias by 0.69%–58.63%. This dataset aims to provide high-resolution, bias-corrected, long-term historical and future climate data for climate and ecological research. All computations were performed using R, and the corresponding code can be found in the dataset folder "Code". All data are provided in GeoTIFF (.tif) format, where each file for the basic climate variables contains 12 bands representing monthly data in ascending order (e.g., Band 1 corresponds to January). To facilitate data storage, all files are provided in compressed archives, following a consistent naming convention:
(1) Historical data: China_Variable_1km_1991–2020.tif, where Variable is the abbreviation of one of the 28 climate variables. Example: China_pr_1km_1991–2020.tif.
(2) Future data: China_Variable_Model_VariantLabel_1km_StartYear-EndYear_Scenario.tif, where Variable is one of the 28 climate variables, Model is the GCM name, VariantLabel is r1i1p1f1 in this study, StartYear-EndYear is the future period, and Scenario is the SSP climate scenario. Example: China_tasmin_MRI-ESM2-0_r1i1p1f1_1km_2071–2100_SSP585.tif.
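A minimal sketch of composing file names from this convention; the helper name is illustrative, and whether the dash inside the year range of the real file names is a hyphen or an en dash should be verified against the archive:

```python
# Build the expected GeoTIFF name for a future-period file; the en dash
# in the year range mirrors the convention above and may need checking.
def future_name(variable, model, start, end, scenario,
                variant="r1i1p1f1", res="1km"):
    return f"China_{variable}_{model}_{variant}_{res}_{start}\u2013{end}_{scenario}.tif"

print(future_name("tasmin", "MRI-ESM2-0", 2071, 2100, "SSP585"))
# -> China_tasmin_MRI-ESM2-0_r1i1p1f1_1km_2071–2100_SSP585.tif
```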
Microgrids are small, self-contained power grids that can operate independently of the main grid. They are becoming increasingly popular as a way to improve the reliability and resilience of the power grid. This paper presents a dataset of power data collected from the Mesa Del Sol microgrid located in Albuquerque, New Mexico. The dataset includes measurements of voltage, current, power, and energy for the microgrid's components. It contains 18 features and was collected over the past 13 months. The dataset is valuable for machine learning applications that can be used to improve the operation and management of microgrids. For example, the data could be used to train machine learning models to predict power outages or to optimise the microgrid's energy consumption.

# MDS, a multivariate microgrid dataset
https://doi.org/10.5061/dryad.fqz612jzb
This dataset contains power data collected from the Mesa Del Sol microgrid located in Albuquerque, New Mexico. The dataset includes measurements of voltage, power, frequency, and temperature from different sensors and devices installed at the microgrid. It contains 17 features and was collected over the past 13 months. The dataset is valuable for machine learning applications that can be used to improve the operation and management of microgrids. For example, the data could be used to train machine learning models to predict power outages or to optimise the microgrid's energy consumption.
The dataset is divided into monthly CSV files.
- Total number of CSV files: 15 (1 file per month)
- Data resolution: 10 seconds
- Total number of features: 17
- Start date of data collection: 1 May 2022
- End date: 31 July 2023
Each ...
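A minimal sketch of loading the monthly files into one table; the file-name pattern is a placeholder and should be adjusted to the actual archive contents:

```python
# Read and concatenate the 15 monthly CSV files in chronological order.
import glob
import pandas as pd

files = sorted(glob.glob("*.csv"))           # placeholder pattern
frames = [pd.read_csv(f) for f in files]
data = pd.concat(frames, ignore_index=True)  # rows at 10-second resolution
print(len(files), data.shape)
```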
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored, and there are currently no practical guidelines on the sample size needed to obtain reliable estimates in relation to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and observations largely compensate for each other; meanwhile, only the former drives the estimation of between-individual variability. We conclude with guidelines on the sample size necessary based on the complexity of the data and the study objectives of the practitioners.
This repository contains data generated for the manuscript: "Go multivariate: a Monte Carlo study of a multilevel hidden Markov model with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
(from Wikipedia)
The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
The dataset IRIS.CSV consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The dataset IRIS1.CSV is a modified version of IRIS.CSV, containing missing values.
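A minimal sketch of inspecting and imputing the missing values in IRIS1.CSV; the header layout and the mean-imputation choice are illustrative assumptions, not part of the dataset description:

```python
# Load the modified Iris file, count missing values per column, and fill
# numeric gaps with column means as a simple baseline imputation.
import pandas as pd

df = pd.read_csv("IRIS1.CSV")
print(df.isna().sum())                             # missing values per column
df_filled = df.fillna(df.mean(numeric_only=True))  # mean imputation
print(df_filled.isna().sum())                      # all numeric gaps filled
```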
The dataset, IRIS.CSV, is free and is publicly available at the UCI Machine Learning Repository.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The relevant four principal components (PCs) are given in bold font. Without the present method, only PCs #1–#3, with eigenvalues > 1 [11,12], could be validly retained. The set of three principal components allowed showing that all the different pain measures shared an important common source of variance (PC1); that pain evoked by cold stimuli, with or without sensitization by topical menthol application, by blunt pressure, or by electrical stimuli (5 Hz sine waves) shared a common source of variance (PC2); and that a further common source of variance was shared by pain evoked by heat stimuli, with or without sensitization by topical capsaicin application, or by punctate mechanical pressure. However, applying the method reported here, PC4 can now also be retained, which singles out heat pain, corresponding to the different pathophysiology underlying heat perception. Component loadings are shown for a previously reported real-life example of a principal component analysis performed on the intercorrelation matrix among eight pain threshold measurements ([3]; for comparison, see Table 2 in that publication).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored, and there are currently no practical guidelines on the sample size needed to obtain reliable estimates in relation to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and observations largely compensate for each other; meanwhile, only the former drives the estimation of between-individual variability. We conclude with guidelines on the sample size necessary based on the complexity of the data and the study objectives of the practitioners.
This repository contains data generated for the manuscript: "Go multivariate: recommendations on multilevel hidden Markov models with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.