Multivariate analysis for entire sample using logistic regression.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The relevant four principal components (PCs) are given in bold font. Without the present method, only PCs #1–#3, with eigenvalues > 1 [11,12], could be validly retained. This set of three principal components showed that all pain measures shared an important common source of variance (PC1); that pain evoked by cold stimuli, with or without sensitization by topical menthol application, by blunt pressure, or by electrical stimuli (5 Hz sine waves) shared a common source of variance (PC2); and that a further common source of variance (PC3) was shared by pain evoked by heat stimuli, with or without sensitization by topical capsaicin application, or by punctate mechanical pressure. However, with the method reported here, PC4 can now also be retained, which singles out heat pain, corresponding to the distinct pathophysiology underlying heat perception. The table shows component loadings for a previously reported real-life example of a principal component analysis performed on the intercorrelation matrix among eight pain threshold measurements ([3]; for comparison, see Table 2 in that publication).
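For illustration, here is a minimal Python sketch of the classical retention rule discussed above (the Kaiser criterion: keep PCs with eigenvalues > 1); the random data are stand-ins for the eight pain-threshold measurements, not the published correlation matrix.

```python
import numpy as np

# A minimal sketch of the classical retention rule: keep principal
# components whose eigenvalues exceed 1 (the Kaiser criterion).
# The data below are random stand-ins for the eight pain-threshold
# measurements; this is not the published correlation matrix.
rng = np.random.default_rng(42)
measurements = rng.standard_normal((200, 8))
corr = np.corrcoef(measurements, rowvar=False)  # 8 x 8 correlation matrix

eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
retained = np.flatnonzero(eigenvalues > 1) + 1         # 1-based PC numbers
print("eigenvalues:", np.round(eigenvalues, 2))
print("PCs retained under eigenvalue > 1:", retained)
```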
Additional file 1. R-code and example data to perform the statistical tests described in the manuscript.
Existing methods for constructing confidence bands for multivariate impulse response functions may have poor coverage at long lead times when variables are highly persistent. The goal of this paper is to propose a simple method that is not pointwise and that is robust to the presence of highly persistent processes. We use approximations based on local-to-unity asymptotic theory, and allow the horizon to be a fixed fraction of the sample size. We show that our method has better coverage properties at long horizons than existing methods, and may provide different economic conclusions in empirical applications. We also propose a modification of this method which has good coverage properties at both short and long horizons.
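For context, here is a minimal statsmodels sketch of the conventional workflow the paper improves upon: fitting a VAR and computing impulse responses, for which standard (pointwise) bands would ordinarily be drawn. The persistent toy data are an assumption for illustration, not from the paper.

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Illustrative near-unit-root bivariate series: a slowly accumulating
# component plus noise, mimicking the highly persistent processes
# the paper targets. Purely synthetic.
rng = np.random.default_rng(0)
shocks = rng.standard_normal((400, 2))
data = np.cumsum(0.02 * shocks, axis=0) + shocks

res = VAR(data).fit(maxlags=2)   # fit a VAR(2) by least squares
irf = res.irf(40)                # impulse responses out to horizon 40
print(irf.irfs.shape)            # (horizon+1, neqs, neqs) response array
```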
Background: Small sample sizes combined with multiple correlated endpoints pose a major challenge in the statistical analysis of preclinical neurotrauma studies. The standard approach of applying univariate tests on individual response variables has the advantage of simplicity of interpretation, but it fails to account for the covariance/correlation in the data. In contrast, multivariate statistical techniques might more adequately capture the multi-dimensional pathophysiological pattern of neurotrauma and therefore provide increased sensitivity to detect treatment effects.
Results: We systematically evaluated the performance of univariate ANOVA, Welch's ANOVA and linear mixed effects models versus the multivariate techniques, ANOVA on principal component scores and MANOVA tests, by manipulating factors such as sample and effect size, normality and homogeneity of variance in computer simulations. Linear mixed effects models demonstrated the highest power when variance between groups was equal or the variance ratio was 1:2. In contrast, Welch's ANOVA outperformed the remaining methods under extreme variance heterogeneity. However, power only reached acceptable levels of 80% for large simulated effect sizes with at least 20 measurements per group, or for moderate effects with at least 40 replicates per group. In addition, we evaluated the capacity of the ordination techniques principal component analysis (PCA), redundancy analysis (RDA), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLS-DA) to capture patterns of treatment effects without formal hypothesis testing. While LDA suffered from a high false positive rate due to multicollinearity, PCA, RDA, and PLS-DA were robust, and PLS-DA outperformed PCA and RDA in capturing a true treatment effect pattern.
Conclusions: Multivariate tests do not provide an appreciable increase in power compared to univariate techniques to detect group differences in preclinical studies. However, PLS-DA seems to be a useful ordination technique to explore treatment effect patterns without formal hypothesis testing.
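As a toy illustration of the kind of power simulation described above, here is a sketch reduced to two groups, where one-way ANOVA coincides with the pooled t-test and Welch's ANOVA with Welch's t-test. The effect size and variance settings are assumptions for illustration, not the manuscript's exact scenarios.

```python
import numpy as np
from scipy import stats

# Two-group power simulation under variance heterogeneity: compare the
# pooled-variance test (ANOVA with k=2) against Welch's unequal-variance
# test. Settings below are illustrative assumptions.
rng = np.random.default_rng(0)
n, effect, sd_ratio, reps, alpha = 20, 0.8, 2.0, 2000, 0.05

hits_pooled = hits_welch = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, sd_ratio, n)     # shifted mean, larger variance
    if stats.ttest_ind(a, b, equal_var=True).pvalue < alpha:
        hits_pooled += 1
    if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
        hits_welch += 1

print(f"power pooled: {hits_pooled/reps:.2f}, Welch: {hits_welch/reps:.2f}")
```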
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The story behind this dataset is how to apply an LSTM architecture to multiple variables together to improve forecasting accuracy.
Air Pollution Forecasting: the Air Quality dataset.
This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.
The data includes the date-time, the pollution level as the PM2.5 concentration, and weather information including dew point, temperature, pressure, wind direction, wind speed, and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:
- No: row number
- year: year of data in this row
- month: month of data in this row
- day: day of data in this row
- hour: hour of data in this row
- pm2.5: PM2.5 concentration
- DEWP: dew point
- TEMP: temperature
- PRES: pressure
- cbwd: combined wind direction
- Iws: cumulated wind speed
- Is: cumulated hours of snow
- Ir: cumulated hours of rain

We can use this data to frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour; a minimal framing sketch follows the list.
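One plausible way to set up that supervised framing in Python is sketched below; the file name pollution.csv is an assumption about the local copy of the data.

```python
import pandas as pd

# A minimal sketch of framing next-hour PM2.5 forecasting as supervised
# learning. "pollution.csv" and the column names follow the raw feature
# list above and are assumptions about the local copy of the data.
df = pd.read_csv("pollution.csv")

# Assemble the separate date parts into a single datetime index.
df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])
df = df.drop(columns=["No", "year", "month", "day", "hour"])

# Encode the categorical wind direction as integer codes.
df["cbwd"] = df["cbwd"].astype("category").cat.codes

# Inputs: all features at hour t. Target: PM2.5 at hour t+1.
X = df.iloc[:-1]
y = df["pm2.5"].shift(-1).iloc[:-1]
print(X.shape, y.shape)
```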
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored, and there are currently no practical guidelines on the sample size needed to obtain reliable estimates in relation to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and the number of observations largely compensate for each other, while only the former drives the estimation of between-individual variability. We conclude with guidelines on the necessary sample size based on the complexity of the data and the study objectives of practitioners.
This repository contains data generated for the manuscript "Go multivariate: a Monte Carlo study of a multilevel hidden Markov model with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
Abstract: This contribution provides MATLAB scripts to assist users in factor analysis, constrained least squares regression, and total inversion techniques. These scripts respond to the increased availability of large datasets generated by modern instrumentation, for example, the SedDB database. The download (.zip) includes one descriptive paper (.pdf) and one file of the scripts and example output (.doc). Other Description: Pisias, N. G., R. W. Murray, and R. P. Scudder (2013), Multivariate statistical analysis and partitioning of sedimentary geochemical data sets: General principles and specific MATLAB scripts, Geochem. Geophys. Geosyst., 14, 4015–4020, doi:10.1002/ggge.20247.
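To make the partitioning idea concrete, here is a conceptual Python sketch of constrained (non-negative) least squares partitioning of a bulk composition among candidate end members; it is an illustration on assumed data, not a port of the authors' MATLAB scripts.

```python
import numpy as np
from scipy.optimize import nnls

# Conceptual sketch (not the authors' MATLAB code): partition a bulk
# sediment composition among candidate end members by constrained least
# squares. Columns of A are hypothetical end-member compositions over
# five chemical species; b is the observed bulk composition.
A = np.array([
    [0.60, 0.05, 0.10],
    [0.20, 0.70, 0.05],
    [0.10, 0.10, 0.50],
    [0.05, 0.10, 0.25],
    [0.05, 0.05, 0.10],
])
b = A @ np.array([0.5, 0.3, 0.2])  # synthetic mixture of the end members

weights, residual = nnls(A, b)     # mixing proportions constrained >= 0
print("recovered mixing proportions:", np.round(weights, 3))
```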
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored, and there are currently no practical guidelines on the sample size needed to obtain reliable estimates in relation to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and the number of observations largely compensate for each other, while only the former drives the estimation of between-individual variability. We conclude with guidelines on the necessary sample size based on the complexity of the data and the study objectives of practitioners.
This repository contains data generated for the manuscript "Go multivariate: recommendations on multilevel hidden Markov models with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.
Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both tests show that our algorithms have very high prune rates (>95%), requiring actual disk access for less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
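For orientation, here is a naive linear-scan baseline (with early abandoning) for the search task the paper addresses: finding subsequences that match a query on a chosen subset of variables. This is a hedged sketch of the problem setup, not the RBS or LBS algorithms, and all names in it are illustrative.

```python
import numpy as np

def naive_mts_search(db, query, var_idx, threshold):
    """Linear-scan subsequence search over the variables in var_idx.

    db: (n, d) multivariate series; query: (m, len(var_idx)) pattern.
    Returns start offsets whose squared distance to the query is <= threshold.
    """
    n, m = len(db), len(query)
    hits = []
    for start in range(n - m + 1):
        dist = 0.0
        for t in range(m):
            dist += float(np.sum((db[start + t, var_idx] - query[t]) ** 2))
            if dist > threshold:          # early abandon this offset
                break
        else:
            hits.append(start)
    return hits

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 6))     # hypothetical 6-variable series
query = db[1234:1254, [0, 2]]             # plant a known match on vars 0, 2
print(naive_mts_search(db, query, [0, 2], threshold=1e-9))
```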
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
We are building an HR analytics data set that can be used for creating useful reports, understanding the difference between data and information, and multivariate analysis. The data set we are building is similar to those used in several academic reports and to what may be found in ERP HR subsystems.
We will update the sample data set as we gain a better understanding of the data elements using the calculations that exist in scholarly journals. Specifically, we will use the correlation tables to rebuild the data sets.
The fields represent a fictitious data set where a survey was taken and actual employee metrics exist for a particular organization. None of this data is real.
Prabhjot Singh contributed a portion of the data (the columns on the right before the survey data was added). https://www.kaggle.com/prabhjotindia https://www.kaggle.com/prabhjotindia/visualizing-employee-data/data
About this Dataset
Why are our best and most experienced employees leaving prematurely? Have fun with this database and try to predict which valuable employees will leave next. Fields in the dataset include (a minimal modeling sketch follows the list):
- Satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
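As a starting point for the attrition question, here is a minimal modeling sketch; the file name hr_data.csv and the column names (left, department, salary) are assumptions about the local copy, not guaranteed field names.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# A minimal attrition-modeling sketch. File and column names ("hr_data.csv",
# "left", "department", "salary") are assumptions about the local copy.
df = pd.read_csv("hr_data.csv")
df = pd.get_dummies(df, columns=["department", "salary"])  # one-hot encode

X = df.drop(columns=["left"])   # "left" assumed to flag employees who quit
y = df["left"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```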
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This folder contains four examples of merging crystallographic intensities with a bivariate prior:
Additionally, we provide several auxiliary examples:
Every example includes scripts to run Careless and to analyze the outputs in order to reproduce the figures in the double-Wilson manuscript. Each example folder contains a `README.md` describing its contents.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
Abstract: We present annotated MATLAB scripts (and specific guidelines for their use) for Q-mode factor analysis, a constrained least squares multiple linear regression technique, and a total inversion protocol that are based on the well-known approaches taken by Dymond (1981), Leinen and Pisias (1984), Kyte et al. (1993), and their predecessors. Although these techniques have been used by investigators for decades, their application has been neither consistent nor transparent, as the code has remained in-house or in formats not commonly used by many of today's researchers (e.g., FORTRAN). In addition to providing the annotated scripts and instructions for use, we include a sample data set for the user to test their own manipulation of the scripts. Other Description: Pisias, N. G., R. W. Murray, and R. P. Scudder (2013), Multivariate statistical analysis and partitioning of sedimentary geochemical data sets: General principles and specific MATLAB scripts, Geochem. Geophys. Geosyst., 14, 4015–4020, doi:10.1002/ggge.20247.
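For readers without MATLAB, the core idea of Q-mode factor analysis can be sketched in a few lines of Python; this is a conceptual illustration on random data, not a port of the provided scripts.

```python
import numpy as np

# Conceptual Q-mode factor analysis sketch: samples are row-normalized so
# that similarity between samples is the cosine of the angle between their
# composition vectors; an SVD then yields sample loadings and factor scores.
rng = np.random.default_rng(0)
data = rng.random((30, 8))             # hypothetical: 30 samples x 8 elements

rows = data / np.linalg.norm(data, axis=1, keepdims=True)  # unit row vectors
u, s, vt = np.linalg.svd(rows, full_matrices=False)

k = 3                                  # number of factors retained
loadings = u[:, :k] * s[:k]            # sample loadings on the factors
scores = vt[:k]                        # factor scores (element compositions)
explained = s[:k] ** 2 / np.sum(s ** 2)
print("variance explained by each factor:", np.round(explained, 3))
```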
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Example of data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset was derived by the Bioregional Assessment Programme from "QLD DNRM Hydrochemistry with QA/QC" and "NSW Office of Water Groundwater Quality extract 28_nov_2013" data provided by the Qld DNRM and NSW Office of Water. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The dataset contains the outputs of a multivariate statistical analysis conducted on groundwater chemistry data for different river basins or sub-regions within the CLM bioregion. The analysis was conducted using Statgraphics software. The original datasets were clipped to the CLM bioregion and, after an initial data quality check, only samples that passed the QA/QC criteria (e.g. charge balance within ±5%) were retained. The multivariate statistical analysis of the remaining samples resulted in different groundwater chemistry groups.
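As a conceptual illustration of this kind of grouping (the original analysis used Statgraphics, not Python), here is a hierarchical clustering sketch on hypothetical major-ion data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Conceptual sketch: group water samples by major-ion composition with
# hierarchical clustering. The ion list and random data are purely
# illustrative, not from the CLM dataset.
rng = np.random.default_rng(1)
ions = ["Na", "K", "Ca", "Mg", "Cl", "HCO3", "SO4"]
samples = rng.random((50, len(ions)))        # hypothetical samples (meq/L)

# Standardize so each ion contributes comparably, then cluster (Ward linkage).
z = (samples - samples.mean(axis=0)) / samples.std(axis=0)
groups = fcluster(linkage(z, method="ward"), t=4, criterion="maxclust")
print("samples per chemistry group:", np.bincount(groups)[1:])
```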
The methodology is described in more detail by Raiber et al. (2012).
Raiber M, White PA, Daughney CJ, Tschritter C and Davidson P (2012) Three-dimensional geological modelling and multivariate statistical analysis of water chemistry data to analyse and visualise aquifer structure and groundwater composition in the Wairau Plain, Marlborough District, New Zealand. Journal of Hydrology 436, 13–34.
Bioregional Assessment Programme (2014) CLM - Groundwater Chemistry outputs from multivariate statistics. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/4c128b86-1089-4ba9-85f8-76bbd65db396.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
SPHERE is a students' performance in physics education research dataset, presented as a multi-domain learning dataset of students' performance in physics collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved, from three large public high schools and one small public high school located in a suburban district of a highly populated province in Indonesia. Variables related to demographics, accessibility of literature resources, and students' physics identity were also investigated. The RBAs used in this dataset were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed students' scientific abilities and learning attitudes through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. The conceptual assessments continued in the second semester with the Rotational and Rolling Motion Conceptual Survey (RRMCS), the Fluid Mechanics Concept Inventory (FMCI), the Mechanical Waves Conceptual Survey (MWCS), the Thermal Concept Evaluation (TCE), and the Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, research on machine learning and data mining techniques in PER often faces challenges due to the lack of datasets collected for the specific purposes of PER studies. SPHERE can be reused as a students' performance dataset in physics dedicated to PER scholars who wish to apply machine learning techniques in physics education.
MIT License: https://opensource.org/licenses/MIT
This dataset provides simulated learning behavior data for 15 classes over 148 days, from August 31, 2023, to January 25, 2024. The data includes basic learner information (a total of 1,364 learners), basic information about the exercises (a total of 44 items), and learner submission behavior logs (a total of 232,818 records). All data is provided in CSV format. The dataset contains noise such as missing values, outliers, or data inconsistencies (e.g., invalid classes, missing log entries, etc.), which participants need to identify and handle. The specific fields of the three data tables are described as follows:
Data_StudentInfo.csv
| Field Name | Description | Remarks |
|---|---|---|
| index | Learner index | |
| student_ID | Learner ID | Unique identifier |
| sex | Gender | |
| age | Age | |
| major | Major | |
Data_TitleInfo.csv
| Field Name | Description | Remarks |
|---|---|---|
| index | Exercise index | |
| title_ID | Exercise ID | Unique identifier |
| score | Exercise score | |
| knowledge | Knowledge points | Each exercise may test multiple knowledge points |
| sub_knowledge | Sub-knowledge points | Knowledge points may have multiple sub-knowledge points |
The Data_SubmitRecord folder contains the learner submission behavior log data for 15 classes (Class1~Class15). For example, the file SubmitRecord-Class1.csv contains the submission logs for Class 1.
| Field Name | Description | Remarks |
|---|---|---|
| index | Record index | |
| class | Class | |
| time | Log generation time | Timestamp, accurate to the second |
| state | Submission state | Examples include fully correct, partially correct, etc., with a total of 12 statuses |
| score | Submission score | Score obtained from test cases |
| title_ID | Exercise ID | References title_ID in the exercise basic information table |
| method | Language | Programming language used by the learner |
| memory | Memory | Unit: KB |
| timeconsume | Time consumed | Unit: milliseconds |
| student_ID | Learner ID | References student_ID in the learner basic information table |
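A minimal sketch of loading and joining the three tables is given below; the paths mirror the file names above, and the status label used in the diagnostic is an assumption to be checked against the 12 actual state strings in the data.

```python
import pandas as pd

# A minimal sketch of loading and joining the three tables; the paths
# follow the file names above and assume the archive's folder layout.
students = pd.read_csv("Data_StudentInfo.csv")
titles = pd.read_csv("Data_TitleInfo.csv")
logs = pd.read_csv("Data_SubmitRecord/SubmitRecord-Class1.csv")

# Attach learner and exercise attributes to each submission record.
logs = logs.merge(students, on="student_ID", how="left")
logs = logs.merge(titles, on="title_ID", how="left", suffixes=("", "_max"))

# Example diagnostic: per-learner share of fully correct submissions.
# "Absolutely_Correct" is an assumed label; inspect logs["state"].unique()
# for the 12 actual status strings before relying on it.
correct = logs["state"].eq("Absolutely_Correct")
pass_rate = correct.groupby(logs["student_ID"]).mean()
print(pass_rate.describe())
```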
NorthClass is a renowned higher education training institution offering over 100 courses across a wide range of disciplines, including literature, science, engineering, medicine, economics, and management. With approximately 300,000 registered learners, the institution has created a flexible and convenient learning environment by providing high-quality educational services.
To keep up with the trends of the digital age and enhance its market competitiveness in the technology sector, NorthClass has developed a programming course. Learners are required to complete designated programming tasks during the course, with the opportunity for multiple attempts and submissions to ensure mastery and application of the learned knowledge. At the end of the course, the institution collected learners' time-series learning data to evaluate whether the teaching outcomes met predefined standards and requirements.
To optimize teaching resources and improve the quality of instruction, the institution plans to establish a specialized "Innovative Learning Development Group." This group will explore how to leverage next-generation AI technologies to empower education and better cultivate innovative talent capable of meeting the demands of the modern era.
Visualization and visual analysis utilize the high-bandwidth capabilities of human visual perception to transform complex time-series learning behavior data into graphical representations. These techniques enable the diagnosis and analysis of learners' knowledge mastery levels, dynamic tracking of the evolution of learning behaviors, and identification and analysis of potential factors causing learning difficulties.
As a member of the Innovative Learning Development Group, your task is to design and implement a ...
Multivariate analysis of a binary attitudes variable (positive versus negative attitudes) (N = 4,670, weighted sample).