Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The purpose of data mining analysis is to find patterns in data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before any modelling, the data has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering both as a way to reduce the dimensionality of the data and as a way to create new features. In our project, applying clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which is guaranteed to find the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data's dimensionality can lose a great deal of information, because clustering techniques are based on a metric of 'distance', and in high-dimensional spaces Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information.

From the new-feature perspective: clustering creates labels based on patterns in the data, which introduces uncertainty into the data. When clustering is applied before classification, the choice of the number of clusters strongly affects the quality of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited to it, clustering may improve the overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better.

We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results varied greatly from run to run, which they certainly did, the data may simply not cluster well under the selected methods. The upshot was that our results with clustering in the preprocessing step were not much better than random.

Finally, it is important to ensure that a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as conditions change.
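As a minimal sketch of the two ideas above (cluster labels as an engineered feature, and run-to-run stability as a sanity check), the following uses scikit-learn on synthetic data; the dataset, cluster count, and classifier are placeholders, not our actual pipeline:

```python
# Sketch: k-means cluster labels as an extra feature before classification,
# plus a stability check across runs WITHOUT a fixed random_state.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# If repeated clusterings disagree strongly (low ARI), the data may simply
# not cluster well with this method.
labelings = [KMeans(n_clusters=5, n_init=10).fit_predict(X) for _ in range(5)]
aris = [adjusted_rand_score(labelings[0], lab) for lab in labelings[1:]]
print("pairwise ARI vs. first run:", np.round(aris, 3))

# Append one clustering's labels as a new feature and compare classifiers.
X_aug = np.hstack([X, labelings[0].reshape(-1, 1)])
for name, data in [("raw features", X), ("raw + cluster label", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(name, "accuracy:", round(clf.score(X_te, y_te), 3))
```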
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Descriptive statistics (e.g., mean, median, standard deviation, percentile) of each variable extracted from Principal Component Analysis for each exercise cluster in elite professional female futsal players.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques to reduce multi-dimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both approaches take advantage of the deep relationships that can exist between features of compounds: machine learning provides classification of compounds based on such features, while visualization methods uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and representation of chemical space, and the use of different machine learning methods, separately and together, to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forest (RF), support vector machine (SVM), artificial neural network (ANN), k-nearest-neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, good separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to drugs or other characteristics, such as the specific or multiple disease category(ies) or organ(s) of action of a molecule.
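AL Boost itself is the authors' method and is not reproduced here; as a generic illustration of combining the named base learners, a soft-voting ensemble in scikit-learn might look like the following (a stand-in, not AL Boost's actual combination scheme):

```python
# Generic soft-voting ensemble over the six base learners named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier()),
        ("svm", SVC(probability=True)),  # probability=True enables soft voting
        ("ann", MLPClassifier(max_iter=500)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the predicted class probabilities
)

# Toy data standing in for a compound-feature matrix and drug-likeness labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
ensemble.fit(X, y)
print("training accuracy:", round(ensemble.score(X, y), 3))
```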
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Reduction of size of rule sets for supervised discretisation [%].
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Behçet’s disease (BD) is a chronic multi-systemic vasculitis with a considerable prevalence in Asian countries. Many genes are associated with a higher risk of developing BD, one of which is endoplasmic reticulum aminopeptidase-1 (ERAP1). In this study, we aimed to investigate the interactions of ERAP1 single nucleotide polymorphisms (SNPs) using a novel data mining method called Model-based multifactor dimensionality reduction (MB-MDR).
Methods: We included 748 BD patients and 776 healthy controls. A peripheral blood sample was collected, and eleven SNPs were assessed. We then applied the MB-MDR method to evaluate the interactions of ERAP1 gene polymorphisms.
Results: The TT genotype of rs1065407 had a synergistic effect on BD susceptibility, considering the significant main effect. In the second order of interactions, the CC genotype of rs2287987 and the GG genotype of rs1065407 had the most prominent synergistic effect (β = 12.74). The same genotypes also had significant third-order interactions with the CC genotype of rs26653 and the TT genotype of rs30187 (β = 12.74 and β = 12.73, respectively).
Conclusion: To the best of our knowledge, this is the first study to investigate the interactions of a particular gene’s SNPs in BD patients using this novel data mining method. Future studies investigating the interactions of various genes could clarify this issue further.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is a collection of higher-order mobility datasets, primarily aimed at trajectory data mining applications. These datasets have been created using the Point2Hex tool, allowing us to transform traditional GPS-based geolocations and check-in data into sequences of higher-order geometric elements, particularly hexagons. This transformation has various advantages, including reduced sparsity, analysis at different levels of granularity, improved compatibility with common machine learning architectures, enhanced generalization and overfitting reduction, and efficient visualization.
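For readers unfamiliar with the idea, here is a minimal sketch of mapping a GPS trajectory to a hexagon sequence, using Uber's open-source h3 library rather than the Point2Hex tool itself; the coordinates and resolution are illustrative:

```python
# Map a GPS trajectory to a sequence of hexagon cell IDs.
# Uses the h3 v4 API (latlng_to_cell); h3 v3 called this geo_to_h3.
import h3

resolution = 9  # hexagon granularity: lower values give coarser cells
trajectory = [(40.7484, -73.9857), (40.7527, -73.9772), (40.7587, -73.9787)]

hex_sequence = [h3.latlng_to_cell(lat, lng, resolution) for lat, lng in trajectory]

# Collapse consecutive duplicates: several GPS fixes can fall in one hexagon,
# which is one source of the reduced sparsity mentioned above.
deduped = [h for i, h in enumerate(hex_sequence) if i == 0 or h != hex_sequence[i - 1]]
print(deduped)
```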
Seven popular mobility datasets, typically utilized in various trajectory-related tasks and technical problems, were subjected to this transformation process. These include applications like trajectory prediction, classification, clustering, imputation, and anomaly detection, among others.
To foster the culture of reusability and reproducibility, we are providing not only the transformed higher-order mobility flow datasets but also the source code for the Point2Hex tool and comprehensive documentation. This offering aims to streamline the generation process, ensuring that users have clear guidance on how to reproduce curated or customized versions of these datasets. The material is stored in publicly accessible repositories, ensuring its widespread accessibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The stock market is currently attractive to investors, but it is challenging to develop an efficient investment model with high accuracy, because share values change for political, economic, and social reasons. This paper presents an innovative proposal for a short-term, automatic investment model that reduces capital loss during trading by applying a reinforcement learning (RL) model. We also propose an adaptable data-window structure to enhance the learning and accuracy of investment agents in three markets: crude oil, gold, and the Euro. In addition, the RL model employs an actor-critic neural network with rectified linear unit (ReLU) neurons to generate specialized investment agents, enabling more efficient trading, minimizing investment losses across different time periods, and reducing the model's learning time. In the test phase, with varying initial conditions, the proposed RL model achieved an average loss reduction of 0.03% for the Euro, 0.25% for gold, and 0.13% for crude oil.
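As a rough sketch of the kind of network described (an actor-critic with ReLU units over a window of recent prices), assuming PyTorch and placeholder layer sizes; this is not the paper's implementation:

```python
# Minimal actor-critic network with ReLU activations; sizes and the
# three-action space (e.g., buy/hold/sell) are illustrative assumptions.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # action logits
        self.critic = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs: torch.Tensor):
        z = self.body(obs)
        return torch.distributions.Categorical(logits=self.actor(z)), self.critic(z)

# A 30-step price window as the observation, 3 discrete trading actions.
model = ActorCritic(obs_dim=30, n_actions=3)
dist, value = model(torch.randn(1, 30))
action = dist.sample()
```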
U.S. Government Works: https://www.usa.gov/government-works
Motivation/Problem Statement: DEREChOS is a natural advancement of the existing, highly successful Automated Event Service (AES) project. AES is an advanced system that facilitates efficient exploration and analysis of Earth science data. While AES is well suited to its original purpose of searching for phenomena in regularly gridded data (e.g., reanalyses), targeted extensions would enable a much broader class of Earth science investigations to exploit the performance and flexibility of this service. We present a relevancy scenario, Event-based Hydrometeorological Science Data Analysis, which highlights the need for these features that would maximize the potential of DEREChOS for scientific research.
Proposed solution: We propose to develop DEREChOS, an extension of AES, that: (1) generalizes the underlying representation to support irregularly spaced observations such as point and swath data, (2) incorporates appropriate re-gridding and interpolation utilities to enable analysis across data from different sources, (3) introduces nonlinear dimensionality reduction (NDR) to facilitate identification of scientific relationships among high-dimensional datasets, and (4) integrates Moving Object Database technology to improve treatment of continuity for the events with coarse representation in time. With these features, DEREChOS will become a powerful environment that is appropriate for a very wide variety of Earth science analysis scenarios.
Research strategy: DEREChOS will be created by integrating various separately developed technologies. In most cases this will require some re-implementation to exploit SciDB, the underlying database that has strong support for multidimensional scientific data. Where possible, synthetic data/inputs will be generated to facilitate independent testing of new components. A scientific use case will be used to derive specific interface requirements and to demonstrate integration success.
Significance: Freshwater resources are predicted to be a major focus of contention and conflict in the 21st century. Thus, hydrometeorology and hydrology communities are particularly attracted by the superior research productivity through AES, which has been demonstrated for two real-world use cases. This interest is reflected by the participation in DEREChOS of our esteemed collaborators, who include the Project Scientist of NASA SMAP, the Principal Scientist of NOAA MRMS, and lead algorithm developers of NASA GPM.
Relevance to the Program Element: This proposal responds to the core AIST program topic: 2.1.3 Data-Centric-Technologies. DEREChOS specifically addresses the request for big data analytics, including tools and techniques for data fusion and data mining, applied to the substantial data and metadata that result from Earth science observation and the use of other data-centric technologies.
TRL: Although AES will have achieved an exit TRL of 5 by the start date of this proposed project, DEREChOS will have an entry TRL of 3 due to the new innovations that have not previously been implemented within the underlying SciDB database. We expect that DEREChOS will have an exit TRL of 5 corresponding to an end-to-end test of the full system in a relevant environment.
The land remote sensing community has a long history of using supervised and unsupervised methods to help interpret and analyze remote sensing data sets. Until relatively recently, most remote sensing studies used fairly conventional image processing and pattern recognition methodologies. In the past decade, NASA has launched a series of remote sensing missions known as the Earth Observing System (EOS). The data sets acquired by EOS instruments provide an extremely rich source of information related to the properties and dynamics of the Earth’s terrestrial ecosystems. However, these data are also characterized by large volumes and complex spectral, spatial and temporal attributes. Because of the volume and complexity of EOS data sets, efficient and effective analysis of them presents significant challenges that are difficult to address using conventional remote sensing approaches. In this paper we discuss results from applying a variety of different data mining approaches to global remote sensing data sets. Specifically, we describe three main problem domains and sets of analyses: (1) supervised classification of global land cover using data from NASA’s Moderate Resolution Imaging Spectroradiometer; (2) the use of linear and non-linear clustering and dimensionality reduction methods to examine coupled climate-vegetation dynamics using a twenty-year time series of data from the Advanced Very High Resolution Radiometer; and (3) the use of functional models, non-parametric clustering, and mixture models to help interpret and understand the feature space and class structure of high-dimensional remote sensing data sets. The paper will not focus on specific details of algorithms. Instead we describe key results, successes, and lessons learned from ten years of research focusing on the use of data mining and machine learning methods for remote sensing and Earth science problems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Abstract: This article aims to analyze the factors that determine the self-perception of oral health among Brazilians, on a multidimensional methodological basis. This is a cross-sectional study with data from a national survey. A household interview was conducted with a sample of 60,202 adults. Self-perception of oral health was the outcome variable, and sociodemographic characteristics, self-care and oral health condition, use of dental services, general health, and work condition were the independent variables. The dimensionality reduction test was applied, and the variables that showed a relationship were submitted to logistic regression. A negative oral health condition was related to difficulty feeding, a negative evaluation of the last dental appointment, negative self-perception of the general health condition, not flossing, upper dental loss, and the reason for the last dental appointment. The multidimensional methodological basis was able to produce explanatory models for the self-perception of oral health of Brazilian adults, and these results should be considered in the implementation, evaluation, and qualification of the oral health network.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains the dataset and accompanying scripts used in the study titled "A Machine Learning-Based Modeling Approach for Dye Removal Using Modified Natural Adsorbents."
The dataset used in this investigation includes maximum adsorption capacity (qe) and removal percentage (%) values obtained by removing methylene blue (MB) dye from wastewater using different adsorbent types. These adsorbents were modified by incorporating levulinic acid (LA) into almond, walnut, and apricot kernel powders. The dataset also encompasses pH, adsorbent dose, concentration, temperature, and time values.
The output variables used for modeling are maximum adsorption capacity (qe) and removal percentage (%).
This research received no external funding. However, the datasets used in this study were obtained from the peer-reviewed publication by Süheyla Kocaman (2020), titled "Removal of methylene blue dye from aqueous solutions by adsorption on levulinic acid-modified natural shells", published in Environmental Technology (Vol. 22, pp. 885–895). DOI: https://doi.org/10.1080/15226514.2020.1736512. The original data was reused in this work for machine learning-based modeling purposes with proper attribution.
No license specified: https://academictorrents.com/nolicensespecified
Mountaintop removal coal mining (MTR) has been a major source of landscape change in the Central Appalachians of the United States (US). Changes in stream hydrology, channel geomorphology and water quality caused by MTR coal mining can lead to severe impairment of stream ecological integrity. The objective of the Clean Water Act (CWA) is to restore and maintain the ecological integrity of the Nation’s waters. Sensitive, readily measured indicators of ecosystem structure and function are needed for the assessment of stream ecological integrity. Most such assessments rely on structural indicators; inclusion of functional indicators could make these assessments more holistic and effective. The goals of this study were: (1) to test the efficacy of selected carbon (C) and nitrogen (N) cycling and microbial structural and functional indicators for assessing MTR coal mining impacts on streams; (2) to determine whether indicators respond to impacts in a predictable manner; and (3) to determine whether functional indicators are less likely to change than structural indicators in response to stressors associated with MTR coal mining.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Abstract: The objective of this work was to apply the random forest (RF) algorithm to the modelling of the aboveground carbon (AGC) stock of a tropical forest by testing three feature selection procedures: recursive removal and the uniobjective and multiobjective genetic algorithms (GAs). The database covered 1,007 plots sampled in the Rio Grande watershed, in the state of Minas Gerais, Brazil, and 114 environmental variables (climatic, edaphic, geographic, terrain, and spectral). The best feature selection strategy, RF with the multiobjective GA, reached the lowest root-mean-square error of 17.75 Mg ha-1 with only four spectral variables (normalized difference moisture index, normalized burn ratio 2 correlation texture, tree cover, and latent heat flux), which represents a reduction of 96.5% in the size of the database. Feature selection strategies assist in obtaining better RF performance, by improving the accuracy and reducing the volume of the data. Although recursive removal and the multiobjective GA showed similar performance as feature selection strategies, the latter produced the smallest subset of variables with the highest accuracy. The findings of this study highlight the importance of using near-infrared and short wavelengths, as well as derived vegetation indices, for the remote-sensing-based estimation of AGC. The MODIS products show a significant relationship with the AGC stock and should be further explored by the scientific community for the modelling of this stock.
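A minimal sketch of the recursive-removal strategy with a random forest, using scikit-learn's RFE on synthetic data; the genetic-algorithm variants and the study's real 114-variable database are not reproduced here:

```python
# Recursive feature elimination with a random forest regressor, mirroring the
# "recursive removal" procedure named above (data here is synthetic).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X, y = make_regression(n_samples=500, n_features=114, n_informative=10, random_state=0)

selector = RFE(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    n_features_to_select=4,  # the study's best subset had four variables
    step=5,                  # drop five features per elimination round
).fit(X, y)

print("selected feature indices:",
      [i for i, kept in enumerate(selector.support_) if kept])
```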
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
With the development of information technology in schools, predicting student grades has become a hot area of application in current educational research. Using data mining to analyze the factors that influence students’ performance and to predict their grades can help students identify their shortcomings, help teachers optimize their teaching methods, and enable parents to guide their children’s progress. However, no existing models achieve satisfactory predictions on education-related public datasets, and the many weakly correlated factors in these datasets can adversely affect a model’s predictive performance. To address this issue and provide effective policy recommendations for the modernization of education, this paper seeks the best grade prediction model based on data mining. First, the study uses a Factor Analysis (FA) model to extract features from the original data and achieve dimension reduction. Then, a Bidirectional Gated Recurrent Unit (BiGRU) model with an attention mechanism is used to predict grades. Lastly, comparing the prediction results of ablation experiments and other single models, such as linear regression (LR), back-propagation neural network (BP), random forest (RF), and Gated Recurrent Unit (GRU), the FA-BiGRU-attention model achieves the best prediction effect and performs equally well in different multi-step predictions. Previously, problems with students’ grades were only detected once they had already appeared. The methods presented in this paper, however, enable students’ learning to be predicted in advance and the factors affecting their grades to be identified. This study therefore has great potential to provide data support for the improvement of educational programs, transform the traditional education industry, and ensure the sustainable development of national talent.
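A rough sketch of the FA-BiGRU-attention idea, assuming PyTorch and scikit-learn with placeholder shapes; this illustrates the architecture, not the paper's implementation:

```python
# Factor analysis for dimension reduction, then a bidirectional GRU with a
# simple attention layer over time steps; all sizes are illustrative.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import FactorAnalysis

class BiGRUAttention(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # one attention score per time step
        self.out = nn.Linear(2 * hidden, 1)   # regression head: predicted grade

    def forward(self, x):                       # x: (batch, time, features)
        h, _ = self.gru(x)                      # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        return self.out((w * h).sum(dim=1))     # weighted sum, then prediction

# Toy records: 96 term-level records with 20 raw factors, reduced to 8.
X_raw = np.random.default_rng(0).normal(size=(96, 20))
X_red = FactorAnalysis(n_components=8, random_state=0).fit_transform(X_raw)

# Reshape into sequences: 16 students x 6 terms x 8 reduced factors.
seq = torch.tensor(X_red.reshape(16, 6, 8), dtype=torch.float32)
pred = BiGRUAttention(n_features=8)(seq)  # (16, 1) predicted grades
```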
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the names of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
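As a rough illustration of this count-based pipeline (co-occurrence counts, PPMI with context-distribution smoothing alpha and shift k, then SVD to 300 dimensions), assuming the usual definitions from the lexical-semantics literature rather than LSCDetection's exact code:

```python
# Toy PPMI + SVD pipeline over a random co-occurrence matrix.
import numpy as np

def ppmi(counts: np.ndarray, alpha: float = 0.75, k: float = 1.0) -> np.ndarray:
    """PPMI with context-distribution smoothing (alpha) and shift log(k)."""
    total = counts.sum()
    p_w = counts.sum(axis=1) / total          # word marginals
    p_c = counts.sum(axis=0) ** alpha         # smoothed context marginals
    p_c = p_c / p_c.sum()
    pmi = np.log(np.maximum(counts / total, 1e-12) / np.outer(p_w, p_c)) - np.log(k)
    return np.maximum(pmi, 0.0)

rng = np.random.default_rng(0)
cooc = rng.poisson(1.0, size=(1000, 1000)).astype(float)  # toy counts
M = ppmi(cooc, alpha=0.75, k=1.0)

# SVD to 300 dimensions; gamma=0.0 means the singular values do not weight
# the word vectors (S ** 0 == 1).
U, S, _ = np.linalg.svd(M, full_matrices=False)
gamma = 0.0
vectors = U[:, :300] * (S[:300] ** gamma)
```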
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
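A minimal sketch of the SGNS training call with these hyperparameters in the Gensim 4 API (where `size` has been renamed `vector_size`); the toy corpus stands in for the tokenized, stopword-filtered Diorisis text:

```python
# Train a skipgram-with-negative-sampling model with the listed settings.
from gensim.models import Word2Vec

sentences = [["λόγος", "ἀγαθός", "θεός"]] * 10  # placeholder corpus

model = Word2Vec(
    sentences,
    vector_size=30,  # "size=30" above (Gensim 4 name)
    window=5,
    min_count=5,
    negative=20,
    sg=1,            # 1 = skipgram with negative sampling; 0 = CBOW
)
print(model.wv.most_similar("λόγος", topn=2))
```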
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A collection of 7 datasets with each set containing 3D shapes with varying topological complexity. The datasets can be used to compare different metrics of geometric dissimilarity. Two of the datasets have topologically complex shapes that resemble designs obtained from topology optimization, a widely used design optimization method for engineering structures.
We used this dataset for a related journal article with the following abstract: "In the early stages of engineering design, multitudes of feasible designs can be generated using structural optimization methods by varying the design requirements or user preferences for different performance objectives. Data mining such potentially large datasets is a challenging task. An unsupervised data-centric approach for exploring designs is to find clusters of similar designs and recommend only the cluster representatives for review. Design similarity can be defined not only on a purely functional level but also based on geometric properties, such as size, shape, and topology. While metrics such as chamfer distance measure the geometrical differences intuitively, it is more useful for design exploration to use metrics based on geometric features, which are extracted from high-dimensional 3D geometric data using dimensionality reduction techniques. If the Euclidean distance in the geometric features is meaningful, the features can be combined with performance attributes resulting in an aggregate feature vector that can potentially be useful in design exploration based on both geometry and performance. We propose a novel approach to evaluate such derived metrics by measuring their similarity with the metrics commonly used in 3D object classification. Furthermore, we measure clustering accuracy, which is a state-of-the-art unsupervised approach to evaluate metrics. For this purpose, we use a labeled, synthetic dataset with topologically complex designs. From our results, we conclude that Pointcloud Autoencoder is promising in encoding geometric features and developing a comprehensive design exploration method."
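As a small illustration of one metric mentioned in the abstract, here is a common symmetric form of the chamfer distance between two point clouds, assuming NumPy and SciPy; the datasets' .ply files would be loaded with a point-cloud library instead of the random clouds used here:

```python
# Symmetric chamfer distance: mean nearest-neighbour distance from each
# cloud to the other (other variants, e.g. squared distances, also exist).
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3) point cloud, b: (M, 3) point cloud."""
    d_ab, _ = cKDTree(b).query(a)  # nearest point in b for each point of a
    d_ba, _ = cKDTree(a).query(b)  # and vice versa
    return d_ab.mean() + d_ba.mean()

rng = np.random.default_rng(0)
cloud_a = rng.random((2048, 3))  # placeholder for a design's point cloud
cloud_b = rng.random((2048, 3))
print(chamfer_distance(cloud_a, cloud_b))
```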
For each dataset, shapes/designs are saved as surface mesh files (extension: stl) and point cloud files (extension: ply) in the folders "stls" and "plys" respectively. A brief description of the 7 different datasets is in the following table. For each dataset, the designs are named using numbers starting from 0, e.g., “0.stl, 1.stl, …, 19.stl” in the folder for the surface mesh files. Some of the datasets are labeled, i.e., each design belongs to a class. In a labeled dataset, all classes have the same number of designs, and the designs are named in the order of their class. For example, a labeled dataset with 4 designs and 2 classes contains files whose names start with {0, 1, 2, 3} where the designs {0, 1} belong to class 1, and {2, 3} belong to class 2.
Dataset name                      Directory name               Number of designs   Number of classes
Beam-rotation                     "rotate_beam"                20                  None
Beam-elongation                   "elongate_beam"              20                  None
Beam-translation                  "move_beam"                  20                  None
Three cube trusses                "three_cube_truss"           150                 6
Single cube trusses               "single_cube_truss"          275                 11
Random topologies                 "three_cube_truss_random"    1000                50
Topologically optimized designs   "cube_opt_shapes"            1500                None
Aim: Population dynamics are often tightly linked to the condition of the landscape. Focusing on a landscape impacted by mountaintop removal coal mining (MTR), we ask the following questions: (1) How does MTR influence vital rates including occupancy, colonization and persistence probabilities, and conditional abundance of stream salamander species and life stages? (2) Do species and life stages respond similar to MTR mining or is there significant variation among species and life stages?
Location: Freshwater and terrestrial habitats in Central Appalachia (South‐eastern Kentucky, USA).
Methods: We conducted salamander counts for three consecutive years in 23 headwater stream reaches in forested or previously mined landscapes. We used a hierarchical, N‐mixture model with dynamic occupancy to calculate species‐ and life stage‐specific occupancy, colonization and persistence rates, and abundance given occupancy. We examined the coefficients of the hierarchical priors to determine populat...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Model-based multifactor dimensionality reduction algorithm for assessing the main and interaction effects of 11 ERAP1 SNPs on Behçet’s disease risk (748 Iranian BD patients and 776 healthy individuals).