67 datasets found
  1. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering were not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and in high dimensions Euclidean distance loses pretty much all meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, our results were not much better than random when clustering was applied in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
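
    The run-to-run stability check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (kmeans_labels, rand_index, stability), the plain Lloyd's k-means, and the use of the Rand index as the agreement measure are all assumptions made for the example. Each seed stands in for a different random_state.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, seed, iters=50):
    """Plain Lloyd's k-means on a list of tuples; returns one cluster id per point."""
    rng = random.Random(seed)          # the seed plays the role of random_state
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, seeds):
    """Mean pairwise Rand index over k-means runs that differ only in their seed."""
    runs = [kmeans_labels(points, k, s) for s in seeds]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two visibly separated groups of 2-D points.
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
    print(stability(data, k=2, seeds=[0, 1, 2, 3]))
```

    A score near 1 means the partitions barely change between seeds; a low score is the kind of instability the description reports. The Rand index is used here because it ignores how cluster ids happen to be numbered in each run.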

  2. Supplementary file

    • zenodo.org
    • data.niaid.nih.gov
    Updated Mar 25, 2022
    Cite
    Markel Rico-González; Daniel Puche-Ortuño; Filipe Manuel Clemente; Rodrigo Aquino; José Pino-Ortega (2022). Supplementary file [Dataset]. http://doi.org/10.5281/zenodo.6383007
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markel Rico-González; Daniel Puche-Ortuño; Filipe Manuel Clemente; Rodrigo Aquino; José Pino-Ortega
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive statistics (e.g., mean, median, standard deviation, percentile) of each variable extracted from Principal Component Analysis for each exercise cluster in elite professional female futsal players.

  3. Table_1_Data Mining and Machine Learning Models for Predicting Drug Likeness...

    • frontiersin.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Abraham Yosipof; Rita C. Guedes; Alfonso T. García-Sosa (2023). Table_1_Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category.CSV [Dataset]. http://doi.org/10.3389/fchem.2018.00162.s001
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Abraham Yosipof; Rita C. Guedes; Alfonso T. García-Sosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both approaches take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features or, in the case of visualization methods, uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule.
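
    For readers unfamiliar with the reported metrics, the sketch below shows how sensitivity, specificity, and a classification rate are usually computed for a binary task such as drug vs. non-drug. It is an illustration only: the function name binary_rates and the toy labels are hypothetical, and it assumes the "corrected classification rate" in the description means the fraction of compounds classified correctly.

```python
def binary_rates(y_true, y_pred):
    """Sensitivity, specificity and classification rate for 0/1 labels.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    return {
        "sensitivity": tp / (tp + fn),                  # share of actual positives found
        "specificity": tn / (tn + fp),                  # share of actual negatives found
        "classification_rate": (tp + tn) / len(y_true)  # share of all cases correct
    }


if __name__ == "__main__":
    # Hypothetical labels: 1 = drug-like, 0 = not drug-like.
    print(binary_rates([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```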

  4. Reduction of size of rule sets for supervised discretisation [%].

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Urszula Stańczyk; Beata Zielosko; Grzegorz Baron (2023). Reduction of size of rule sets for supervised discretisation [%]. [Dataset]. http://doi.org/10.1371/journal.pone.0231788.t006
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Urszula Stańczyk; Beata Zielosko; Grzegorz Baron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reduction of size of rule sets for supervised discretisation [%].

  5. ERAP1 polymorphisms interactions and their association with Behçet’s disease...

    • plos.figshare.com
    docx
    Updated Jun 11, 2023
    Cite
    Parisa Riahi; Anoshirvan Kazemnejad; Shayan Mostafaei; Akira Meguro; Nobuhisa Mizuki; Amir Ashraf-Ganjouei; Ali Javinani; Seyedeh Tahereh Faezi; Farhad Shahram; Mahdi Mahmoudi (2023). ERAP1 polymorphisms interactions and their association with Behçet’s disease susceptibly: Application of Model-Based Multifactor Dimension Reduction Algorithm (MB-MDR) [Dataset]. http://doi.org/10.1371/journal.pone.0227997
    Available download formats: docx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Parisa Riahi; Anoshirvan Kazemnejad; Shayan Mostafaei; Akira Meguro; Nobuhisa Mizuki; Amir Ashraf-Ganjouei; Ali Javinani; Seyedeh Tahereh Faezi; Farhad Shahram; Mahdi Mahmoudi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Behçet’s disease (BD) is a chronic multi-systemic vasculitis with a considerable prevalence in Asian countries. There are many genes associated with a higher risk of developing BD, one of which is endoplasmic reticulum aminopeptidase-1 (ERAP1). In this study, we aimed to investigate the interactions of ERAP1 single nucleotide polymorphisms (SNPs) using a novel data mining method called Model-based multifactor dimensionality reduction (MB-MDR).

    Methods: We have included 748 BD patients and 776 healthy controls. A peripheral blood sample was collected, and eleven SNPs were assessed. Furthermore, we have applied the MB-MDR method to evaluate the interactions of ERAP1 gene polymorphisms.

    Results: The TT genotype of rs1065407 had a synergistic effect on BD susceptibility, considering the significant main effect. In the second order of interactions, CC genotype of rs2287987 and GG genotype of rs1065407 had the most prominent synergistic effect (β = 12.74). The mentioned genotypes also had significant interactions with CC genotype of rs26653 and TT genotype of rs30187 in the third-order (β = 12.74 and β = 12.73, respectively).

    Conclusion: To the best of our knowledge, this is the first study investigating the interaction of a particular gene’s SNPs in BD patients by applying a novel data mining method. However, future studies investigating the interactions of various genes could clarify this issue.

  6. Higher-order Mobility Flow Data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 24, 2023
    Cite
    Ali Faraji; Jing Li; Gian Alix; Mahmoud Alsaeed; Nina Yanin; Amirhossein Nadiri; Manos Papagelis (2023). Higher-order Mobility Flow Data [Dataset]. http://doi.org/10.5281/zenodo.8076553
    Available download formats: zip
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Faraji; Jing Li; Gian Alix; Mahmoud Alsaeed; Nina Yanin; Amirhossein Nadiri; Manos Papagelis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a collection of higher-order mobility datasets, primarily aimed at trajectory data mining applications. These datasets have been created using the Point2Hex tool, allowing us to transform traditional GPS-based geolocations and check-in data into sequences of higher-order geometric elements, particularly hexagons. This transformation has various advantages, including reduced sparsity, analysis at different levels of granularity, improved compatibility with common machine learning architectures, enhanced generalization and overfitting reduction, and efficient visualization.

    Seven popular mobility datasets, typically utilized in various trajectory-related tasks and technical problems, were subjected to this transformation process. These include applications like trajectory prediction, classification, clustering, imputation, and anomaly detection, among others.

    To foster the culture of reusability and reproducibility, we are providing not only the transformed higher-order mobility flow datasets but also the source code for the Point2Hex tool and comprehensive documentation. This offering aims to streamline the generation process, ensuring that users have clear guidance on how to reproduce curated or customized versions of these datasets. The material is stored in publicly accessible repositories, ensuring its widespread accessibility.
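
    The point-to-hexagon idea described above can be illustrated with a small sketch. This is not the Point2Hex implementation: the function names (point_to_hex, to_hex_sequence), the pointy-top axial grid, and the treatment of coordinates as planar x/y values are assumptions made for the example. A real tool working on GPS data would need a grid defined on the Earth's surface.

```python
import math


def _hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon cell."""
    s = -q - r                                # third cube coordinate, q + r + s = 0
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error from the other two.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def point_to_hex(x, y, size):
    """Map a planar point to the axial (q, r) id of the pointy-top hexagon containing it.

    `size` is the distance from a hexagon's centre to one of its corners.
    """
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 * y / 3) / size
    return _hex_round(q, r)


def to_hex_sequence(points, size):
    """Turn a trajectory of points into a sequence of cells, merging consecutive repeats."""
    cells = []
    for x, y in points:
        cell = point_to_hex(x, y, size)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


if __name__ == "__main__":
    # Hypothetical trajectory in arbitrary planar units.
    track = [(0.0, 0.0), (0.1, 0.1), (math.sqrt(3), 0.0)]
    print(to_hex_sequence(track, size=1.0))
```

    Choosing a larger size yields coarser cells and shorter sequences, which corresponds to the "different levels of granularity" mentioned in the description.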

  7. Data from: Stock market trading via actor-critic reinforcement learning and...

    • data.mendeley.com
    Updated Sep 12, 2024
    Cite
    Cesar Guevara (2024). Stock market trading via actor-critic reinforcement learning and adaptable data structure [Dataset]. http://doi.org/10.17632/9bp5bd7gn4.1
    Dataset updated
    Sep 12, 2024
    Authors
    Cesar Guevara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Currently, the stock market is attractive, and it is challenging to develop an efficient investment model with high accuracy due to changes in the values of the shares for political, economic, and social reasons. This paper presents an innovative proposal for a short-term, automatic investment model to reduce capital loss during trading, applying a reinforcement learning (RL) model. On the other hand, we propose an adaptable data window structure to enhance the learning and accuracy of investment agents in three foreign exchange markets: crude oil, gold, and the Euro. In addition, the RL model employs an actor-critic neural network with rectified linear unit (ReLU) neurons to generate specialized investment agents, enabling more efficient trading, minimizing investment losses across different time periods, and reducing the model's learning time. The proposed RL model obtained a reduction average loss of 0.03% in Euro, 0.25% in Gold, and 0.13% in Crude Oil in the test phase with varying initial conditions.

  8. DEREChOS: Data Environment for Rapid Exploration and Characterization of...

    • data.amerigeoss.org
    html
    Updated Jul 26, 2019
    Cite
    United States[old] (2019). DEREChOS: Data Environment for Rapid Exploration and Characterization of Organized Systems [Dataset]. https://data.amerigeoss.org/dataset/derechos-data-environment-for-rapid-exploration-and-characterization-of-organized-systems
    Available download formats: html
    Dataset updated
    Jul 26, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Motivation/Problem Statement: DEREChOS is a natural advancement of the existing, highly-successful Automated Event Service (AES) project. AES is an advanced system that facilitates efficient exploration and analysis of Earth science data. While AES is well-suited for the original purpose of searching for phenomena in regularly gridded data (e.g., reanalyses), targeted extensions would enable a much broader class of Earth science investigations to exploit the performance and flexibility of this service. We present a relevancy scenario, Event-based Hydrometeorological Science Data Analysis, which highlights the need for these features that would maximize the potential of DEREChOS for scientific research.

    Proposed solution: We propose to develop DEREChOS, an extension of AES, that: (1) generalizes the underlying representation to support irregularly spaced observations such as point and swath data, (2) incorporates appropriate re-gridding and interpolation utilities to enable analysis across data from different sources, (3) introduces nonlinear dimensionality reduction (NDR) to facilitate identification of scientific relationships among high-dimensional datasets, and (4) integrates Moving Object Database technology to improve treatment of continuity for the events with coarse representation in time. With these features, DEREChOS will become a powerful environment that is appropriate for a very wide variety of Earth science analysis scenarios.

    Research strategy: DEREChOS will be created by integrating various separately developed technologies. In most cases this will require some re-implementation to exploit SciDB, the underlying database that has strong support for multidimensional scientific data. Where possible, synthetic data/inputs will be generated to facilitate independent testing of new components. A scientific use case will be used to derive specific interface requirements and to demonstrate integration success.

    Significance: Freshwater resources are predicted to be a major focus of contention and conflict in the 21st century. Thus, hydrometeorology and hydrology communities are particularly attracted by the superior research productivity through AES, which has been demonstrated for two real-world use cases. This interest is reflected by the participation in DEREChOS of our esteemed collaborators, who include the Project Scientist of NASA SMAP, the Principal Scientist of NOAA MRMS, and lead algorithm developers of NASA GPM.

    Relevance to the Program Element: This proposal responds to the core AIST program topic: 2.1.3 Data-Centric-Technologies. DEREChOS specifically addresses the request for big data analytics, including tools and techniques for data fusion and data mining, applied to the substantial data and metadata that result from Earth science observation and the use of other data-centric technologies.

    TRL: Although AES will have achieved an exit TRL of 5 by the start date of this proposed project, DEREChOS will have an entry TRL of 3 due to the new innovations that have not previously been implemented within the underlying SciDB database. We expect that DEREChOS will have an exit TRL of 5 corresponding to an end-to-end test of the full system in a relevant environment.

  9. Friedl presentation at CIDU

    • catalog.data.gov
    • data.nasa.gov
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Friedl presentation at CIDU [Dataset]. https://catalog.data.gov/dataset/friedl-presentation-at-cidu
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The land remote sensing community has a long history of using supervised and unsupervised methods to help interpret and analyze remote sensing data sets. Until relatively recently, most remote sensing studies have used fairly conventional image processing and pattern recognition methodologies. In the past decade, NASA has launched a series of remote sensing missions known as the Earth Observing System (EOS). The data sets acquired by EOS instruments provide an extremely rich source of information related to the properties and dynamics of the Earth’s terrestrial ecosystems. However, these data are also characterized by large volumes and complex spectral, spatial and temporal attributes. Because of the volume and complexity of EOS data sets, efficient and effective analysis of them presents significant challenges that are difficult to address using conventional remote sensing approaches. In this paper we discuss results from applying a variety of different data mining approaches to global remote sensing data sets. Specifically, we describe three main problem domains and sets of analyses: (1) supervised classification of global land cover using data from NASA’s Moderate Resolution Imaging Spectroradiometer; (2) the use of linear and non-linear cluster and dimensionality reduction methods to examine coupled climate-vegetation dynamics using a twenty-year time series of data from the Advanced Very High Resolution Radiometer; and (3) the use of functional models, non-parametric clustering, and mixture models to help interpret and understand the feature space and class structure of high-dimensional remote sensing data sets. The paper will not focus on specific details of algorithms. Instead we describe key results, successes, and lessons learned from ten years of research focusing on the use of data mining and machine learning methods for remote sensing and Earth science problems.

  10. Data from: Characterization of the self-perception of oral health in the...

    • scielo.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Danielle Bordin; Cristina Berger Fadel; Suzely Adas Saliba Moimaz; Celso Bilynkievycz dos Santos; Cléa Adas Saliba Garbin; Nemre Adas Saliba (2023). Characterization of the self-perception of oral health in the Brazilian adult population [Dataset]. http://doi.org/10.6084/m9.figshare.14284268.v1
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    SciELO journals
    Authors
    Danielle Bordin; Cristina Berger Fadel; Suzely Adas Saliba Moimaz; Celso Bilynkievycz dos Santos; Cléa Adas Saliba Garbin; Nemre Adas Saliba
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This article aims to analyze the factors that determine the self-perception of oral health of Brazilians, using a multidimensional methodological basis. This is a cross-sectional study with data from a national survey. A household interview was conducted with a sample of 60,202 adults. Self-perception of oral health was considered the outcome variable, and sociodemographic characteristics, self-care and oral health condition, use of dental services, general health and work condition were considered independent variables. The dimensionality reduction test was used, and the variables that showed a relationship were submitted to logistic regression. The negative oral health condition was related to difficulty feeding, negative evaluation of the last dental appointment, negative self-perception of general health condition, not flossing, upper dental loss, and reason for the last dental appointment. The use of a multidimensional methodological basis made it possible to design explanatory models for the self-perception of oral health of Brazilian adults, and these results should be considered in the implementation, evaluation, and qualification of the oral health network.

  11. A MACHINE LEARNING-BASED MODELING APPROACH FOR DYE REMOVAL USING MODIFIED...

    • zenodo.org
    Updated Jun 13, 2025
    Cite
    Betül Uzbaş; Suheyla Kocaman (2025). A MACHINE LEARNING-BASED MODELING APPROACH FOR DYE REMOVAL USING MODIFIED NATURAL ADSORBENTS [Dataset]. http://doi.org/10.5281/zenodo.15655679
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Betül Uzbaş; Suheyla Kocaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 13, 2025
    Description

    This repository contains the dataset and accompanying scripts used in the study titled "A Machine Learning-Based Modeling Approach for Dye Removal Using Modified Natural Adsorbents."

    The dataset used in this investigation includes maximum adsorption capacity (qe) and removal percentage (%) values obtained by removing methylene blue (MB) dye from wastewater using different adsorbent types. These adsorbents were modified by incorporating levulinic acid (LA) into almond, walnut, and apricot kernel powders. The input variables in the dataset are pH, adsorbent dose, concentration, temperature, and time.

    The output variables used for modeling are maximum adsorption capacity (qe) and removal percentage (%).

    This research received no external funding. However, the datasets used in this study were obtained from the peer-reviewed publication by Süheyla Kocaman (2020), titled "Removal of methylene blue dye from aqueous solutions by adsorption on levulinic acid-modified natural shells", published in Environmental Technology (Vol. 22, pp. 885–895). DOI: https://doi.org/10.1080/15226514.2020.1736512. The original data was reused in this work for machine learning-based modeling purposes with proper attribution.

  12. Stanford CS229 - Machine Learning - Andrew Ng

    • academictorrents.com
    bittorrent
    Updated Apr 24, 2015
    Cite
    Andrew Ng (2015). Stanford CS229 - Machine Learning - Andrew Ng [Dataset]. https://academictorrents.com/details/da90dedfb78190e5c62af1ad40a2413cb918457f
    Available download formats: bittorrent (4211379788)
    Dataset updated
    Apr 24, 2015
    Dataset authored and provided by
    Andrew Ng
    License

    https://academictorrents.com/nolicensespecified

    Description

    Course Description: This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs; VC theory; large margins); reinforcement learning and adaptive control. The course will also discuss recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.

    Prerequisites: Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with basic probability theory. (CS109 or Stat116 is sufficient but not necessary.) Familiarity with the basic l

  13. Data from: Mountaintop Removal Coal Mining Impacts on Structural and...

    • catalog.data.gov
    Updated Jan 8, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Mountaintop Removal Coal Mining Impacts on Structural and Functional Indicators in Central Appalachian Streams [Dataset]. https://catalog.data.gov/dataset/mountaintop-removal-coal-mining-impacts-on-structural-and-functional-indicators-in-central
    Dataset updated
    Jan 8, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Appalachian Mountains
    Description

    Mountaintop removal coal mining (MTR) has been a major source of landscape change in the Central Appalachians of the United States (US). Changes in stream hydrology, channel geomorphology and water quality caused by MTR coal mining can lead to severe impairment of stream ecological integrity. The objective of the Clean Water Act (CWA) is to restore and maintain the ecological integrity of the Nation’s waters. Sensitive, readily measured indicators of ecosystem structure and function are needed for the assessment of stream ecological integrity. Most such assessments rely on structural indicators; inclusion of functional indicators could make these assessments more holistic and effective. The goals of this study were: (1) test the efficacy of selected carbon (C) and nitrogen (N) cycling and microbial structural and functional indicators for assessing MTR coal mining impacts on streams; (2) determine whether indicators respond to impacts in a predictable manner and (3) determine if functional indicators are less likely to change than are structural indicators in response to stressors associated with MTR coal mining.

  14. Data mining applied to feature selection methods for aboveground carbon...

    • scielo.figshare.com
    tiff
    Updated Jun 3, 2023
    Cite
    Mônica Canaan Carvalho; Lucas Rezende Gomide; José Roberto Soares Scolforo; Kalill José Viana da Páscoa; Laís Almeida Araújo; Isáira Leite e Lopes (2023). Data mining applied to feature selection methods for aboveground carbon stock modelling [Dataset]. http://doi.org/10.6084/m9.figshare.21679161.v1
    Available download formats: tiff
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    SciELO journals
    Authors
    Mônica Canaan Carvalho; Lucas Rezende Gomide; José Roberto Soares Scolforo; Kalill José Viana da Páscoa; Laís Almeida Araújo; Isáira Leite e Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The objective of this work was to apply the random forest (RF) algorithm to the modelling of the aboveground carbon (AGC) stock of a tropical forest by testing three feature selection procedures: recursive removal and the uniobjective and multiobjective genetic algorithms (GAs). The database used covered 1,007 plots sampled in the Rio Grande watershed, in the state of Minas Gerais, Brazil, and 114 environmental variables (climatic, edaphic, geographic, terrain, and spectral). The best feature selection strategy, RF with multiobjective GA, reaches the lowest root-square error of 17.75 Mg ha⁻¹ with only four spectral variables (normalized difference moisture index, normalized burn ratio 2 correlation texture, tree cover, and latent heat flux), which represents a reduction of 96.5% in the size of the database. Feature selection strategies help to obtain a better RF performance by improving the accuracy and reducing the volume of the data. Although the recursive removal and multiobjective GA showed a similar performance as feature selection strategies, the latter yields the smallest subset of variables, with the highest accuracy. The findings of this study highlight the importance of using near infrared, short wavelengths, and derived vegetation indices for the remote-sensing-based estimation of AGC. The MODIS products show a significant relationship with the AGC stock and should be further explored by the scientific community for the modelling of this stock.

  15. f

    Parameters of each model.

    • plos.figshare.com
    xls
    Updated Oct 25, 2023
    Cite
    Chengxin Yin; Dezhao Tang; Fang Zhang; Qichao Tang; Yang Feng; Zhen He (2023). Parameters of each model. [Dataset]. http://doi.org/10.1371/journal.pone.0286156.t004
    Available download formats: xls
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chengxin Yin; Dezhao Tang; Fang Zhang; Qichao Tang; Yang Feng; Zhen He
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the development of information technology in schools, predicting student grades has become an active application area in educational research. Using data mining to analyze the factors that influence students' performance and to predict their grades can help students identify their shortcomings, help teachers optimize their teaching methods, and enable parents to guide their children's progress. However, no existing model achieves satisfactory predictions on education-related public datasets, and the many weakly correlated factors in these datasets can still adversely affect a model's predictions. To address this issue and provide effective policy recommendations for the modernization of education, this paper seeks the best grade prediction model based on data mining. First, the study uses factor analysis (FA) to extract features from the original data and reduce its dimension. Then, a Bidirectional Gated Recurrent Unit (BiGRU) model with an attention mechanism is used to predict grades. Lastly, in comparisons against ablation experiments and single models such as linear regression (LR), back propagation neural network (BP), random forest (RF), and Gated Recurrent Unit (GRU), the FA-BiGRU-attention model achieves the best prediction results and performs equally well in different multi-step predictions. Previously, problems with students' grades were detected only after they had appeared. The methods presented in this paper make it possible to predict students' learning in advance and to identify the factors affecting their grades. Therefore, this study has great potential to provide data support for the improvement of educational programs, transform the traditional education industry, and ensure the sustainable development of national talents.

  16. Z

    Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Pedrazzini (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Peels-Matthey
    Nissim
    McGillivray
    Pedrazzini
    Stopponi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is added to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
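
To make the roles of the hyperparameters k and alpha listed above more concrete, here is a minimal sketch of one common formulation of PPMI with a shift (k) and context-distribution smoothing (alpha). Whether LSCDetection computes exactly this is an assumption, and the function name `ppmi` and its input format (a list of (word, context) pairs already extracted with some window size) are hypothetical.

```python
import math
from collections import Counter


def ppmi(pairs, k=1, alpha=0.75):
    """Return {(word, context): PPMI value} for co-occurrence pairs.

    Assumed formulation: max(0, log(P(w,c) / (P(w) * P_alpha(c))) - log(k)),
    where P_alpha(c) is the context probability with counts raised to alpha.
    """
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    # Normaliser for the smoothed context distribution.
    ctx_norm = sum(n ** alpha for n in ctx_counts.values())

    result = {}
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        p_w = word_counts[w] / total
        p_c = (ctx_counts[c] ** alpha) / ctx_norm
        value = math.log(p_wc / (p_w * p_c)) - math.log(k)
        if value > 0:  # keep only positive associations
            result[(w, c)] = value
    return result
```

With k=1 the shift term vanishes, so the value reduces to plain PPMI computed against the smoothed context distribution.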

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  17. Data from: When does extreme drought elicit extreme ecological responses?

    • zenodo.org
    • datadryad.org
    Updated Jun 1, 2022
    Cite
    Fangyue Zhang; Quan Quan; Fangfang Ma; Dashuan Tian; David L. Hoover; Qingping Zhou; Shuli Niu (2022). Data from: When does extreme drought elicit extreme ecological responses? [Dataset]. http://doi.org/10.5061/dryad.rj6h2q7
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Fangyue Zhang; Quan Quan; Fangfang Ma; Dashuan Tian; David L. Hoover; Qingping Zhou; Shuli Niu
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1. Global climate change models predict an increase in the frequency and intensity of extreme droughts, with uncertain ecological impacts across ecosystems. In particular, it is not clear when extreme droughts will elicit extreme ecological responses.

    2. For this study, we employed three complementary approaches to explore the relationships between extreme drought and ecosystem responses. First, we used global data mining to evaluate the relationship between extreme gross primary productivity (GPP) and extreme precipitation from 1980 to 2013. Second, we conducted a meta‐analysis using 132 drought experiments across the globe to assess the response ratios of aboveground net primary productivity (ANPP) to extreme vs. non‐extreme drought treatments. Third, we examined community and ecosystem responses in an alpine meadow to a drought gradient experiment, which included five precipitation treatment levels (1/12 P, 1/4 P, 1/2 P, 3/4 P, and P, where P is the growing season precipitation).

    3. This study had three key results. In our historical data mining, we found that extreme droughts elicited extreme ecological responses only 15.1% of the time. The meta‐analysis results indicated that there were no significant differences in the response ratios of ANPP between the extreme vs. non‐extreme drought treatments. The drought gradient experiment results revealed that although the four drought treatments were statistically extreme, only the most extreme drought treatment (1/12 P) significantly reduced ANPP over the three years. Meanwhile, species richness and asynchrony were significantly reduced under the 1/12 P treatment, which led to a significant reduction in productivity.

    4. Synthesis. These results suggest that extreme ecological responses to extreme drought may be less frequent than previously thought. But when they do occur, extreme ecological responses may be driven by plant community changes such as species asynchrony, species loss or species reordering. Our experimental results highlight the key role of community dynamics in determining the resistance of ecosystem productivity to extreme drought, which should be assessed when predicting ecological responses to climate change.

  18. Z

    Diverse Topologies for Evaluation of Geometric Similarity Metrics

    • data.niaid.nih.gov
    Updated Mar 16, 2022
    Cite
    Markus Olhofer (2022). Diverse Topologies for Evaluation of Geometric Similarity Metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6323250
    Dataset updated
    Mar 16, 2022
    Dataset provided by
    Mariusz Bujny
    Markus Olhofer
    Fabian Duddeck
    Stefan Menzel
    Nivesh Dommaraju
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 7 datasets with each set containing 3D shapes with varying topological complexity. The datasets can be used to compare different metrics of geometric dissimilarity. Two of the datasets have topologically complex shapes that resemble designs obtained from topology optimization, a widely used design optimization method for engineering structures.

    We used this dataset for a related journal article with the following abstract: "In the early stages of engineering design, multitudes of feasible designs can be generated using structural optimization methods by varying the design requirements or user preferences for different performance objectives. Data mining such potentially large datasets is a challenging task. An unsupervised data-centric approach for exploring designs is to find clusters of similar designs and recommend only the cluster representatives for review. Design similarity can be defined not only on a purely functional level but also based on geometric properties, such as size, shape, and topology. While metrics such as chamfer distance measure the geometrical differences intuitively, it is more useful for design exploration to use metrics based on geometric features, which are extracted from high-dimensional 3D geometric data using dimensionality reduction techniques. If the Euclidean distance in the geometric features is meaningful, the features can be combined with performance attributes resulting in an aggregate feature vector that can potentially be useful in design exploration based on both geometry and performance. We propose a novel approach to evaluate such derived metrics by measuring their similarity with the metrics commonly used in 3D object classification. Furthermore, we measure clustering accuracy, which is a state-of-the-art unsupervised approach to evaluate metrics. For this purpose, we use a labeled, synthetic dataset with topologically complex designs. From our results, we conclude that Pointcloud Autoencoder is promising in encoding geometric features and developing a comprehensive design exploration method."

    For each dataset, shapes/designs are saved as surface mesh files (extension: stl) and point cloud files (extension: ply) in the folders "stls" and "plys" respectively. A brief description of the 7 different datasets is in the following table. For each dataset, the designs are named using numbers starting from 0, e.g., “0.stl, 1.stl, …, 19.stl” in the folder for the surface mesh files. Some of the datasets are labeled, i.e., each design belongs to a class. In a labeled dataset, all classes have the same number of designs, and the designs are named in the order of their class. For example, a labeled dataset with 4 designs and 2 classes contains files whose names start with {0, 1, 2, 3} where the designs {0, 1} belong to class 1, and {2, 3} belong to class 2.

    Dataset name                      Directory name               Number of designs   Number of classes
    Beam-rotation                     "rotate_beam"                20                  None
    Beam-elongation                   "elongate_beam"              20                  None
    Beam-translation                  "move_beam"                  20                  None
    Three cube trusses                "three_cube_truss"           150                 6
    Single cube trusses               "single_cube_truss"          275                 11
    Random topologies                 "three_cube_truss_random"    1000                50
    Topologically optimized designs   "cube_opt_shapes"            1500                None
  19. d

    Data from: Mountaintop removal mining alters stream salamander population...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Sep 26, 2018
    Cite
    Steven J. Price; Sara Beth Freytag; Simon J. Bonner; Andrea N. Drayer; Brenee' L. Muncy; Jacob M. Hutton; Christopher D. Barton (2018). Mountaintop removal mining alters stream salamander population dynamics [Dataset]. http://doi.org/10.5061/dryad.7278k2n
    Available download formats: zip
    Dataset updated
    Sep 26, 2018
    Dataset provided by
    Dryad
    Authors
    Steven J. Price; Sara Beth Freytag; Simon J. Bonner; Andrea N. Drayer; Brenee' L. Muncy; Jacob M. Hutton; Christopher D. Barton
    Time period covered
    Sep 25, 2018
    Description

    Aim: Population dynamics are often tightly linked to the condition of the landscape. Focusing on a landscape impacted by mountaintop removal coal mining (MTR), we ask the following questions: (1) How does MTR influence vital rates including occupancy, colonization and persistence probabilities, and conditional abundance of stream salamander species and life stages? (2) Do species and life stages respond similarly to MTR mining, or is there significant variation among species and life stages?

    Location: Freshwater and terrestrial habitats in Central Appalachia (South‐eastern Kentucky, USA).

    Methods: We conducted salamander counts for three consecutive years in 23 headwater stream reaches in forested or previously mined landscapes. We used a hierarchical, N‐mixture model with dynamic occupancy to calculate species‐ and life stage‐specific occupancy, colonization and persistence rates, and abundance given occupancy. We examined the coefficients of the hierarchical priors to determine populat...

  20. Model-based multifactor dimensionality reduction algorithm for assessing the...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Parisa Riahi; Anoshirvan Kazemnejad; Shayan Mostafaei; Akira Meguro; Nobuhisa Mizuki; Amir Ashraf-Ganjouei; Ali Javinani; Seyedeh Tahereh Faezi; Farhad Shahram; Mahdi Mahmoudi (2023). Model-based multifactor dimensionality reduction algorithm for assessing the main and interaction effects of 11 ERAP1 SNPs on Behçet’s disease risk (748 Iranian BD patients and 776 healthy individuals). [Dataset]. http://doi.org/10.1371/journal.pone.0227997.t002
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Parisa Riahi; Anoshirvan Kazemnejad; Shayan Mostafaei; Akira Meguro; Nobuhisa Mizuki; Amir Ashraf-Ganjouei; Ali Javinani; Seyedeh Tahereh Faezi; Farhad Shahram; Mahdi Mahmoudi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Iran
    Description

    Model-based multifactor dimensionality reduction algorithm for assessing the main and interaction effects of 11 ERAP1 SNPs on Behçet’s disease risk (748 Iranian BD patients and 776 healthy individuals).

Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1

Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.

Dataset updated
Nov 14, 2018
Authors
Scott Herford
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
North Carolina
Description

The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which finds the linear transformation that reduces the number of dimensions with minimum loss of information (in terms of variance retained). Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a distance metric, and in high dimensions Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information. From the creating new features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering and, in turn, the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance. 
For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary greatly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, our results were not much better than random when clustering was applied in data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, as that from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
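
The run-to-run stability check described above can be illustrated with a minimal sketch: cluster the same points with different random seeds and compare the resulting partitions with the Rand index (1.0 means identical partitions). This is not the author's code. It is a small pure-Python k-means written for illustration, and the names `kmeans_labels` and `rand_index` are hypothetical.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, seed, n_iter=20):
    """Plain k-means; the seed only controls which points start as centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Comparing `kmeans_labels(points, k, seed=s)` for several values of `s` with `rand_index` gives a rough stability score: values near 1.0 suggest the data cluster consistently, while low values match the instability reported above.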
