License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. In our project, using clustering prior to classification did not improve performance much. One likely reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality-reduction perspective: clustering differs from Principal Component Analysis, which finds the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, because clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information.

From the new-features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the quality of the clustering, and therefore the performance of the classification. If the subset of features we cluster on is well suited to it, clustering may increase the overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may improve.

We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary strongly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. In practice, the results we obtained by adding clustering to the preprocessing were not much better than random. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
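To make the workflow above concrete, here is a minimal sketch (not the project's actual code or data) of using k-means cluster labels as an extra feature before a classifier, and of re-running the clustering with different seeds to judge stability:

```python
# Minimal sketch of clustering-as-preprocessing before classification.
# The data, number of clusters, and classifier are illustrative assumptions,
# not the project's actual setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10, random_state=0)

scores = []
for seed in range(5):
    # The original project did not fix random_state; here we vary the seed
    # explicitly to see how stable the cluster-derived feature is.
    km = KMeans(n_clusters=8, n_init=10, random_state=seed)
    cluster_id = km.fit_predict(X).reshape(-1, 1)
    X_aug = np.hstack([X, cluster_id])          # cluster label as an extra feature
    clf = RandomForestClassifier(random_state=0)
    scores.append(cross_val_score(clf, X_aug, y, cv=5).mean())

print("accuracy across clustering seeds:", np.round(scores, 3))
# A large spread across seeds suggests the data does not cluster well with this method.
```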
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reduction of size of rule sets for supervised discretisation [%].
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space that are decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques to reduce multi-dimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both approaches take advantage of the different and deep relationships that can exist between features of compounds, providing classification of compounds based on such features or, in the case of visualization methods, uncovering underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and representation of chemical space, and the use of different machine learning methods, separately and together, to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forest (RF), support vector machine (SVM), artificial neural network (ANN), k-nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to drugs or other characteristics, such as the specific or multiple disease category(ies) or organ(s) of action of a molecule.
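AL Boost itself is the authors' method; purely as an illustration of the underlying idea, a generic soft-voting ensemble over the same classifier families can be sketched in scikit-learn (data and hyperparameters are placeholders, not the published model):

```python
# Generic soft-voting ensemble over the classifier families named above.
# Illustrative sketch only; not the authors' AL Boost implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("svm", SVC(probability=True)),        # probability=True is needed for soft voting
        ("ann", MLPClassifier(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```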
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts provide arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The SLR of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method, except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us to choose the number of clusters to use when tuning the explainable models. (A minimal PCA-and-elbow sketch is shown below.)
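As referenced above, a minimal sketch of this step might look as follows, assuming the 35 nominal features are one-hot encoded first; the data, encoder, and range of k are illustrative, not the SLR database:

```python
# Illustrative sketch of the Transformation stage: PCA to 2 components for
# visualization, plus a simple elbow-style scan over k to pick the number of
# clusters. The feature matrix here is a random placeholder.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
papers = rng.integers(0, 4, size=(120, 35))            # placeholder nominal features
X = OneHotEncoder().fit_transform(papers).toarray()    # one-hot encode for PCA / k-means

coords = PCA(n_components=2).fit_transform(X)          # 2-D coordinates for plotting

inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                       # within-cluster variance
# The k where the drop in inertia levels off is a candidate number of clusters.
print(np.round(np.diff(inertias), 1))
```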
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships in the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the discovered knowledge to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).
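The published pipelines were built in RapidMiner; as a rough Python equivalent of the association-rule task only, one could do something like the following (the feature flags, rows, and thresholds are hypothetical):

```python
# Illustrative association-rule mining with mlxtend; not the RapidMiner pipeline.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded paper attributes (placeholder rows).
df = pd.DataFrame({
    "supervised_learning": [1, 1, 1, 0, 1, 1],
    "irreproducible":      [1, 1, 0, 0, 1, 1],
    "uses_code_data":      [1, 0, 1, 1, 0, 1],
}).astype(bool)

frequent = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```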
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise to a Conclusion means that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
As per our latest research, the global Vision-Based ADAS Data Mining market size stood at USD 2.84 billion in 2024 and is anticipated to reach USD 12.67 billion by 2033, growing at a robust CAGR of 17.8% during the forecast period. The primary growth factor driving this market is the rapid adoption of advanced driver assistance systems (ADAS) in modern vehicles, propelled by stringent safety regulations and the surging demand for enhanced driving experiences globally.
One of the most significant growth drivers for the Vision-Based ADAS Data Mining market is the increasing focus on vehicular safety and the reduction of road accidents. Governments worldwide are imposing stricter mandates on automotive OEMs to integrate advanced safety features such as lane departure warning, traffic sign recognition, and collision avoidance in all new vehicles. The integration of vision-based ADAS, powered by sophisticated data mining techniques, is enabling real-time analysis of driving environments, which significantly reduces the likelihood of human error and enhances overall road safety. As a result, automakers are investing heavily in research and development to improve the accuracy and reliability of these systems, further fueling market expansion.
Another pivotal factor contributing to market growth is the rapid technological advancements in machine learning, deep learning, and computer vision. These technologies are at the core of vision-based ADAS data mining, enabling systems to process vast amounts of visual data from cameras and sensors in real-time. The evolution of high-performance hardware and the proliferation of cloud-based analytics platforms have empowered ADAS solutions to become more intelligent, adaptive, and scalable. This technological leap has made it feasible to deploy sophisticated data mining algorithms even in cost-sensitive vehicle segments, accelerating the democratization of advanced safety features across a broader range of vehicles.
Additionally, the growing consumer inclination towards connected and autonomous vehicles is acting as a catalyst for the Vision-Based ADAS Data Mining market. As automotive manufacturers race to develop next-generation vehicles with semi-autonomous and autonomous capabilities, the demand for robust data mining solutions that can interpret complex traffic scenarios and driver behaviors is escalating. The ability of vision-based ADAS systems to seamlessly integrate with other in-vehicle technologies, such as infotainment and telematics, is further enhancing their value proposition. This convergence is not only improving the overall driving experience but also opening up new avenues for data-driven services and applications within the automotive ecosystem.
From a regional perspective, Asia Pacific is emerging as the most dynamic market for Vision-Based ADAS Data Mining, driven by the rapid expansion of the automotive sector in countries like China, Japan, and South Korea. The region's strong manufacturing base, coupled with supportive government initiatives aimed at promoting vehicle safety, is fostering a fertile environment for the adoption of advanced ADAS technologies. North America and Europe, on the other hand, continue to lead in terms of technological innovation and regulatory enforcement, ensuring that these markets remain at the forefront of global market share and revenue generation.
The Vision-Based ADAS Data Mining market is segmented by component into hardware, software, and services, each playing a distinct yet interconnected role in enabling advanced driver assistance functionalities. Hardware components such as cameras, sensors, and onboard processing units form the backbone of vision-based ADAS systems. These hardware elements are responsible for capturing and relaying vast amounts of visual and environmental data, which are subsequently processed and analyzed to generate actionable insights for drivers. The continuous evolution of high-resolution cameras and advanced sensor technologies is enhancing the precision and reliability of data collection, thereby improving the overall effectiveness of ADAS solutions.
Software is the core enabler of data mining within vision-based ADAS systems. This segment includes sophisticated algorithms for image processing, pattern recognition, and machine learning, which interpret raw sensor data to id
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is added to the names of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0. (A generic PPMI+SVD sketch is given below.)
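The models above were produced with LSCDetection; the following is only a generic sketch of a PPMI+SVD pipeline (no context-distribution smoothing, placeholder corpus), included to illustrate what these count-based spaces compute:

```python
# Generic PPMI + SVD sketch (not the LSCDetection implementation): build a
# word-context co-occurrence matrix, apply positive PMI, then reduce with SVD.
import numpy as np
from collections import Counter

tokens = "replace this string with the lemmatised corpus as one long token stream".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
window = 5                                   # matches the window size listed above

pair_counts = Counter()
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pair_counts[(vocab[w], vocab[tokens[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (wi, ci), c in pair_counts.items():
    M[wi, ci] = c

total = M.sum()
p_w = M.sum(axis=1, keepdims=True) / total   # word marginals
p_c = M.sum(axis=0, keepdims=True) / total   # context marginals
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((M / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)   # positive PMI

U, S, _ = np.linalg.svd(ppmi)                # dense SVD; fine for a toy vocabulary
dims = min(300, len(S))                      # the released models use 300 dimensions
embeddings = U[:, :dims] * S[:dims]          # one row vector per vocabulary item
```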
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set. (A minimal Gensim training call for both configurations is sketched below.)
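A minimal Gensim training call matching the hyperparameters listed above might look like this (the corpus shown is a placeholder; recent Gensim releases use vector_size where older ones used size):

```python
# Minimal sketch of training the ALP Word2Vec models with Gensim, using the
# hyperparameters listed above. Replace `sentences` with the tokenised,
# stopword-filtered Diorisis corpus.
from gensim.models import Word2Vec

sentences = [["θεός", "λόγος", "ἄνθρωπος"]] * 10   # placeholder corpus

cbow = Word2Vec(sentences, vector_size=30, window=5, min_count=5,
                negative=20, sg=0)                 # a. CBOW
sgns = Word2Vec(sentences, vector_size=30, window=5, min_count=5,
                negative=20, sg=1)                 # b. Skipgram with negative sampling

# e.g. nearest neighbours of a word in the vocabulary:
print(sgns.wv.most_similar("λόγος", topn=5))
```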
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series (Theory and Applications of Natural Language Processing), 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The objective of this work was to apply the random forest (RF) algorithm to the modelling of the aboveground carbon (AGC) stock of a tropical forest, testing three feature selection procedures: recursive removal and the uniobjective and multiobjective genetic algorithms (GAs). The database used covered 1,007 plots sampled in the Rio Grande watershed, in the state of Minas Gerais, Brazil, and 114 environmental variables (climatic, edaphic, geographic, terrain, and spectral). The best feature selection strategy, RF with the multiobjective GA, reached the lowest root-square error of 17.75 Mg ha-1 with only four spectral variables (normalized difference moisture index, normalized burn ratio 2 correlation texture, tree cover, and latent heat flux), which represents a reduction of 96.5% in the size of the database. Feature selection strategies help to obtain a better RF performance by improving the accuracy and reducing the volume of the data. Although recursive removal and the multiobjective GA showed a similar performance as feature selection strategies, the latter yields the smallest subset of variables with the highest accuracy. The findings of this study highlight the importance of using near-infrared and short wavelengths, and derived vegetation indices, for the remote-sensing-based estimation of AGC. The MODIS products show a significant relationship with the AGC stock and should be further explored by the scientific community for the modelling of this stock.
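For illustration only, here is a sketch of random-forest AGC modelling with recursive feature elimination; the genetic-algorithm selection used in the study is not reproduced, and the data and settings are synthetic placeholders:

```python
# Illustrative RF regression with recursive feature elimination (RFE).
# Synthetic data stands in for the 1,007 plots and 114 environmental variables.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1007, n_features=114, n_informative=10,
                       noise=10, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
selector = RFE(rf, n_features_to_select=4, step=5).fit(X, y)   # keep 4 predictors
X_sel = selector.transform(X)

rmse = -cross_val_score(rf, X_sel, y, cv=5,
                        scoring="neg_root_mean_squared_error").mean()
print("RMSE with 4 selected features:", round(rmse, 2))
```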
The land remote sensing community has a long history of using supervised and unsupervised methods to help interpret and analyze remote sensing data sets. Until relatively recently, most remote sensing studies have used fairly conventional image processing and pattern recognition methodologies. In the past decade, NASA has launched a series of remote sensing missions known as the Earth Observing System (EOS). The data sets acquired by EOS instruments provide an extremely rich source of information related to the properties and dynamics of the Earth’s terrestrial ecosystems. However, these data are also characterized by large volumes and complex spectral, spatial and temporal attributes. Because of the volume and complexity of EOS data sets, efficient and effective analysis of them presents significant challenges that are difficult to address using conventional remote sensing approaches. In this paper we discuss results from applying a variety of different data mining approaches to global remote sensing data sets. Specifically, we describe three main problem domains and sets of analyses: (1) supervised classification of global land cover using data from NASA’s Moderate Resolution Imaging Spectroradiometer; (2) the use of linear and non-linear cluster and dimensionality reduction methods to examine coupled climate-vegetation dynamics using a twenty year time series of data from the Advanced Very High Resolution Radiometer; and (3) the use of functional models, non-parametric clustering, and mixture models to help interpret and understand the feature space and class structure of high dimensional remote sensing data sets. The paper will not focus on specific details of algorithms. Instead we describe key results, successes, and lessons learned from ten years of research focusing on the use of data mining and machine learning methods for remote sensing and Earth science problems.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Behçet’s disease (BD) is a chronic multi-systemic vasculitis with a considerable prevalence in Asian countries. There are many genes associated with a higher risk of developing BD, one of which is endoplasmic reticulum aminopeptidase-1 (ERAP1). In this study, we aimed to investigate the interactions of ERAP1 single nucleotide polymorphisms (SNPs) using a novel data mining method called Model-based multifactor dimensionality reduction (MB-MDR).
Methods: We included 748 BD patients and 776 healthy controls. A peripheral blood sample was collected, and eleven SNPs were assessed. Furthermore, we applied the MB-MDR method to evaluate the interactions of ERAP1 gene polymorphisms.
Results: The TT genotype of rs1065407 had a synergistic effect on BD susceptibility, considering the significant main effect. In the second order of interactions, the CC genotype of rs2287987 and the GG genotype of rs1065407 had the most prominent synergistic effect (β = 12.74). The mentioned genotypes also had significant interactions with the CC genotype of rs26653 and the TT genotype of rs30187 in the third order (β = 12.74 and β = 12.73, respectively).
Conclusion: To the best of our knowledge, this is the first study investigating the interaction of a particular gene’s SNPs in BD patients by applying a novel data mining method. However, future studies investigating the interactions of various genes could clarify this issue.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This article aims to analyze the factors that determine the self-perception of oral health of Brazilians, based on a multidimensional methodology. This is a cross-sectional study with data from a national survey. A household interview was conducted with a sample of 60,202 adults. Self-perception of oral health was considered the outcome variable, and sociodemographic characteristics, self-care and oral health condition, use of dental services, general health, and work condition were the independent variables. The dimensionality reduction test was used, and the variables that showed a relationship were submitted to logistic regression. A negative oral health self-perception was related to difficulty feeding, negative evaluation of the last dental appointment, negative self-perception of general health condition, not flossing, upper dental loss, and the reason for the last dental appointment. The multidimensional methodological basis made it possible to design explanatory models for the self-perception of oral health of Brazilian adults, and these results should be considered in the implementation, evaluation, and qualification of the oral health network.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset provides a comprehensive study of age prediction using machine learning based on multi-omics markers. It contains data from twenty-one different genes (RPA2_3, ZYG11A_4, F5_2, HOXC4_1, NKIRAS2_2, MEIS1_1, SAMD10_2, GRM2_9, TRIM59_5, LDB2
This dataset collects multi-omics information from individuals to predict their age. It includes markers from various cell subtypes, such as RPA2_3, ZYG11A_4, F5_2, HOXC4_1, NKIRAS2_2, MEIS1_1, SAMD10_2, GRM2_9, TRIM59_5, LDB2_3, ELOVL2_6, DDO_1, and KLF14_2.
Using this dataset effectively requires knowledge of both genetics and machine learning. For the former, the user must understand the data mining approaches used in gene expression analysis; for the latter, they must be familiar with techniques such as regression methods and decision tree methods.
To get started with this dataset, it is advised that users familiarize themselves with basic concepts such as multivariate analysis (e.g., PCA) and feature selection algorithms, which can make dimensionality reduction easier, before attempting more sophisticated methodology such as neural networks or support vector machines. These later techniques can provide more accurate predictions when properly tuned, but involve a steeper learning curve than simpler models due to their complexity. Additionally, use hyperparameter optimization processes, which allow users to test multiple models quickly and see which approach yields the best results (given the user’s computing resources). Last but not least, once a good model has been identified, save it for future use, either by serializing it or by saving its weights; don’t forget!
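A hedged sketch of that workflow, with hypothetical file and column names (the released file listed below is test_rows.csv; a training split with an age column is assumed here purely for illustration):

```python
# Sketch of the suggested workflow: dimensionality reduction, a simple model
# with hyperparameter search, and serialization for reuse.
# "train_rows.csv" and the "age" column are assumptions, not confirmed files.
import joblib
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

df = pd.read_csv("train_rows.csv")          # hypothetical training split
X, y = df.drop(columns=["age"]), df["age"]  # assumes an 'age' target column

pipe = Pipeline([
    ("pca", PCA(n_components=10)),
    ("model", RandomForestRegressor(random_state=0)),
])
search = GridSearchCV(pipe, {"model__n_estimators": [100, 300],
                             "pca__n_components": [5, 10]}, cv=5)
search.fit(X, y)

joblib.dump(search.best_estimator_, "age_model.joblib")   # save the model for later
```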
- Analyzing the correlation between gene expression levels and age to identify key biomarkers associated with certain life stages.
- Building machine learning models that can predict a person's age from their multi-omics data.
- Identifying potential drug targets based on changes in gene expression associated with age-related diseases
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test_rows.csv

| Column name | Description |
|:------------|:------------|
| RPA2_3 | Expression level of the RPA2 gene in the third sample. (Numeric) |
| ZYG11A_4 | Expression level of the ZYG11A gene in the fourth sample. (Numeric) |
| F5_2 | Expression level of the F5 gene in the second sample. (Numeric) |
| HOXC4_1 | Expression level of the HOXC4 gene in the first sample. (Numeric) |
| NKIRAS2_2 | Expression level of the NKIRAS2 gene in the second sample. (Numeric) |
| MEIS1_1 | Expression level of the MEIS1 gene in the first sample. (Numeric) |
| SAMD10_2 | Expression level of the SAMD10 gene in the second sample. (Numeric) |
| GRM2_9 | Expression level of the GRM2 gene in the ninth sample. (Numeric) |
| TRIM59_5 | Expression level of the TRIM59 gene in the fifth sample. (Numeric) |
| LDB2_3 | Expression level of the LDB2 gene in the third sample. (Numeric) |
| ELOVL2_6 | Expression level of the ELOVL2 gene in the sixth sample. (Numeric) |
| DDO_1 | Expression level of the DDO gene in the first sample. (Numeric) |
| KLF14_2 | Expression level of the KLF14 gene in the second sample. (Numeric) |
License: Public Domain, https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for All Employees: Mining and Logging: Coal Mining in Wyoming (SMU56000001021210001SA) from Jan 1990 to Aug 2025 about coal, WY, mining, employment, and USA.
Our team chose the A04 topic in the service outsourcing competition, which is a big data mining topic. To facilitate our team’s code management, we use the Kaggle platform. If there is any violation of the original dataset’s usage regulations, please point it out to us!
There are 20,972 unique values for pax_name and 21,075 for pax_passport. Both fields are personal identifiers of an air passenger. The difference between the two counts does not mean the dataset is wrong: one explanation is that several passengers share the same name, which makes the number of unique pax_name values smaller than the number of unique pax_passport values.
The data comes from “Competition data and instructions.xlsx” provided by Neusoft.
Mountaintop removal coal mining (MTR) has been a major source of landscape change in the Central Appalachians of the United States (US). Changes in stream hydrology, channel geomorphology and water quality caused by MTR coal mining can lead to severe impairment of stream ecological integrity. The objective of the Clean Water Act (CWA) is to restore and maintain the ecological integrity of the Nation’s waters. Sensitive, readily measured indicators of ecosystem structure and function are needed for the assessment of stream ecological integrity. Most such assessments rely on structural indicators; inclusion of functional indicators could make these assessments more holistic and effective. The goals of this study were: (1) test the efficacy of selected carbon (C) and nitrogen (N) cycling and microbial structural and functional indicators for assessing MTR coal mining impacts on streams; (2) determine whether indicators respond to impacts in a predictable manner and (3) determine if functional indicators are less likely to change than are structural indicators in response to stressors associated with MTR coal mining.
Anomaly Detection Market Size 2025-2029
The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.
Major Market Trends & Insights
North America dominated the market and accounted for a 43% growth during the forecast period.
By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
By Component - Solution segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 173.26 million
Market Future Opportunities: USD 4441.70 million
CAGR from 2024 to 2029: 14.4%
Market Summary
Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity frauds necessitates the need for robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
What will be the Size of the Anomaly Detection Market during the forecast period?
How is the Anomaly Detection Market Segmented?
The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
Cloud
On-premises
Component
Solution
Services
End-user
BFSI
IT and telecom
Retail and e-commerce
Manufacturing
Others
Technology
Big data analytics
AI and ML
Data mining and business intelligence
Geography
North America
US
Canada
Mexico
Europe
France
Germany
Spain
UK
APAC
China
India
Japan
Rest of World (ROW)
By Deployment Insights
The cloud segment is estimated to witness significant growth during the forecast period.
The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.
This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered with flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.
Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.
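As a concrete illustration of one of the outlier-detection methods mentioned above (not any specific vendor's product), a minimal isolation-forest example:

```python
# Generic outlier-detection sketch with an isolation forest.
# Data and contamination rate are illustrative placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 3))           # e.g. sensor/log-derived features
anomalies = rng.normal(6, 1, size=(10, 3))
X = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = model.predict(X)                           # -1 = anomaly, 1 = normal
print("flagged anomalies:", int((labels == -1).sum()))
```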
Prescriptive Analytics Market Size 2025-2029
The prescriptive analytics market size is valued to increase by USD 10.96 billion, at a CAGR of 23.3% from 2024 to 2029. Rising demand for predictive analytics will drive the prescriptive analytics market.
Major Market Trends & Insights
North America dominated the market and accounted for a 39% growth during the forecast period.
By Solution - Services segment was valued at USD 3 billion in 2023
By Deployment - Cloud-based segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 359.55 million
Market Future Opportunities: USD 10962.00 million
CAGR from 2024 to 2029: 23.3%
Market Summary
Prescriptive analytics, an advanced form of business intelligence, is gaining significant traction in today's data-driven business landscape. This analytical approach goes beyond traditional business intelligence and predictive analytics by providing actionable recommendations to optimize business processes and enhance operational efficiency. The market's growth is fueled by the increasing availability of real-time data, the rise of machine learning algorithms, and the growing demand for data-driven decision-making. One area where prescriptive analytics is making a significant impact is in supply chain optimization. For instance, a manufacturing company can use prescriptive analytics to analyze historical data and real-time market trends to optimize production schedules, minimize inventory costs, and improve delivery times.
In a recent study, a leading manufacturing firm implemented prescriptive analytics and achieved a 15% reduction in inventory holding costs and a 12% improvement in on-time delivery rates. However, the adoption of prescriptive analytics is not without challenges. Data privacy and regulatory compliance are major concerns, particularly in industries such as healthcare and finance. Companies must ensure that they have robust data security measures in place to protect sensitive customer information and comply with regulations such as HIPAA and GDPR. Despite these challenges, the benefits of prescriptive analytics far outweigh the costs, making it an essential tool for businesses looking to gain a competitive edge in their respective markets.
What will be the Size of the Prescriptive Analytics Market during the forecast period?
How is the Prescriptive Analytics Market Segmented?
The prescriptive analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Solution
Services
Product
Deployment
Cloud-based
On-premises
Sector
Large enterprises
Small and medium-sized enterprises (SMEs)
Geography
North America
US
Canada
Mexico
Europe
France
Germany
Italy
UK
APAC
China
India
Japan
Rest of World (ROW)
By Solution Insights
The services segment is estimated to witness significant growth during the forecast period.
In 2024, the market continues to evolve, becoming a pivotal force in data-driven decision-making across industries. With a projected growth of 15.2% annually, this market is transforming business landscapes by delivering actionable recommendations that align with strategic objectives. From enhancing customer satisfaction to optimizing operational efficiency and reducing costs, prescriptive analytics services are increasingly indispensable. Advanced optimization engines and AI-driven models now handle intricate decision variables, constraints, and trade-offs in real time. This real-time capability supports complex decision-making scenarios across strategic, tactical, and operational levels. Industries like healthcare, retail, manufacturing, and logistics are harnessing prescriptive analytics in unique ways.
Monte Carlo simulation, scenario planning, and neural networks are just a few techniques used to optimize supply chain operations. Data visualization dashboards, what-if analysis, and natural language processing facilitate better understanding of complex data. Reinforcement learning, time series forecasting, and inventory management are essential components of prescriptive modeling, enabling AI-driven recommendations. Decision support systems, dynamic programming, causal inference, and multi-objective optimization are integral to the decision-making process. Machine learning models, statistical modeling, and optimization algorithms power these advanced systems. Real-time analytics, risk assessment modeling, and linear programming are crucial for managing uncertainty and mitigating risks. Data mining techniques and expert systems provide valuable insights, while c
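As a toy illustration of the linear-programming style of prescriptive optimization mentioned above (all numbers invented):

```python
# Tiny prescriptive-optimization example: linear programming with SciPy to
# choose production quantities that maximize profit under capacity constraints.
from scipy.optimize import linprog

# maximize 40*x1 + 30*x2  ->  minimize -40*x1 - 30*x2
c = [-40, -30]
A_ub = [[2, 1],    # machine-hours per unit of each product
        [1, 2]]    # labour-hours per unit of each product
b_ub = [100, 80]   # available machine / labour hours
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal plan:", res.x, "profit:", -res.fun)
```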
Generative AI In Data Analytics Market Size 2025-2029
The generative ai in data analytics market size is valued to increase by USD 4.62 billion, at a CAGR of 35.5% from 2024 to 2029. Democratization of data analytics and increased accessibility will drive the generative ai in data analytics market.
Market Insights
North America dominated the market and accounted for a 37% growth during the 2025-2029 forecast period.
By Deployment - Cloud-based segment was valued at USD 510.60 billion in 2023
By Technology - Machine learning segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 621.84 million
Market Future Opportunities 2024: USD 4624.00 million
CAGR from 2024 to 2029: 35.5%
Market Summary
The market is experiencing significant growth as businesses worldwide seek to unlock new insights from their data through advanced technologies. This trend is driven by the democratization of data analytics and increased accessibility of AI models, which are now available in domain-specific and enterprise-tuned versions. Generative AI, a subset of artificial intelligence, uses deep learning algorithms to create new data based on existing data sets. This capability is particularly valuable in data analytics, where it can be used to generate predictions, recommendations, and even new data points. One real-world business scenario where generative AI is making a significant impact is in supply chain optimization. In this context, generative AI models can analyze historical data and generate forecasts for demand, inventory levels, and production schedules. This enables businesses to optimize their supply chain operations, reduce costs, and improve customer satisfaction. However, the adoption of generative AI in data analytics also presents challenges, particularly around data privacy, security, and governance. As businesses continue to generate and analyze increasingly large volumes of data, ensuring that it is protected and used in compliance with regulations is paramount. Despite these challenges, the benefits of generative AI in data analytics are clear, and its use is set to grow as businesses seek to gain a competitive edge through data-driven insights.
What will be the size of the Generative AI In Data Analytics Market during the forecast period?
Generative AI, a subset of artificial intelligence, is revolutionizing data analytics by automating data processing and analysis, enabling businesses to derive valuable insights faster and more accurately. Synthetic data generation, a key application of generative AI, allows for the creation of large, realistic datasets, addressing the challenge of insufficient data in analytics. Parallel processing methods and high-performance computing power the rapid analysis of vast datasets. Automated machine learning and hyperparameter optimization streamline model development, while model monitoring systems ensure continuous model performance. Real-time data processing and scalable data solutions facilitate data-driven decision-making, enabling businesses to respond swiftly to market trends. One significant trend in the market is the integration of AI-powered insights into business operations. For instance, probabilistic graphical models and backpropagation techniques are used to predict customer churn and optimize marketing strategies. Ensemble learning methods and transfer learning techniques enhance predictive analytics, leading to improved customer segmentation and targeted marketing. According to recent studies, businesses have achieved a 30% reduction in processing time and a 25% increase in predictive accuracy by implementing generative AI in their data analytics processes. This translates to substantial cost savings and improved operational efficiency. By embracing this technology, businesses can gain a competitive edge, making informed decisions with greater accuracy and agility.
Unpacking the Generative AI In Data Analytics Market Landscape
In the dynamic realm of data analytics, Generative AI algorithms have emerged as a game-changer, revolutionizing data processing and insights generation. Compared to traditional data mining techniques, Generative AI models can create new data points that mirror the original dataset, enabling more comprehensive data exploration and analysis (Source: Gartner). This innovation leads to a 30% increase in identified patterns and trends, resulting in improved ROI and enhanced business decision-making (IDC).
Data security protocols are paramount in this context, with Classification Algorithms and Clustering Algorithms ensuring data privacy and compliance alignment. Machine Learning Pipelines and Deep Learning Frameworks facilitate seamless integration with Predictive Modeling Tools and Automated Report Generation on Cloud
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the dataset and accompanying scripts used in the study titled "A Machine Learning-Based Modeling Approach for Dye Removal Using Modified Natural Adsorbents."
The dataset used in this investigation includes maximum adsorption capacity (qe) and removal percentage (%) values obtained by removing MB dye from wastewater using different adsorbent types. These adsorbents were modified by incorporating levulinic acid (LA) into almond, walnut, and apricot kernel powders. The dataset also covers pH, adsorbent dose, concentration, temperature, and time values.
The output variables used for modeling are maximum adsorption capacity (qe) and removal percentage (%).
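A hedged sketch of the kind of model this dataset supports; the CSV file name and column names below are assumptions for illustration, not the repository's actual schema:

```python
# Sketch: predict qe and removal (%) from pH, dose, concentration, temperature,
# and time. File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("dye_removal.csv")  # hypothetical file name
X = df[["pH", "adsorbent_dose", "concentration", "temperature", "time"]]
y = df[["qe", "removal_percent"]]    # multi-output regression targets

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, model.predict(X_te)))
```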
NOTE: This first version contains a labeling error in the last two column headers. Please use the (updated) version v2 for accurate data.
This research received no external funding. However, the datasets used in this study were obtained from the peer-reviewed publication by Süheyla Kocaman (2020), titled "Removal of methylene blue dye from aqueous solutions by adsorption on levulinic acid-modified natural shells", published in Environmental Technology (Vol. 22, pp. 885–895). DOI: https://doi.org/10.1080/15226514.2020.1736512. The original data was reused in this work for machine learning-based modeling purposes with proper attribution.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell expression analysis is an effective tool for studying the dynamics of cell population profiles. However, the majority of statistical methods are applied to individual profiles and the methods for comparing multiple profiles simultaneously are limited. In this study, we propose a nonparametric statistical method, called Decomposition into Extended Exponential Family (DEEF), that embeds a set of single-cell expression profiles of several markers into a low-dimensional space and identifies the principal distributions that describe their heterogeneity. We demonstrate that DEEF can appropriately decompose and embed sets of theoretical probability distributions. We then apply DEEF to a cytometry dataset to examine the effects of epidermal growth factor stimulation on an adult human mammary gland. It is shown that DEEF can describe the complex dynamics of cell population profiles using two parameters and visualize them as a trajectory. The two parameters identified the principal patterns of the cell population profile without prior biological assumptions. As a further application, we perform a dimensionality reduction and a time series reconstruction. DEEF can reconstruct the distributions based on the top coordinates, which enables the creation of an artificial dataset based on an actual single-cell expression dataset. Using the coordinate system assigned by DEEF, it is possible to analyze the relationship between the attributes of the distribution sample and the features or shape of the distribution using conventional data mining methods.