License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The objective of this work is to improve the quality of the information in the CubaCiencia database of the Institute of Scientific and Technological Information. This database holds bibliographic information covering four segments of science and is the main database of the Library Management System. The methodology applied was based on data mining techniques for the study of large volumes of information, such as decision trees, the correlation matrix, and the 3D scatter plot. The results achieved not only improved the information in the database, but also provided genuinely useful patterns for meeting the proposed objectives.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions, and the mined facts supply arguments that shape the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, which was the product of multiple iterations of data gathering and collection from the inspected literature. Our KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. In fact, we manually engineered these features from the DL4SE papers. The features include venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in missing information exposed by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes supports a better understanding and classification of the papers by the data mining tasks or methods.
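A minimal sketch of this kind of normalization step; only the category names come from the text, while the alias table, raw inputs, and helper function below are hypothetical illustrations:

```python
# Sketch of normalizing free-text metric names into canonical classes,
# with "Other Metrics" as the fallback for unconventional metrics.
CANONICAL_METRICS = [
    "MRR", "ROC or AUC", "BLEU Score", "Accuracy",
    "Precision", "Recall", "F1 Measure",
]

# Hypothetical alias table mapping common variants onto the canonical names.
ALIASES = {
    "mean reciprocal rank": "MRR",
    "auc": "ROC or AUC",
    "roc": "ROC or AUC",
    "bleu": "BLEU Score",
    "f1": "F1 Measure",
    "f-measure": "F1 Measure",
}

def normalize_metric(raw: str) -> str:
    """Map a free-text metric name onto one canonical class."""
    cleaned = raw.strip().lower()
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    for canon in CANONICAL_METRICS:
        if cleaned == canon.lower():
            return canon
    return "Other Metrics"

print(normalize_metric("AUC"))         # → "ROC or AUC"
print(normalize_metric("Top-k hits"))  # → "Other Metrics"
```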
Transformation. In this stage, we applied no data transformation method except for the clustering analysis: we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters exhibiting the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
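As a rough sketch of this stage (with a synthetic random matrix standing in for the real 35-feature paper data, and scikit-learn in place of the RapidMiner pipelines), PCA projects the papers to two components for plotting, and the within-cluster variance curve suggests a cluster count:

```python
# Hedged sketch: PCA for 2-D visualization plus an inertia ("elbow")
# curve to pick the number of clusters. X is synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((120, 35))          # stand-in for 120 papers x 35 features

# Project the feature matrix to 2 components for the scatter plot.
X2 = PCA(n_components=2).fit_transform(X)

# Within-cluster sum of squares for k = 1..8; the point where the
# reduction in variance levels off suggests the cluster count.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
]
```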
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We oriented the goal of the KDD process toward uncovering hidden relationships among the extracted features (correlations and association rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (clustering). A full explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by a reasoning process over the data mining outcomes, which produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful association rules. Rectangles represent both premises and conclusions. An arrow connecting a premise to a conclusion implies that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
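Under these definitions, support and confidence can be computed directly. The sketch below uses hypothetical records with two stand-in attributes ("learning" and "reproducible") rather than the real extracted features:

```python
# Hedged sketch of computing Support and Confidence for an association
# rule over a list of records; the data below are invented examples.
records = [
    {"learning": "Supervised", "reproducible": False},
    {"learning": "Supervised", "reproducible": False},
    {"learning": "Supervised", "reproducible": True},
    {"learning": "Unsupervised", "reproducible": True},
]

def support(rule, data):
    """Fraction of records where premise AND conclusion both hold."""
    premise, conclusion = rule
    hits = [r for r in data if premise(r) and conclusion(r)]
    return len(hits) / len(data)

def confidence(rule, data):
    """Support of the rule divided by the relative frequency of the premise."""
    premise, _ = rule
    prem_freq = sum(1 for r in data if premise(r)) / len(data)
    return support(rule, data) / prem_freq

# Rule: Supervised Learning => irreproducible.
rule = (lambda r: r["learning"] == "Supervised",
        lambda r: not r["reproducible"])
print(support(rule, records))     # 2/4 = 0.5
print(confidence(rule, records))  # 0.5 / 0.75 ≈ 0.667
```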
The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.
The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.
Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.
The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.
From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.
The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
1. Turkish comments for 128 venues on the Foursquare social network platform (binary and ternary classified)
2. Turkish adjectives and polarities
3. Turkish food and drink names
4. All comments, without tagging
5. Venues and liked meals/foods
According to our latest research, the global Knowledge Discovery Platform market size in 2024 stands at USD 17.2 billion, reflecting robust adoption across industries. The market is experiencing a strong growth momentum, with a compound annual growth rate (CAGR) of 18.5% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 89.7 billion. This rapid expansion is primarily driven by escalating data volumes, the imperative for actionable business intelligence, and the proliferation of artificial intelligence and machine learning technologies. As organizations seek to harness the power of big data for competitive advantage, the demand for advanced Knowledge Discovery Platforms continues to surge globally.
One of the principal growth factors propelling the Knowledge Discovery Platform market is the exponential increase in data generated by enterprises, governments, and consumers. The digital transformation wave has resulted in data being produced at an unprecedented rate, from social media interactions to IoT devices, transactional records, and digital documents. Organizations are under mounting pressure to extract meaningful insights from this sea of information to inform strategic decisions, optimize operations, and enhance customer experiences. Knowledge Discovery Platforms, equipped with sophisticated data mining, text analytics, and visualization tools, enable businesses to uncover hidden patterns, trends, and correlations within massive datasets. This capability is particularly vital in sectors such as BFSI, healthcare, and retail, where timely and accurate insights can directly impact profitability and risk management.
Another significant driver is the growing integration of artificial intelligence and machine learning algorithms into Knowledge Discovery Platforms. These intelligent systems automate complex analytical processes, reducing the reliance on manual data exploration and accelerating time-to-insight. Predictive analytics functionalities, for example, empower organizations to anticipate market trends, customer behaviors, and operational risks with greater precision. As AI and ML technologies mature, their seamless incorporation into knowledge discovery workflows enhances the platforms' ability to handle unstructured data, perform sentiment analysis, and support real-time decision-making. The increasing availability of cloud-based solutions further democratizes access, enabling even small and medium enterprises to leverage advanced analytics without heavy upfront investments in infrastructure.
The regulatory landscape and the need for compliance are also fueling the adoption of Knowledge Discovery Platforms. Industries such as banking, healthcare, and government face stringent requirements around data governance, privacy, and reporting. Advanced platforms help organizations maintain compliance by providing traceable, auditable insights and supporting data lineage tracking. Moreover, the rise of explainable AI and transparent analytics has become crucial for organizations seeking to build trust with regulators, partners, and customers. As regulations evolve to address new data privacy and security concerns, the role of robust knowledge discovery solutions in ensuring organizational resilience and accountability becomes even more pronounced.
From a regional perspective, North America leads the market, driven by early technology adoption, a strong presence of leading vendors, and high enterprise IT spending. Europe follows closely, with substantial investments in digital transformation and data-driven initiatives across key sectors. The Asia Pacific region is witnessing the fastest growth, propelled by rapid industrialization, expanding digital infrastructure, and government-led smart initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, supported by increasing awareness of data-driven decision-making and the gradual modernization of business processes. Each region presents unique opportunities and challenges, shaped by local regulatory environments, technological readiness, and industry dynamics.
Data Mining Tools are integral to the functionality of Knowledge Discovery Platforms, offering organizations the ability to process and analyze vast amoun
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Machine learning is transforming the landscape of 2D materials design, particularly by accelerating discovery, optimization, and screening processes. This review delves into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining of the Scopus database, with analysis of citations, keywords, and trends. The review first takes a “macro” scope, in which hundreds of literature reports are computer-analyzed for key insights such as publication year, publication origin, and word co-occurrence, using heat maps and network graphs. The focus is then narrowed to a more specific “micro” scope derived from the “macro” overview, intended to examine machine learning usage in depth. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
ABSTRACT: Sugarcane mills in Brazil collect a vast amount of data relating to production on an annual basis. The analysis of this type of database is complex, especially when factors relating to varieties, climate, detailed management techniques, and edaphic conditions are taken into account. The aim of this paper was to perform a decision tree analysis of a detailed database from a production unit and to evaluate the actionable patterns found in terms of their usefulness for increasing production. The decision tree revealed interpretable patterns relating to sugarcane yield (R2 = 0.617), certain of which were actionable and had been previously studied and reported in the literature. Based on two actionable patterns relating to soil chemistry, interventions that would increase production by almost 2% were suitable for recommendation. The method was successful in reproducing experts' knowledge of the factors that influence sugarcane yield, and the decision trees can support decision-making in the context of production and the formulation of hypotheses for specific experiments.
According to our latest research, the AI in Knowledge Discovery market size reached USD 12.6 billion in 2024 globally, with a robust CAGR of 27.8% expected during the forecast period from 2025 to 2033. By the end of 2033, the market is projected to achieve a value of USD 124.3 billion, reflecting the rapid adoption of artificial intelligence technologies across industries to extract actionable insights from vast and complex datasets. This growth is primarily driven by the increasing demand for advanced analytics, the proliferation of big data, and the need for intelligent decision-making processes in enterprise environments.
The primary growth factor for the AI in Knowledge Discovery market is the exponential increase in data generated by businesses, consumers, and connected devices. Organizations are under immense pressure to leverage this data efficiently to remain competitive, fueling investments in AI-driven knowledge discovery solutions. These solutions enable companies to automate the extraction of patterns, trends, and relationships from structured and unstructured data sources. The integration of AI technologies, such as machine learning, natural language processing, and deep learning, has significantly enhanced the capability of knowledge discovery platforms, allowing for real-time analysis and more accurate predictions. This is particularly evident in sectors such as finance, healthcare, and retail, where the ability to make data-driven decisions rapidly is crucial for success.
Another significant driver is the growing adoption of cloud-based AI solutions, which offer scalability, flexibility, and cost-effectiveness. Cloud deployment models have democratized access to powerful AI tools, making them accessible to small and medium-sized enterprises as well as large corporations. The cloud also facilitates collaboration and integration with other enterprise systems, enabling seamless data flow and improved analytics. As organizations continue to migrate their operations to the cloud, the demand for AI-powered knowledge discovery tools is expected to surge, further accelerating market growth. Additionally, advancements in AI algorithms and the increasing availability of pre-trained models have reduced the barriers to entry, allowing businesses to deploy sophisticated knowledge discovery applications with minimal technical expertise.
The proliferation of AI in knowledge discovery is also being bolstered by the need for regulatory compliance and risk management. Industries such as BFSI and healthcare are subject to stringent regulations that require accurate data analysis and reporting. AI-driven knowledge discovery tools help organizations comply with these regulations by automating data extraction, validation, and reporting processes. Furthermore, the ability to identify anomalies and potential risks in real-time enhances operational efficiency and reduces the likelihood of compliance breaches. This regulatory push, combined with the ongoing digital transformation across industries, is expected to sustain the high growth trajectory of the AI in Knowledge Discovery market over the next decade.
From a regional perspective, North America currently dominates the global market, accounting for the largest share due to the early adoption of AI technologies, a strong presence of leading technology companies, and significant investments in research and development. Europe follows closely, driven by supportive government initiatives and a growing focus on digital innovation. The Asia Pacific region is expected to exhibit the highest CAGR during the forecast period, fueled by rapid economic growth, increasing digitalization, and the rising adoption of AI solutions in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as organizations in these regions begin to recognize the benefits of AI-powered knowledge discovery for business transformation.
The Component segment in the AI in Knowledge Discovery market is categorized into Software, Hardware, and Services, each playing a pivotal role in the deployment and effectiveness of knowledge discovery solutions. The Software segment is the largest contributor, driven by the increasing demand for advanced analytics platforms, machine learning frameworks, and AI-powered data mining tools. These software sol
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Abstract The predictability of epidemiological indicators can help estimate dependent variables, assist in decision-making to support public policies, and explain the scenarios experienced by different countries worldwide. This study aimed to forecast the Human Development Index (HDI) and life expectancy (LE) for Latin American countries for the period of 2015-2020 using data mining techniques. All stages of the process of knowledge discovery in databases were covered. The SMOReg data mining algorithm was used in the models with multivariate time series to make predictions; this algorithm performed the best in the tests developed during the evaluation period. The average HDI and LE for Latin American countries showed an increasing trend in the period evaluated, corresponding to 4.99 ± 3.90% and 2.65 ± 0.06 years, respectively. Multivariate models allow for a greater evaluation of algorithms, thus increasing their accuracy. Data mining techniques have a better predictive quality relative to the most popular technique, Autoregressive Integrated Moving Average (ARIMA). In addition, the predictions suggest that there will be a higher increase in the mean HDI and LE for Latin American countries compared to the mean values for the rest of the world.
According to our latest research, the global Enterprise Knowledge Discovery AI market size reached USD 6.2 billion in 2024, reflecting robust adoption across key industries. The market is expanding at a CAGR of 24.1% and is forecasted to reach USD 48.7 billion by 2033. This strong growth trajectory is primarily driven by the increasing need for intelligent data processing, advanced analytics, and automation solutions across large enterprises and SMEs. As organizations face exponential growth in unstructured and structured data, the demand for sophisticated AI-powered knowledge discovery tools is rising rapidly, enabling businesses to unlock actionable insights and gain a competitive edge.
One of the primary growth factors fueling the Enterprise Knowledge Discovery AI market is the proliferation of big data and the exponential increase in enterprise data volumes. Modern organizations generate and collect massive amounts of information from diverse sources including internal databases, customer interactions, IoT devices, and external digital channels. Traditional data management and search tools are no longer sufficient to sift through these vast data sets efficiently. The integration of AI-driven knowledge discovery solutions provides advanced capabilities such as semantic search, natural language processing, and automated data mining, enabling organizations to extract relevant insights quickly and accurately. This growing need to harness data for strategic decision-making is pushing enterprises to invest heavily in AI-based knowledge discovery platforms.
Another significant growth driver is the rising emphasis on digital transformation initiatives across industries. As enterprises strive to enhance operational efficiency, customer experience, and innovation, AI-powered knowledge discovery tools are becoming essential components of their digital strategies. These solutions enable seamless access to organizational knowledge, facilitate collaboration, and support automation of complex workflows. Furthermore, the increasing adoption of cloud-based deployment models is making advanced AI capabilities more accessible to organizations of all sizes, reducing the barriers to entry and accelerating market growth. The convergence of AI, machine learning, and knowledge management technologies is fostering a new era of intelligent enterprise solutions, further propelling the demand for knowledge discovery AI.
The Enterprise Knowledge Discovery AI market is also benefiting from the growing regulatory and compliance requirements in sectors such as BFSI, healthcare, and government. Organizations are under increasing pressure to manage, secure, and audit their data assets effectively. AI-driven knowledge discovery platforms offer robust features for information governance, risk management, and compliance monitoring, helping enterprises meet stringent regulatory standards. Additionally, the rise of remote work and distributed teams has amplified the need for efficient knowledge sharing and retrieval, making AI-powered solutions indispensable for maintaining productivity and business continuity in a dynamic environment.
Regionally, North America dominates the Enterprise Knowledge Discovery AI market, accounting for the largest revenue share in 2024, followed by Europe and Asia Pacific. The region's leadership is attributed to the presence of major AI technology providers, high digital maturity among enterprises, and strong investment in research and development. Asia Pacific, however, is witnessing the fastest growth rate, driven by rapid digitalization, expanding IT infrastructure, and increasing government support for AI adoption. The market in Europe is characterized by robust regulatory frameworks and a focus on data privacy, which is shaping the deployment of AI-based knowledge discovery solutions. Latin America and the Middle East & Africa are emerging as promising markets, supported by growing enterprise IT spending and digital transformation initiatives.
As enterprises continue to navigate the complexities of digital transformation, the role of a Knowledge Discovery Platform becomes increasingly crucial. These platforms serve as the backbone for organizations seeking to harness the power of AI and data analytics to drive business innovation. By integrating v
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Research in biomedical text mining is starting to produce technology which can make information in the biomedical literature more accessible to bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB, a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Such data cover human, animal, cellular, and other mechanistic evidence from various fields of biomedicine; they are highly varied and therefore difficult to harvest from literature databases by manual means. Our tool automates the process by extracting relevant scientific data from the published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows users to navigate the classified dataset in various ways and to share the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is appended to the names of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
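As a rough illustration of what these count-based models compute, the PPMI weighting can be sketched as follows. This is a toy implementation, not the LSCDetection code: the co-occurrence matrix is invented, but the k (shift) and alpha (context-distribution smoothing) hyperparameters match the values listed above.

```python
import numpy as np

def ppmi(counts, k=1, alpha=0.75):
    """Shifted Positive PMI over a word-by-context co-occurrence matrix,
    with context-distribution smoothing controlled by alpha."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_p = counts.sum(axis=1) / total        # P(word)
    col_p = counts.sum(axis=0) ** alpha       # smoothed context counts
    col_p = col_p / col_p.sum()               # P_alpha(context)
    joint = counts / total                    # P(word, context)
    pmi = np.log2(joint / (row_p[:, None] * col_p[None, :]))
    pmi -= np.log2(k)                         # shift; with k=1 this is a no-op
    return np.maximum(pmi, 0.0)               # keep only positive associations

# toy 2x2 co-occurrence matrix: two words, two contexts
m = ppmi([[8, 2], [2, 8]])
```

With k=1 the shift term vanishes, and alpha=0.75 dampens the probability of frequent contexts so that rare contexts are not over-rewarded.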
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
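The graph-embedding step behind these syntactic models can be pictured as: merge the treebanks into a dependency graph, generate random walks over that graph, and treat the walks as "sentences" for a word-embedding model. Below is a minimal, standard-library-only sketch of the walk-generation stage; the adjacency list and Greek tokens are invented for illustration, and it corresponds to Node2Vec only in the unbiased special case (p=q=1), not to the full biased-walk algorithm.

```python
import random

def random_walks(adj, walk_length=5, walks_per_node=2, seed=0):
    """Generate uniform random walks over a dependency graph; the walks can
    then be fed as sentences to a skip-gram model (Node2Vec with p=q=1)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:              # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# toy undirected dependency adjacency (tokens invented for illustration)
adj = {
    "λέγω": ["ἄνθρωπος", "ὅτι"],
    "ἄνθρωπος": ["λέγω"],
    "ὅτι": ["λέγω"],
}
walks = random_walks(adj)
```

The window=1 hyperparameter above then restricts the downstream embedding model to immediate neighbours within each walk.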
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bianchi, Federico, Valerio Di Carlo, Paolo Nicoli & Matteo Palmonari. 2020. Compass-aligned distributional embeddings for studying semantic differences across corpora. arXiv preprint arXiv:2004.06519.
Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Řehůřek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45-50.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Data information: WISDM (Wireless Sensor Data Mining) smartphone-based sensor data, collected from 36 different users performing six different activities.
Number of examples: 1,098,207
Number of attributes: 6
Missing attribute values: None
Data processing:
1. Replace the nanoseconds with seconds in the timestamp column, and remove the user column, because each user performs the same set of activities.
2. Use the sliding window method to transform the data into sequences, then split each label into training and testing sets, ensuring an 8:2 train/test ratio for each label.
3. Shuffle the order of the labels in both the training and testing sets and interleave them, so that two sequences with the same label are not lined up consecutively.
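Steps 1-2 can be sketched as follows. The window size, step size, and 3-axis zero signal are illustrative assumptions, not values stated in the dataset description, and the label of each window is taken from its first sample (assuming label-homogeneous windows).

```python
import numpy as np

def make_windows(signal, labels, window=80, step=40):
    """Slide a fixed-size window over the sensor stream; each window becomes
    one training sequence, labelled by its first sample's activity."""
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        X.append(signal[start:start + window])
        y.append(labels[start])
    return np.array(X), np.array(y)

# toy stream: 200 samples of 3-axis accelerometer data, all one activity
sig = np.zeros((200, 3))
lab = np.full(200, 5)            # 5 = Walking
X, y = make_windows(sig, lab)

n_train = int(len(X) * 0.8)      # 8:2 split within one label
X_train, X_test = X[:n_train], X[n_train:]
```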
Activity:
0 = Downstairs 100,427 (9.1%)
1 = Jogging 342,177 (31.2%)
2 = Sitting 59,939 (5.5%)
3 = Standing 48,395 (4.4%)
4 = Upstairs 122,869 (11.2%)
5 = Walking 424,400 (38.6%)
Resource:
The dataset was collected by the WISDM Lab [https://www.cis.fordham.edu/wisdm/dataset.php]
Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal (2011). "Design Considerations for the WISDM Smart Phone-Based Sensor Mining Architecture," Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (at KDD-11), San Diego, CA. [https://www.cis.fordham.edu/wisdm/includes/files/Lockhart-Design-SensorKDD11.pdf]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data discretization aims to transform a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and data analysis on medical domain problem datasets containing continuous features. However, certain feature values are frequently missing, and many data-mining algorithms cannot handle incomplete datasets. In this study, we considered the use of both discretization and missing-value imputation to process incomplete medical datasets, examining how the order in which discretization and missing-value imputation are performed influences performance. The experimental results were obtained using seven different medical domain problem datasets; two discretizers, the minimum description length principle (MDLP) and ChiMerge; three imputation methods, the mean/mode, classification and regression tree (CART), and k-nearest neighbor (KNN) methods; and two classifiers, support vector machines (SVM) and the C4.5 decision tree. The results show that better performance is obtained by first performing discretization followed by imputation, rather than vice versa. Furthermore, the highest classification accuracy rate was achieved by combining ChiMerge and KNN with SVM.
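The discretize-then-impute ordering can be sketched with scikit-learn as follows. This is a simplified stand-in, not the paper's pipeline: equal-width binning replaces MDLP/ChiMerge, the bin edges are fit on observed values only (since discretizers cannot handle missing entries), and the toy data matrix is invented.

```python
import numpy as np
from sklearn.impute import KNNImputer

def discretize_then_impute(X, n_bins=3, n_neighbors=2):
    """Discretize each continuous column into equal-width bins, fitting the
    edges on observed values only, then KNN-impute the missing bin codes."""
    X = np.asarray(X, dtype=float)
    out = np.full(X.shape, np.nan)
    for j in range(X.shape[1]):
        col = X[:, j]
        obs = col[~np.isnan(col)]
        edges = np.linspace(obs.min(), obs.max(), n_bins + 1)[1:-1]
        out[~np.isnan(col), j] = np.digitize(obs, edges)   # bin codes 0..n_bins-1
    return KNNImputer(n_neighbors=n_neighbors).fit_transform(out)

# toy incomplete matrix: one missing value in the second column
X = [[1.0, 10.0], [2.0, np.nan], [9.0, 30.0], [8.0, 29.0]]
Z = discretize_then_impute(X)
```

Reversing the order (impute the raw continuous values first, then discretize) gives the alternative the paper found inferior.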
Privacy policy: https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.4 (USD Billion) |
| MARKET SIZE 2025 | 2.64 (USD Billion) |
| MARKET SIZE 2035 | 6.8 (USD Billion) |
| SEGMENTS COVERED | Application, Deployment Mode, End User, Functionality, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | Rapid data growth, Increasing AI adoption, Enhanced decision-making capabilities, Integration with existing systems, Rising demand for actionable insights |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Informatica, Tableau, Microsoft, Google, Alteryx, Oracle, Domo, SAP, SAS Institute, Clarivate Analytics, Qlik, RapidMiner, TIBCO Software, Palantir Technologies, Salesforce, IBM |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | AI integration for enhanced analysis, Growing demand for big data solutions, Increased investment in machine learning, Rise of cloud-based platforms, Expansion in healthcare and life sciences. |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 9.9% (2025 - 2035) |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
USPTO-2M is a dataset downloaded from the United States Patent and Trademark Office (USPTO). It contains 2 million records that have been cleaned and organized into JSON format. It can serve as a benchmark dataset for the patent classification task. Provided by Jie Hu from Guizhou University of Finance and Economics and Dr. Jianjun Hu at the University of South Carolina. Citation: Li, Shaobo, Jie Hu, Yuxin Cui, and Jianjun Hu. "DeepPatent: patent classification with convolutional neural networks and word embedding." Scientometrics 117 (2018): 721-744.
A sample of our data:
{
  "Subclass_labels": ["A43B", "A41D", "A43C"],
  "Abstract": "a decorative and or promotional accessory to be secured to a lace such as a shoe lace includes a molded plastic body having a passage longitudinally extending therethrough from a first opening to a second opening the passage is sized and shaped to receive the lace therethrough and to frictionally secure the body in a desired position along the lace the accessory also includes indicia provided on an exterior surface of the accessory which can be in the form of any desired message name number logo graphic or the like an alternative embodiment of the accessory is disclosed which is to be secured to a cap bill this embodiment includes a slot radially extending to the passage which is sized and shaped to receive the cap brim therein and to resiliently grip the bill and removably secure the accessory in a desired position along the bill",
  "Title": "accessory for shoe laces hat brims and the like",
  "No": "US08925116"
}
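Records in this format can be parsed with any JSON library; a minimal Python sketch, using a trimmed version of the sample record (Abstract omitted for brevity):

```python
import json

# trimmed USPTO-2M record (fields: Subclass_labels, Title, No)
record = json.loads("""
{
  "Subclass_labels": ["A43B", "A41D", "A43C"],
  "Title": "accessory for shoe laces hat brims and the like",
  "No": "US08925116"
}
""")

labels = record["Subclass_labels"]   # multi-label target for classification
section = labels[0][0]               # IPC/CPC section letter of the first label
```

Because each patent carries several subclass labels, the classification task is naturally multi-label.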
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This study is dedicated to the introduction of a novel method that automatically extracts potential structural alerts from a data set of molecules. These triggering structures can be further used for knowledge discovery and classification purposes. Computation of the structural alerts results from an implementation of a sophisticated workflow that integrates a graph mining tool guided by growth rate and stability. The growth rate is a well-established measurement of contrast between classes. Moreover, the extracted patterns correspond to formal concepts; the most robust patterns, named the stable emerging patterns (SEPs), can then be identified thanks to their stability, a new notion originating from the domain of formal concept analysis. All of these elements are explained in the paper from the point of view of computation. The method was applied to a molecular data set on mutagenicity. The experimental results demonstrate its efficiency: it automatically outputs a manageable number of structural patterns that are strongly related to mutagenicity. Moreover, a part of the resulting structures corresponds to already known structural alerts. Finally, an in-depth chemical analysis relying on these structures demonstrates how the method can initiate promising processes of chemical knowledge discovery.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Knowledge-based systems for toxicity prediction are typically based on rules, known as structural alerts, that describe relationships between structural features and different toxic effects. The identification of structural features associated with toxicological activity can be a time-consuming process and often requires significant input from domain experts. Here, we describe an emerging pattern mining method for the automated identification of activating structural features in toxicity data sets that is designed to help expedite the process of alert development. We apply the contrast pattern tree mining algorithm to generate a set of emerging patterns of structural fragment descriptors. Using the emerging patterns it is possible to form hierarchical clusters of compounds that are defined by the presence of common structural features and represent distinct chemical classes. The method has been tested on a large public in vitro mutagenicity data set and a public hERG channel inhibition data set and is shown to be effective at identifying common toxic features and recognizable classes of toxicants. We also describe how knowledge developers can use emerging patterns to improve the specificity and sensitivity of an existing expert system.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recent scientific advances have accumulated a tremendous amount of biomedical knowledge, providing novel insights into the relationship between molecular and cellular processes and diseases. Literature mining is one of the commonly used methods to retrieve and extract information from scientific publications for understanding these associations. However, due to the large data volume and complicated, noisy associations, interpreting such association data for semantic knowledge discovery is challenging. In this study, we describe an integrative computational framework aiming to expedite the discovery of latent disease mechanisms by dissecting 146,245 disease-gene associations from over 25 million PubMed-indexed articles. We take advantage of both Latent Dirichlet Allocation (LDA) modeling and network-based analysis for their respective capabilities of detecting latent associations and reducing noise in large-volume data. Our results demonstrate that (1) the LDA-based modeling is able to group similar diseases into disease topics; (2) the disease-specific association networks follow the scale-free network property; (3) certain subnetwork patterns were enriched in the disease-specific association networks; and (4) genes were enriched in topic-specific biological processes. Our approach offers promising opportunities for latent disease-gene knowledge discovery in biomedical research.
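The LDA stage of such a pipeline can be sketched with scikit-learn. The disease-gene "documents" below are invented toy data, not the paper's corpus; each document lists genes co-mentioned with one disease, and LDA groups documents with overlapping gene vocabularies into shared topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy documents: genes co-mentioned with each disease (invented examples)
docs = [
    "BRCA1 BRCA2 TP53",   # breast cancer
    "BRCA1 TP53 PTEN",    # ovarian cancer
    "APP PSEN1 APOE",     # Alzheimer's disease
    "APOE PSEN1 PSEN2",   # early-onset Alzheimer's disease
]

# keep gene symbols intact: no lowercasing, split on whitespace
X = CountVectorizer(lowercase=False, token_pattern=r"\S+").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)   # per-document topic mixtures; each row sums to 1
```

Documents whose mixtures concentrate on the same topic form a candidate disease topic, which the framework then refines with network-based analysis.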