Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The objective of this work is to improve the quality of the information in the CubaCiencia database of the Institute of Scientific and Technological Information. This database holds bibliographic information covering four segments of science and is the main database of the Library Management System. The methodology applied was based on decision trees, the correlation matrix, 3D scatter plots, and other techniques used in data mining for the study of large volumes of information. The results achieved not only made it possible to improve the information in the database, but also provided truly useful patterns for meeting the proposed objectives.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The graph shows the changes in the journal's impact factor and its corresponding percentile, for comparison with the entire literature. The impact factor is the most common scientometric index; it is defined as the number of citations received in a year by papers published in the two preceding years, divided by the number of papers published in those years.
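That definition can be written out directly. The numbers below are invented for illustration and are not taken from the graph:

```python
def impact_factor(citations_this_year, papers_prev_two_years):
    """Two-year impact factor: citations received this year by items
    published in the two preceding years, divided by the number of
    citable items published in those two years."""
    if papers_prev_two_years == 0:
        raise ValueError("no citable items in the preceding two years")
    return citations_this_year / papers_prev_two_years

# Hypothetical journal: 480 citations in 2024 to its 2022-2023 papers,
# of which there were 200.
print(impact_factor(480, 200))  # 2.4
```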
According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.
The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.
Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.
The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.
From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.
The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD market.
According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecasted to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. As per our latest research, organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.
A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, Internet of Things (IoT), and cloud computing is further amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.
Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.
The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.
From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.
The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the overall ecosystem.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of Top Authors of Data Mining and Knowledge Discovery sorted by articles.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
For more information about the contents, refer to http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
The dataset is shared on Kaggle on behalf of the KDD competition.
Build a classifier capable of distinguishing between attacks and normal connections.
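The task can be sketched end to end with a toy classifier. The snippet below uses synthetic two-feature records standing in for KDD connection attributes (real records have 41 features) and a simple nearest-centroid rule; it is an illustrative sketch, not a competition-grade model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative numeric features standing in for KDD Cup 99 connection
# attributes (e.g. a connection count and an error rate); purely synthetic.
normal = rng.normal(loc=[10.0, 0.05], scale=[3.0, 0.02], size=(200, 2))
attack = rng.normal(loc=[250.0, 0.9], scale=[40.0, 0.05], size=(200, 2))

X = np.vstack([normal, attack])
y = np.array([0] * 200 + [1] * 200)  # 0 = normal, 1 = attack

# Nearest-centroid classifier: label each record by the closer class mean.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

preds = np.array([predict(x) for x in X])
print("training accuracy:", (preds == y).mean())
```

On real KDD Cup 99 data the features would first need scaling and encoding of the symbolic attributes; this sketch only shows the shape of the task.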
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of Top Institutions of Data Mining and Knowledge Discovery sorted by citations.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning is transforming the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review delves into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining of the Scopus database, with analysis of citations, keywords, and trends. The analysis first takes a "macro" scope, in which hundreds of literature reports are computer-analyzed for key insights such as publication year, publication origin, and word co-occurrence, using heat maps and network graphs. The focus is then narrowed to a more specific "micro" scope derived from the "macro" overview, diving deeper into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts yield arguments that inform the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature "metrics" into "MRR", "ROC or AUC", "BLEU Score", "Accuracy", "Precision", "Recall", "F1 Measure", and "Other Metrics", where "Other Metrics" refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as "SE Data" and "Reproducibility Types". This separation into more detailed classes contributes to a better understanding and classification of each paper by the data mining tasks or methods.
Transformation. In this stage, we did not use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
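The PCA step described here can be sketched with plain NumPy. The data below are random stand-ins for the encoded 35-feature matrix; in the actual pipeline the nominal features would first be numerically encoded:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 35))  # 40 hypothetical papers x 35 encoded features

# Centre the data, then project onto the top-2 right singular vectors (PCA).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt[:2].T  # shape (40, 2), ready for a 2-D scatter plot

# Fraction of variance retained by the two components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(components.shape, round(float(explained), 3))
```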
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A full explanation is provided in the subsection "Data Mining Tasks for the SLR of DL4SE".
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble "actionable knowledge". This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produced an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both premises and conclusions. An arrow connecting a premise with a conclusion implies that, given the premise, the conclusion is associated. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
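These two definitions translate directly into code. The paper records below are hypothetical and only illustrate the computation:

```python
def support(rows, itemset):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(rows, premise, conclusion):
    """support(premise AND conclusion) / support(premise)."""
    return support(rows, premise | conclusion) / support(rows, premise)

# Toy rows mirroring the example attributes above (hypothetical data):
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "reproducible"},
]
print(support(papers, {"supervised", "irreproducible"}))        # 0.5
print(confidence(papers, {"supervised"}, {"irreproducible"}))   # 0.666...
```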
This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPARQL query example 1. This text file contains the SPARQL query we apply to our PGx linked data to obtain the data graph represented in Fig. 3. The query includes the definitions of the prefixes mentioned in Figs. 2 and 3, and takes about 30 s on our server at https://pgxlod.loria.fr. (TXT 2 kb)
The worldwide civilian aviation system is one of the most complex dynamical systems ever created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1 Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
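The multivariate time-series search idea can be illustrated with a deliberately simple brute-force version. The paper's algorithm is far more scalable; this sketch, on synthetic data, only shows what it means to slide an anomaly signature over a multivariate series:

```python
import numpy as np

def nearest_match(series, query):
    """Brute-force multivariate subsequence search: slide `query`
    (shape w x d) over `series` (shape n x d) and return the offset
    with the smallest Euclidean (Frobenius) distance."""
    n, w = len(series), len(query)
    dists = [np.linalg.norm(series[i:i + w] - query) for i in range(n - w + 1)]
    return int(np.argmin(dists)), float(min(dists))

rng = np.random.default_rng(2)
flight = rng.normal(size=(500, 3))   # 3 hypothetical recorded parameters
signature = flight[120:140].copy()   # a previously discovered anomaly signature
offset, dist = nearest_match(flight, signature)
print(offset, round(dist, 6))        # 120 0.0
```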
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPARQL query example 2. This text file contains an example of a SPARQL query that enables exploration of the vicinity of an entity. This particular query returns the RDF graph surrounding, within a path length of 4, the node pharmgkb:PA451906, which represents warfarin, an anticoagulant drug. (TXT 392 bytes)
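For illustration, a vicinity query of this kind might start from the immediate neighborhood of the node. The prefix IRI and query shape below are assumptions, not the contents of the actual file, whose query extends this one-hop idea to paths of length up to 4:

```python
# Hypothetical sketch (not the query from the paper): a CONSTRUCT query
# returning the triples immediately surrounding a node.
node = "pharmgkb:PA451906"
query = f"""
PREFIX pharmgkb: <https://pharmgkb.org/>  # illustrative prefix IRI
CONSTRUCT {{ {node} ?p ?o . ?s ?q {node} . }}
WHERE {{
  {{ {node} ?p ?o . }} UNION {{ ?s ?q {node} . }}
}}
"""
print("CONSTRUCT" in query)  # True
```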
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The KDD Cup 1999 dataset is one of the earliest and most widely used benchmark datasets for network intrusion detection research.
It was created for the Third International Knowledge Discovery and Data Mining Tools Competition, hosted by the UCI KDD Archive, using network traffic captured in a simulated military environment at the MIT Lincoln Laboratory. The dataset contains both normal and malicious traffic, with attacks grouped into four main categories: Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R).
The KDD Cup 1999 dataset has been extensively used in academia for evaluating IDS algorithms due to its:
- Large size and labeled structure
- Multiple attack types
- Historical significance in the development of intrusion detection systems
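The four attack categories can be encoded as a label-to-category lookup. The mapping below covers commonly cited labels and is a partial sketch; spellings should be checked against the specific data file used:

```python
# Partial mapping from KDD Cup 1999 attack labels to the four categories.
CATEGORIES = {
    "normal": "normal",
    # Denial of Service
    "smurf": "DoS", "neptune": "DoS", "back": "DoS",
    "teardrop": "DoS", "pod": "DoS", "land": "DoS",
    # Probe
    "satan": "Probe", "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe",
    # Remote to Local
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "warezmaster": "R2L", "warezclient": "R2L", "spy": "R2L", "multihop": "R2L",
    # User to Root
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R", "perl": "U2R",
}

def category(label):
    # Labels in the raw file carry a trailing '.' (e.g. "smurf.").
    return CATEGORIES.get(label.rstrip("."), "unknown")

print(category("smurf."), category("guess_passwd"))  # DoS R2L
```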
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput 'omics' data has created a need for tools and platforms that allow researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements include expanded coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We also describe two usage examples on trans-omics data analysis and the extraction of gene–disease associations using MeSH term descriptors, which demonstrate how the newer enhancements in TargetMine contribute to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.
According to our latest research, the global Knowledge Discovery Platform market size in 2024 stands at USD 17.2 billion, reflecting robust adoption across industries. The market is experiencing a strong growth momentum, with a compound annual growth rate (CAGR) of 18.5% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 89.7 billion. This rapid expansion is primarily driven by escalating data volumes, the imperative for actionable business intelligence, and the proliferation of artificial intelligence and machine learning technologies. As organizations seek to harness the power of big data for competitive advantage, the demand for advanced Knowledge Discovery Platforms continues to surge globally.
One of the principal growth factors propelling the Knowledge Discovery Platform market is the exponential increase in data generated by enterprises, governments, and consumers. The digital transformation wave has resulted in data being produced at an unprecedented rate, from social media interactions to IoT devices, transactional records, and digital documents. Organizations are under mounting pressure to extract meaningful insights from this sea of information to inform strategic decisions, optimize operations, and enhance customer experiences. Knowledge Discovery Platforms, equipped with sophisticated data mining, text analytics, and visualization tools, enable businesses to uncover hidden patterns, trends, and correlations within massive datasets. This capability is particularly vital in sectors such as BFSI, healthcare, and retail, where timely and accurate insights can directly impact profitability and risk management.
Another significant driver is the growing integration of artificial intelligence and machine learning algorithms into Knowledge Discovery Platforms. These intelligent systems automate complex analytical processes, reducing the reliance on manual data exploration and accelerating time-to-insight. Predictive analytics functionalities, for example, empower organizations to anticipate market trends, customer behaviors, and operational risks with greater precision. As AI and ML technologies mature, their seamless incorporation into knowledge discovery workflows enhances the platforms' ability to handle unstructured data, perform sentiment analysis, and support real-time decision-making. The increasing availability of cloud-based solutions further democratizes access, enabling even small and medium enterprises to leverage advanced analytics without heavy upfront investments in infrastructure.
The regulatory landscape and the need for compliance are also fueling the adoption of Knowledge Discovery Platforms. Industries such as banking, healthcare, and government face stringent requirements around data governance, privacy, and reporting. Advanced platforms help organizations maintain compliance by providing traceable, auditable insights and supporting data lineage tracking. Moreover, the rise of explainable AI and transparent analytics has become crucial for organizations seeking to build trust with regulators, partners, and customers. As regulations evolve to address new data privacy and security concerns, the role of robust knowledge discovery solutions in ensuring organizational resilience and accountability becomes even more pronounced.
From a regional perspective, North America leads the market, driven by early technology adoption, a strong presence of leading vendors, and high enterprise IT spending. Europe follows closely, with substantial investments in digital transformation and data-driven initiatives across key sectors. The Asia Pacific region is witnessing the fastest growth, propelled by rapid industrialization, expanding digital infrastructure, and government-led smart initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, supported by increasing awareness of data-driven decision-making and the gradual modernization of business processes. Each region presents unique opportunities and challenges, shaped by local regulatory environments, technological readiness, and industry dynamics.
Data Mining Tools are integral to the functionality of Knowledge Discovery Platforms, offering organizations the ability to process and analyze vast amounts of data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: Sugarcane mills in Brazil collect a vast amount of data relating to production on an annual basis. The analysis of this type of database is complex, especially when factors relating to varieties, climate, detailed management techniques, and edaphic conditions are taken into account. The aim of this paper was to perform a decision tree analysis of a detailed database from a production unit and to evaluate the actionable patterns found in terms of their usefulness for increasing production. The decision tree revealed interpretable patterns relating to sugarcane yield (R2 = 0.617), some of which were actionable and had been previously studied and reported in the literature. Based on two actionable patterns relating to soil chemistry, interventions that would increase production by almost 2 % were suitable for recommendation. The method was successful in reproducing expert knowledge of the factors that influence sugarcane yield, and the decision trees can support decision-making in the context of production and the formulation of hypotheses for specific experiments.
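The core of such a decision tree analysis is repeatedly choosing the split that most reduces variance in the yield target. A minimal single-split sketch, with all numbers hypothetical:

```python
def best_split(xs, ys):
    """One step of a regression tree: find the threshold on a single
    variable that maximises variance reduction in the target."""
    def var(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v) / len(v)

    base, best = var(ys), None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = base - (len(left) * var(left) + len(right) * var(right)) / len(ys)
        if best is None or gain > best[1]:
            best = (t, gain)
    return best  # (threshold, variance reduction)

# Hypothetical numbers: a soil chemistry attribute vs yield (t/ha).
soil_attr = [1.0, 1.2, 1.4, 3.0, 3.2, 3.5]
yield_tha = [70, 72, 71, 88, 90, 87]
print(best_split(soil_attr, yield_tha))
```

A full tree applies this search recursively over all attributes, which is how interpretable rules like those in the paper emerge.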
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
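The PPMI weighting used by these count-based models can be sketched in a few lines of NumPy. This minimal version omits the context-distribution smoothing (alpha) and shift (k) hyperparameters listed above, and the toy counts are invented:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a word-by-context co-occurrence count matrix:
    max(0, log2( p(w,c) / (p(w) * p(c)) ))."""
    total = counts.sum()
    pwc = counts / total
    pw = pwc.sum(axis=1, keepdims=True)
    pc = pwc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(pwc / (pw * pc))
    return np.maximum(pmi, 0)  # negative and undefined cells clip to 0

# Toy 2x2 co-occurrence counts (hypothetical):
C = np.array([[8.0, 2.0],
              [2.0, 8.0]])
P = ppmi(C)
print(P)
```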
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
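The stopword-removal step described above amounts to filtering each tokenised text against Vatri's stoplist. A minimal sketch, assuming the stoplist is a plain-text file with one stopword per line (check the figshare file before relying on that format), and with a tiny hypothetical stopword subset used for the demonstration:

```python
def load_stopwords(path):
    """Read a one-stopword-per-line file into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    """Drop any token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

stopwords = {"καί", "δέ", "ὁ"}  # tiny hypothetical subset, not the real list
tokens = ["ὁ", "στρατηγός", "καί", "ὁ", "λόγος"]
print(remove_stopwords(tokens, stopwords))  # ['στρατηγός', 'λόγος']
```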
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek & Sojka 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1), 55-65. https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.