81 datasets found

Data from: Results obtained in a data mining process applied to a database...
scielo.figshare.com
jpeg
Updated Jun 4, 2023
Cite
E.M. Ruiz Lobaina; C. P. Romero Suárez (2023). Results obtained in a data mining process applied to a database containing bibliographic information concerning four segments of science. [Dataset]. http://doi.org/10.6084/m9.figshare.20011798.v1
Explore at:
Available download formats
jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.20011798.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELO, http://www.scielo.org/
Authors
E.M. Ruiz Lobaina; C. P. Romero Suárez
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract The objective of this work is to improve the quality of the information that belongs to the database CubaCiencia, of the Institute of Scientific and Technological Information. This database has bibliographic information referring to four segments of science and is the main database of the Library Management System. The applied methodology was based on the Decision Trees, the Correlation Matrix, the 3D Scatter Plot, etc., which are techniques used by data mining, for the study of large volumes of information. The results achieved not only made it possible to improve the information in the database, but also provided truly useful patterns in the solution of the proposed objectives.
Data Mining and Knowledge Discovery - impact-factor
exaly.com
csv, json
Updated Nov 1, 2025
Cite
(2025). Data Mining and Knowledge Discovery - impact-factor [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery
Explore at:
Available download formats
csv, json
Dataset updated
Nov 1, 2025
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The graph shows the changes in the impact factor of Data Mining and Knowledge Discovery and its corresponding percentile, for comparison with the entire literature. The Impact Factor is the most common scientometric index; it is defined as the number of citations received in a given year by papers published in the two preceding years, divided by the number of papers published in those two years.
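
The two-year ratio described above can be written as a small function. This is a minimal sketch: the function name and the counts in the example are hypothetical and are not taken from the dataset.

```python
def impact_factor(citations_to_prior_two_years: int, papers_in_prior_two_years: int) -> float:
    """Citations received in a year by papers from the two preceding years,
    divided by the number of papers published in those two years."""
    if papers_in_prior_two_years <= 0:
        raise ValueError("at least one published paper is required")
    return citations_to_prior_two_years / papers_in_prior_two_years


# Hypothetical counts, for illustration only.
example = impact_factor(citations_to_prior_two_years=600, papers_in_prior_two_years=150)
print(example)  # 4.0
```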
Knowledge Discovery in Databases Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 22, 2025
Cite
Growth Market Reports (2025). Knowledge Discovery in Databases Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-in-databases-market
Explore at:
Available download formats
pptx, csv, pdf
Dataset updated
Aug 22, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Knowledge Discovery in Databases Market Outlook

According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.
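
As a rough cross-check of the figures quoted above, the relationship between a start value, an end value, and a compound annual growth rate can be sketched as follows. The report does not say which years it compounds over, so the nine annual periods (2024 to 2033) used here are an assumption, and the function names are illustrative.

```python
def implied_cagr(start_value: float, end_value: float, periods: int) -> float:
    """Growth rate that turns start_value into end_value over `periods` compounding steps."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value: float, rate: float, periods: int) -> float:
    """Value reached after compounding start_value at `rate` for `periods` steps."""
    return start_value * (1 + rate) ** periods


# Figures quoted in the report text; the period count is an assumption.
start, end, stated_rate, years = 8.7, 44.9, 0.185, 9

print(f"implied CAGR: {implied_cagr(start, end, years):.1%}")
print(f"projection at the stated rate: USD {project(start, stated_rate, years):.1f} billion")
```

Changing `years` shows how sensitive the comparison is to the choice of base year.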

The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.

Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.

The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.

From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.

Component Analysis

The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma
Knowledge Discovery In Databases Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Cite
Dataintelo (2025). Knowledge Discovery In Databases Market Research Report 2033 [Dataset]. https://dataintelo.com/report/knowledge-discovery-in-databases-market
Explore at:
Available download formats
pptx, csv, pdf
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Knowledge Discovery in Databases (KDD) Market Outlook

According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecasted to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. As per our latest research, organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.

A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, Internet of Things (IoT), and cloud computing is further amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.

Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.

The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.

From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.

Component Analysis

The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the
List of Top Authors of Data Mining and Knowledge Discovery sorted by...
exaly.com
csv, json
Updated Nov 1, 2025
Cite
(2025). List of Top Authors of Data Mining and Knowledge Discovery sorted by articles [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery
Explore at:
Available download formats
json, csv
Dataset updated
Nov 1, 2025
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
List of Top Authors of Data Mining and Knowledge Discovery sorted by articles.
kdd cyberattack
kaggle.com
zip
Updated Jul 28, 2018
Cite
Ziyad Mestour (2018). kdd cyberattack [Dataset]. https://www.kaggle.com/slashtea/kdd-cyberattack
Explore at:
Available download formats
zip (2298343 bytes)
Dataset updated
Jul 28, 2018
Authors
Ziyad Mestour
License
Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Context

This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Content

For more information about the contents refer to this link http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Acknowledgements

The dataset is shared on Kaggle on behalf of KDD's work.

Inspiration

Build a classifier capable of distinguishing between attacks and normal connections.
List of Top Institutions of Data Mining and Knowledge Discovery sorted by...
exaly.com
csv, json
Updated Nov 1, 2025
Cite
(2025). List of Top Institutions of Data Mining and Knowledge Discovery sorted by citations [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery/top-institutions
Explore at:
Available download formats
json, csv
Dataset updated
Nov 1, 2025
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
List of Top Institutions of Data Mining and Knowledge Discovery sorted by citations.
Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...
acs.figshare.com
figshare.com
xlsx
Updated Jun 23, 2025
Cite
Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. http://doi.org/10.1021/acsmaterialsau.5c00030.s001
Explore at:
Available download formats
xlsx
Unique identifier
https://doi.org/10.1021/acsmaterialsau.5c00030.s001
Dataset updated
Jun 23, 2025
Dataset provided by
ACS Publications
Authors
Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
Data Analysis for the Systematic Literature Review of DL4SE
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jul 19, 2024
Cite
Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Washington and Lee University
College of William and Mary
Authors
Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.
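
The normalization of the “metrics” feature described above could look roughly like the following sketch. The category names are the ones quoted in the description; the raw spellings in `ALIASES` and the function name are hypothetical, since the source does not list the original values.

```python
# Normalized categories quoted in the description.
CATEGORIES = ["MRR", "ROC or AUC", "BLEU Score", "Accuracy",
              "Precision", "Recall", "F1 Measure", "Other Metrics"]

# Hypothetical raw spellings mapped to a normalized category (assumption).
ALIASES = {
    "mrr": "MRR",
    "mean reciprocal rank": "MRR",
    "roc": "ROC or AUC",
    "auc": "ROC or AUC",
    "bleu": "BLEU Score",
    "accuracy": "Accuracy",
    "precision": "Precision",
    "recall": "Recall",
    "f1": "F1 Measure",
    "f-measure": "F1 Measure",
}


def normalize_metric(raw: str) -> str:
    """Map a raw metric name to one of CATEGORIES; unknown names fall into 'Other Metrics'."""
    return ALIASES.get(raw.strip().lower(), "Other Metrics")


print([normalize_metric(m) for m in ["AUC", " f1 ", "perplexity"]])
```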

Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

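Support = the number of occurrences in which the statement is true divided by the total number of statements.

Confidence = the support of the statement divided by the support of the premise (equivalently, the number of occurrences in which the statement is true divided by the number of occurrences of the premise).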
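
To make the two measures concrete, here is a minimal sketch that computes support and confidence for a rule "premise implies conclusion" over a list of records. The records and tag names are made up for illustration and do not come from the DL4SE repository; the authors used RapidMiner, not this code.

```python
from typing import List, Set


def support(records: List[Set[str]], items: Set[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    if not records:
        return 0.0
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records: List[Set[str]], premise: Set[str], conclusion: Set[str]) -> float:
    """Support of premise and conclusion together, divided by support of the premise."""
    premise_support = support(records, premise)
    if premise_support == 0:
        return 0.0  # the premise never occurs, so the rule has no evidence
    return support(records, premise | conclusion) / premise_support


# Hypothetical paper annotations (assumption, for illustration only).
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "reproducible"},
]

print(support(papers, {"supervised", "irreproducible"}))       # 0.5
print(confidence(papers, {"supervised"}, {"irreproducible"}))  # 2 of 3 supervised papers
```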
Data from: KDD Cup 1999 Data
impactcybertrust.org
kaggle.com
Updated Jan 19, 2019
Cite
External Data Source (2019). KDD Cup 1999 Data [Dataset]. http://doi.org/10.23721/100/1478801
Explore at:
Unique identifier
https://doi.org/10.23721/100/1478801
Dataset updated
Jan 19, 2019
Authors
External Data Source
Description
This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections.

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. ; gcounsel@ics.uci.edu
Additional file 1 of Learning from biomedical linked data to suggest valid...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
txt
Updated Jun 1, 2023
Cite
Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet (2023). Additional file 1 of Learning from biomedical linked data to suggest valid pharmacogenes [Dataset]. http://doi.org/10.6084/m9.figshare.c.3747806_D1.v1
Explore at:
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3747806_D1.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SPARQL query example 1. This text file contains the SPARQL query we apply on our PGx linked data to obtain the data graph represented in Fig. 3. This query includes the definition of prefixes mentioned in Figs. 2 and 3. This query takes about 30 s on our https://pgxlod.loria.fr server. (TXT 2 kb)
Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...
catalog.data.gov
s.cnmilf.com
+3more
Updated Apr 10, 2025
Cite
Dashlink (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://catalog.data.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
Additional file 2 of Learning from biomedical linked data to suggest valid...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
txt
Updated May 31, 2023
Cite
Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet (2023). Additional file 2 of Learning from biomedical linked data to suggest valid pharmacogenes [Dataset]. http://doi.org/10.6084/m9.figshare.c.3747806_D2.v1
Explore at:
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3747806_D2.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SPARQL query example 2. This text file contains an example SPARQL query that enables exploring the vicinity of an entity. This particular query returns the RDF graph surrounding, within a length of 4, the node pharmgkb:PA451906, which represents warfarin, an anticoagulant drug. (TXT 392 bytes)
KDD-99 Original dataset
kaggle.com
zip
Updated Aug 13, 2025
Cite
nagi (2025). KDD-99 Original dataset [Dataset]. https://www.kaggle.com/datasets/primus11/kdd-99-original-dataset
Explore at:
Available download formats
zip (19081776 bytes)
Dataset updated
Aug 13, 2025
Authors
nagi
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
KDD Cup 1999 Dataset

The KDD Cup 1999 dataset is one of the earliest and most widely used benchmark datasets for network intrusion detection research.
It was created for the Third International Knowledge Discovery and Data Mining Tools Competition, hosted by the UCI KDD Archive, using network traffic captured in a simulated military environment at the MIT Lincoln Laboratory. The dataset contains both normal and malicious traffic, with attacks grouped into four main categories: Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R).

Key Characteristics

Simulated Traffic Environment: Network traffic was generated in a controlled environment to replicate a military network under attack.

Attack Categories:

DoS: e.g., smurf, neptune, teardrop

Probe: e.g., satan, nmap, ipsweep

R2L: e.g., guess_passwd, ftp_write, imap

U2R: e.g., buffer_overflow, rootkit, perl

Data Capture: Raw TCP dump data was processed into connection records.

Feature Extraction: Each record contains 41 features, including:

Basic features: Duration, protocol type, service, flag

Content features: Failed login counts, number of file creations

Traffic features: Connection counts within time windows, percentage of specific connections

Labeling: Each record is labeled as normal or as one of the specific attack types.

Data Volume: Around 4.9 million records in the full dataset; a 10% subset is also available.
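
The record layout and attack grouping listed above can be sketched as a small parser. This assumes each record is a comma-separated line of 41 feature values followed by a label, and that labels may carry a trailing period; both are assumptions about the file layout, and the sample record is made up rather than taken from the dataset. Only the attack names quoted above are mapped.

```python
from typing import List, Tuple

NUM_FEATURES = 41  # feature count stated in the description

# Attack names quoted in the description, grouped by category.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS", "teardrop": "DoS",
    "satan": "Probe", "nmap": "Probe", "ipsweep": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "perl": "U2R",
}


def parse_record(line: str) -> Tuple[List[str], str]:
    """Split one record into its feature values and its label."""
    fields = line.strip().split(",")
    if len(fields) != NUM_FEATURES + 1:
        raise ValueError(f"expected {NUM_FEATURES + 1} fields, got {len(fields)}")
    # Assumption: a trailing period on the label, if present, is not part of the name.
    return fields[:-1], fields[-1].rstrip(".")


def categorize(label: str) -> str:
    """Map a label to 'normal', one of the four attack categories, or 'unknown'."""
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


# Made-up record: 4 basic features, 37 zero placeholders, then the label.
sample = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37 + ["smurf."])
features, label = parse_record(sample)
print(len(features), label, categorize(label))
```

For binary classification, `categorize(label) != "normal"` gives the attack flag; the category string serves as the multi-class target.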

Advantages

Established as a historical benchmark in IDS research.

Covers multiple attack categories for classification tasks.

Suitable for binary classification (normal vs. attack) and multi-class classification (attack type identification).

Limitations

Contains high redundancy (~78% repeated records) which can bias model performance.

Traffic patterns are outdated and may not reflect modern threats.

Imbalanced distribution of attack categories.
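
Before training on data with these limitations, it can help to measure the redundancy and the class balance directly. The following is a minimal sketch using only the standard library; the toy records are invented, and the helper names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def duplicate_fraction(records: List[str]) -> float:
    """Share of records that are exact repeats of an earlier record."""
    if not records:
        return 0.0
    return 1 - len(set(records)) / len(records)


def deduplicate(records: List[str]) -> List[str]:
    """Drop exact repeats while keeping first-seen order."""
    return list(dict.fromkeys(records))


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy records in "features,label" form (assumption, for illustration only).
toy = ["a,smurf", "a,smurf", "a,smurf", "b,normal"]

print(duplicate_fraction(toy))  # 0.5
unique = deduplicate(toy)
print(label_distribution([r.rsplit(",", 1)[1] for r in unique]))
```

Comparing the label distribution before and after deduplication shows how much of the imbalance comes from repeated records.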

Usage

The KDD Cup 1999 dataset has been extensively used in academia for evaluating IDS algorithms due to its:

- Large size and labeled structure
- Multiple attack types
- Historical significance in the development of intrusion detection systems
DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf
frontiersin.figshare.com
pdf
Updated May 31, 2023
Cite
Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi (2023). DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00934.s001
Explore at:
Available download formats
pdf
Unique identifier
https://doi.org/10.3389/fgene.2019.00934.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput ‘omics’ data has necessitated a need for tools and platforms that allow the researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements have included: an enhanced coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We have also described two usage examples on trans-omics data analysis and extraction of gene-disease associations using MeSH term descriptors. These examples have demonstrated how the newer enhancements in TargetMine have contributed to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.
Knowledge Discovery Platform Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Knowledge Discovery Platform Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-platform-market
Available download formats: csv, pdf, pptx
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Knowledge Discovery Platform Market Outlook

According to our latest research, the global Knowledge Discovery Platform market size in 2024 stands at USD 17.2 billion, reflecting robust adoption across industries. The market is experiencing a strong growth momentum, with a compound annual growth rate (CAGR) of 18.5% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 89.7 billion. This rapid expansion is primarily driven by escalating data volumes, the imperative for actionable business intelligence, and the proliferation of artificial intelligence and machine learning technologies. As organizations seek to harness the power of big data for competitive advantage, the demand for advanced Knowledge Discovery Platforms continues to surge globally.
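
The relationship between the quoted figures can be checked with the standard compound-growth formula. This is a minimal sketch, assuming nine annual compounding steps from the 2024 base to 2033; the report may use a different base year or convention, and the function names are hypothetical.

```python
def project(value, rate, periods):
    """Compound a starting value at a fixed annual rate."""
    return value * (1 + rate) ** periods


def implied_cagr(start, end, periods):
    """Annual growth rate that takes `start` to `end` in `periods` steps."""
    return (end / start) ** (1 / periods) - 1


base_2024 = 17.2      # USD billion, as quoted
target_2033 = 89.7    # USD billion, as quoted
quoted_cagr = 0.185   # 18.5%, as quoted
periods = 2033 - 2024  # assumption: nine annual compounding steps

projected = project(base_2024, quoted_cagr, periods)
implied = implied_cagr(base_2024, target_2033, periods)
print(f"projected 2033 value at quoted CAGR: {projected:.1f}")
print(f"CAGR implied by the two endpoints: {implied:.1%}")
```

Under that nine-step assumption the quoted rate yields roughly USD 79 billion, and the two endpoints imply a rate near 20%, so the published figures presumably rest on a different base year or rounding.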

One of the principal growth factors propelling the Knowledge Discovery Platform market is the exponential increase in data generated by enterprises, governments, and consumers. The digital transformation wave has resulted in data being produced at an unprecedented rate, from social media interactions to IoT devices, transactional records, and digital documents. Organizations are under mounting pressure to extract meaningful insights from this sea of information to inform strategic decisions, optimize operations, and enhance customer experiences. Knowledge Discovery Platforms, equipped with sophisticated data mining, text analytics, and visualization tools, enable businesses to uncover hidden patterns, trends, and correlations within massive datasets. This capability is particularly vital in sectors such as BFSI, healthcare, and retail, where timely and accurate insights can directly impact profitability and risk management.

Another significant driver is the growing integration of artificial intelligence and machine learning algorithms into Knowledge Discovery Platforms. These intelligent systems automate complex analytical processes, reducing the reliance on manual data exploration and accelerating time-to-insight. Predictive analytics functionalities, for example, empower organizations to anticipate market trends, customer behaviors, and operational risks with greater precision. As AI and ML technologies mature, their seamless incorporation into knowledge discovery workflows enhances the platforms' ability to handle unstructured data, perform sentiment analysis, and support real-time decision-making. The increasing availability of cloud-based solutions further democratizes access, enabling even small and medium enterprises to leverage advanced analytics without heavy upfront investments in infrastructure.

The regulatory landscape and the need for compliance are also fueling the adoption of Knowledge Discovery Platforms. Industries such as banking, healthcare, and government face stringent requirements around data governance, privacy, and reporting. Advanced platforms help organizations maintain compliance by providing traceable, auditable insights and supporting data lineage tracking. Moreover, the rise of explainable AI and transparent analytics has become crucial for organizations seeking to build trust with regulators, partners, and customers. As regulations evolve to address new data privacy and security concerns, the role of robust knowledge discovery solutions in ensuring organizational resilience and accountability becomes even more pronounced.

From a regional perspective, North America leads the market, driven by early technology adoption, a strong presence of leading vendors, and high enterprise IT spending. Europe follows closely, with substantial investments in digital transformation and data-driven initiatives across key sectors. The Asia Pacific region is witnessing the fastest growth, propelled by rapid industrialization, expanding digital infrastructure, and government-led smart initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, supported by increasing awareness of data-driven decision-making and the gradual modernization of business processes. Each region presents unique opportunities and challenges, shaped by local regulatory environments, technological readiness, and industry dynamics.

Data Mining Tools are integral to the functionality of Knowledge Discovery Platforms, offering organizations the ability to process and analyze vast amoun
Data from: Identification of patterns for increasing production with...
scielo.figshare.com
jpeg
Updated Jun 1, 2023
Cite
Paulo Rodrigues Peloia; Felipe Ferreira Bocca; Luiz Henrique Antunes Rodrigues (2023). Identification of patterns for increasing production with decision trees in sugarcane mill data [Dataset]. http://doi.org/10.6084/m9.figshare.7899809.v1
Available download formats: jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.7899809.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SciELO (http://www.scielo.org/)
Authors
Paulo Rodrigues Peloia; Felipe Ferreira Bocca; Luiz Henrique Antunes Rodrigues
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT: Sugarcane mills in Brazil collect a vast amount of data relating to production on an annual basis. The analysis of this type of database is complex, especially when factors relating to varieties, climate, detailed management techniques, and edaphic conditions are taken into account. The aim of this paper was to perform a decision tree analysis of a detailed database from a production unit and to evaluate the actionable patterns found in terms of their usefulness for increasing production. The decision tree revealed interpretable patterns relating to sugarcane yield (R² = 0.617), some of which were actionable and had been previously studied and reported in the literature. Based on two actionable patterns relating to soil chemistry, interventions that would increase production by almost 2% were suitable for recommendation. The method was successful in reproducing experts' knowledge of the factors that influence sugarcane yield, and the decision trees can support the decision-making process in the context of production and the formulation of hypotheses for specific experiments.
Data from: Ancient Greek language models
data.niaid.nih.gov
Updated Apr 29, 2024
Cite
Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
Dataset updated
Apr 29, 2024
Dataset provided by
Barbara
Silvia
Malvina
Saskia
Nilo
Authors
Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

Diachronica models

Training data

Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

Classical subcorpus

Hellenistic subcorpus

Whole corpus

Models are named according to the (sub)corpus they are trained on: hel_ or hellenestic is added to the names of models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus.
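
The naming convention can be decoded mechanically. The following is a small sketch, assuming the short markers `hel_`, `clas_`, and `full_` appear at the start of a model name; the example names are hypothetical and are not taken from the repository.

```python
# Assumed placement: the marker is a prefix of the model name.
PREFIX_TO_CORPUS = {
    "hel_": "Hellenistic subcorpus",
    "clas_": "Classical subcorpus",
    "full_": "whole corpus",
}


def corpus_of(model_name):
    """Return the (sub)corpus a model was trained on, or None if unrecognised."""
    for prefix, corpus in PREFIX_TO_CORPUS.items():
        if model_name.startswith(prefix):
            return corpus
    return None


# Hypothetical model names for illustration only.
for name in ["hel_example_model", "full_example_model", "unlabelled_model"]:
    print(name, "->", corpus_of(name))
```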

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
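
To make the PPMI hyperparameters concrete, here is a minimal pure-Python sketch of a shifted, smoothed PPMI weighting. It assumes the common formulation in which `alpha` is the context-distribution smoothing exponent and `log k` is subtracted as a shift; LSCDetection's own implementation may differ in detail, and the toy tokens stand in for a real corpus.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts, k=1, alpha=0.75):
    """Keep only positive values of log(P(w,c) / (P(w) * P_alpha(c))) - log(k)."""
    word_tot, ctx_tot = Counter(), Counter()
    for (w, c), n in counts.items():
        word_tot[w] += n
        ctx_tot[c] += n
    total = sum(counts.values())
    # Raising context counts to alpha < 1 dampens very frequent contexts.
    ctx_smoothed = {c: n ** alpha for c, n in ctx_tot.items()}
    ctx_norm = sum(ctx_smoothed.values())
    scores = {}
    for (w, c), n in counts.items():
        p_wc = n / total
        p_w = word_tot[w] / total
        p_c = ctx_smoothed[c] / ctx_norm
        value = math.log(p_wc / (p_w * p_c)) - math.log(k)
        if value > 0:
            scores[(w, c)] = value
    return scores


counts = cooccurrence_counts(["x", "y", "x", "z"], window=1)
scores = ppmi(counts, k=1, alpha=0.75)
print(sorted(scores))
```

With `k=1` the shift term is zero, so the listed setting reduces to smoothed PPMI without shifting.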

Word2Vec

Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

Syntactic word embeddings

Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

ALP models

Training data

Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

Word2Vec

Software used: Gensim library (Řehůřek and Sojka, 2010)

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
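
The listed hyperparameters use the keyword `size`, which is the name used by Gensim releases before 4.0; from 4.0 onward the same setting is called `vector_size`. The sketch below records the documented values and renames that one key. The source does not state which Gensim version was used, and the names `DOCUMENTED` and `to_gensim4_kwargs` are hypothetical.

```python
# Hyperparameter values as listed for the ALP Word2Vec models.
DOCUMENTED = {
    "cbow": {"size": 30, "window": 5, "min_count": 5, "negative": 20, "sg": 0},
    "sgns": {"size": 30, "window": 5, "min_count": 5, "negative": 20, "sg": 1},
}

# Keyword renamed in Gensim 4.0.
RENAMED_IN_GENSIM_4 = {"size": "vector_size"}


def to_gensim4_kwargs(params):
    """Translate documented parameter names into Gensim >= 4.0 keywords."""
    return {RENAMED_IN_GENSIM_4.get(key, key): value for key, value in params.items()}


cbow_kwargs = to_gensim4_kwargs(DOCUMENTED["cbow"])
sgns_kwargs = to_gensim4_kwargs(DOCUMENTED["sgns"])
print(cbow_kwargs)

# Usage sketch (needs gensim >= 4 and a tokenised corpus, neither included here):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=tokenised_sentences, **cbow_kwargs)
```

The only difference between the two configurations is `sg`, which selects CBOW (0) or skip-gram (1).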

References

Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), 855-864.

Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1), 55-65. https://doi.org/10.1163/24523666-01000013

Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
Replication Data for: "Unraveling spatial, structural, and social...
dataverse.harvard.edu
search.dataone.org
Updated Oct 19, 2023
Cite
Agustin PÁJARO; Ignacio J. DURAN; Pablo RODRIGO (2023). Replication Data for: "Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS" [Dataset]. http://doi.org/10.7910/DVN/SFT3RT
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/SFT3RT
Dataset updated
Oct 19, 2023
Dataset provided by
Harvard Dataverse
Authors
Agustin PÁJARO; Ignacio J. DURAN; Pablo RODRIGO
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data from the article "Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS", by Agustin Pájaro, Ignacio J. Duran and Pablo Rodrigo, published in Revista DADOS, v. 65, n. 3, 2022.
Data from: Towards open data blockchain analytics: a Bitcoin perspective
search.dataone.org
data.niaid.nih.gov
Updated Jun 12, 2025
Cite
Dan McGinn; Douglas McIlwraith; Yike Guo (2025). Towards open data blockchain analytics: a Bitcoin perspective [Dataset]. http://doi.org/10.5061/dryad.h9r0p65
Unique identifier
https://doi.org/10.5061/dryad.h9r0p65
Dataset updated
Jun 12, 2025
Dataset provided by
Dryad Digital Repository
Authors
Dan McGinn; Douglas McIlwraith; Yike Guo
Time period covered
Jul 9, 2018
Description
Bitcoin is the first implementation of a technology that has become known as a 'public permissionless' blockchain. Such systems allow public read/write access to an append-only blockchain database without the need for any mediating central authority. Instead they guarantee access, security and protocol conformity through an elegant combination of cryptographic assurances and game theoretic economic incentives. Not until the advent of the Bitcoin blockchain has such a trusted, transparent, comprehensive and granular data set of digital economic behaviours been available for public network analysis. In this article, by translating the cumbersome binary data structure of the Bitcoin blockchain into a high fidelity graph model, we demonstrate through various analyses the often overlooked social and econometric benefits of employing such a novel open data architecture. Specifically we show (a) how repeated patterns of transaction behaviours can be revealed to link user activity across t...
