81 datasets found
  1. Data from: Results obtained in a data mining process applied to a database...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    E.M. Ruiz Lobaina; C. P. Romero Suárez (2023). Results obtained in a data mining process applied to a database containing bibliographic information concerning four segments of science. [Dataset]. http://doi.org/10.6084/m9.figshare.20011798.v1
    Available download formats
    jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO: http://www.scielo.org/
    Authors
    E.M. Ruiz Lobaina; C. P. Romero Suárez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The objective of this work is to improve the quality of the information in CubaCiencia, a database of the Institute of Scientific and Technological Information. This database holds bibliographic information on four segments of science and is the main database of the Library Management System. The methodology was based on Decision Trees, the Correlation Matrix, the 3D Scatter Plot, and other data mining techniques used to study large volumes of information. The results not only made it possible to improve the information in the database, but also provided useful patterns for meeting the proposed objectives.

  2. Data Mining and Knowledge Discovery - impact-factor

    • exaly.com
    csv, json
    Updated Nov 1, 2025
    Cite
    (2025). Data Mining and Knowledge Discovery - impact-factor [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery
    Available download formats
    csv, json
    Dataset updated
    Nov 1, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The graph shows the changes in the impact factor of Data Mining and Knowledge Discovery and its corresponding percentile, for comparison with the entire literature. Impact Factor is the most common scientometric index; it is defined as the number of citations to papers published in the two preceding years divided by the number of papers published in those years.
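
The definition above reduces to a single ratio. The following is a minimal sketch, assuming hypothetical citation and paper counts; the function name and the numbers are illustrative and are not taken from the journal's data.

```python
def impact_factor(citations_to_prior_two_years: int, papers_in_prior_two_years: int) -> float:
    """Citations received by papers from the two preceding years,
    divided by the number of papers published in those years."""
    if papers_in_prior_two_years <= 0:
        raise ValueError("paper count must be positive")
    return citations_to_prior_two_years / papers_in_prior_two_years


# Hypothetical journal: 120 papers over the two preceding years, cited 450 times.
example_if = impact_factor(450, 120)
print(f"impact factor = {example_if:.2f}")
```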

  3. Knowledge Discovery in Databases Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). Knowledge Discovery in Databases Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-in-databases-market
    Available download formats
    pptx, csv, pdf
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery in Databases Market Outlook



    According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.
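
The growth figures in this paragraph can be related to each other with the usual compound annual growth rate (CAGR) arithmetic. The sketch below is illustrative only: the helper names are made up, and the nine-year horizon (2024 to 2033) is an assumption, since the description does not say which years the forecast period spans. Depending on that choice, the quoted start value, rate, and end value may not reconcile exactly.

```python
def project(value: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a constant annual growth rate."""
    return value * (1.0 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes `start` to `end` in `years` years."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1.0 / years) - 1.0


# Figures quoted in the description (USD billions); the 9-year horizon is an assumption.
projected_2033 = project(8.7, 0.185, 9)
rate_needed = implied_cagr(8.7, 44.9, 9)
print(f"8.7 bn compounded at 18.5% for 9 years: {projected_2033:.1f} bn")
print(f"rate implied by 8.7 bn -> 44.9 bn over 9 years: {rate_needed:.1%}")
```

Running the same two helpers with a different horizon shows how sensitive the end value is to the assumed number of compounding years.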



    The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.



    Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.



    The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.



    From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.





    Component Analysis



    The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma

  4. Knowledge Discovery In Databases Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Knowledge Discovery In Databases Market Research Report 2033 [Dataset]. https://dataintelo.com/report/knowledge-discovery-in-databases-market
    Available download formats
    pptx, csv, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery in Databases (KDD) Market Outlook




    According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecasted to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. As per our latest research, organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.




    A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, Internet of Things (IoT), and cloud computing is further amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.




    Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.




    The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.




    From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.



    Component Analysis




    The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the

  5. List of Top Authors of Data Mining and Knowledge Discovery sorted by...

    • exaly.com
    csv, json
    Updated Nov 1, 2025
    Cite
    (2025). List of Top Authors of Data Mining and Knowledge Discovery sorted by articles [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery
    Available download formats
    json, csv
    Dataset updated
    Nov 1, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    List of Top Authors of Data Mining and Knowledge Discovery sorted by articles.

  6. kdd cyberattack

    • kaggle.com
    zip
    Updated Jul 28, 2018
    Cite
    Ziyad Mestour (2018). kdd cyberattack [Dataset]. https://www.kaggle.com/slashtea/kdd-cyberattack
    Available download formats
    zip (2298343 bytes)
    Dataset updated
    Jul 28, 2018
    Authors
    Ziyad Mestour
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

    Content

    For more information about the contents, see http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

    Acknowledgements

    The dataset is shared on Kaggle on behalf of KDD's work.

    Inspiration

    Build a classifier capable of distinguishing between attacks and normal connections.
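
Before any classifier can be trained for this task, each connection record has to be given a binary target. The sketch below shows one way to do that. It assumes that the class label is the last comma-separated field of a record and may end with a period; the sample records are made up and shortened, and the function name is hypothetical.

```python
from collections import Counter


def to_binary_label(record_line: str) -> str:
    """Return 'normal' or 'attack' for one comma-separated connection record.

    Assumption: the class label is the last field and may end with a period.
    """
    raw_label = record_line.strip().split(",")[-1].rstrip(".")
    return "normal" if raw_label == "normal" else "attack"


# Made-up, shortened records (real records carry many more fields).
sample_records = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,tcp,private,S0,neptune.",
]
label_counts = Counter(to_binary_label(line) for line in sample_records)
print(label_counts)
```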

  7. List of Top Institutions of Data Mining and Knowledge Discovery sorted by...

    • exaly.com
    csv, json
    Updated Nov 1, 2025
    Cite
    (2025). List of Top Institutions of Data Mining and Knowledge Discovery sorted by citations [Dataset]. https://exaly.com/journal/23379/data-mining-and-knowledge-discovery/top-institutions
    Available download formats
    json, csv
    Dataset updated
    Nov 1, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    List of Top Institutions of Data Mining and Knowledge Discovery sorted by citations.

  8. Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jun 23, 2025
    Cite
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. http://doi.org/10.1021/acsmaterialsau.5c00030.s001
    Available download formats
    xlsx
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.

  9. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes that can be found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    2. Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    5. Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

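    Support = number of occurrences in which the statement is true divided by the total number of statements.

    Confidence = support of the statement divided by the support of the premise (equivalently, the number of occurrences in which the statement is true divided by the number of occurrences of the premise).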
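
The two measures can be computed directly from a list of records. The sketch below is a plain Python illustration, not the authors' RapidMiner pipeline; the records, attribute values, and function names are hypothetical and do not come from the DL4SE data.

```python
from typing import FrozenSet, List

Record = FrozenSet[str]


def support(records: List[Record], items: Record) -> float:
    """Fraction of records that contain every item in `items`."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)


def confidence(records: List[Record], premise: Record, conclusion: Record) -> float:
    """support(premise and conclusion) divided by support(premise)."""
    premise_support = support(records, premise)
    if premise_support == 0:
        raise ValueError("premise never occurs")
    return support(records, premise | conclusion) / premise_support


# Hypothetical papers described by nominal attributes (not the DL4SE data).
papers = [
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "reproducible"}),
    frozenset({"unsupervised", "reproducible"}),
]
rule_support = support(papers, frozenset({"supervised", "irreproducible"}))
rule_confidence = confidence(papers, frozenset({"supervised"}), frozenset({"irreproducible"}))
print(rule_support, rule_confidence)
```

In this toy data the rule "supervised implies irreproducible" holds in two of four records, and the premise occurs in three of them.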

  10. Data from: KDD Cup 1999 Data

    • impactcybertrust.org
    • kaggle.com
    Updated Jan 19, 2019
    Cite
    External Data Source (2019). KDD Cup 1999 Data [Dataset]. http://doi.org/10.23721/100/1478801
    Dataset updated
    Jan 19, 2019
    Authors
    External Data Source
    Description

    This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

    The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

    Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

    The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

  11. Additional file 1 of Learning from biomedical linked data to suggest valid...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jun 1, 2023
    Cite
    Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet (2023). Additional file 1 of Learning from biomedical linked data to suggest valid pharmacogenes [Dataset]. http://doi.org/10.6084/m9.figshare.c.3747806_D1.v1
    Available download formats
    txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPARQL query example 1. This text file contains the SPARQL query we apply on our PGx linked data to obtain the data graph represented in Fig. 3. This query includes the definition of prefixes mentioned in Figs. 2 and 3. This query takes about 30 s on our https://pgxlod.loria.fr server. (TXT 2 kb)

  12. Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://catalog.data.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.

  13. Additional file 2 of Learning from biomedical linked data to suggest valid...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated May 31, 2023
    Cite
    Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet (2023). Additional file 2 of Learning from biomedical linked data to suggest valid pharmacogenes [Dataset]. http://doi.org/10.6084/m9.figshare.c.3747806_D2.v1
    Available download formats
    txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Kevin Dalleau; Yassine Marzougui; Sébastien Da Silva; Patrice Ringot; Ndeye Coumba Ndiaye; Adrien Coulet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPARQL query example 2. This text file contains an example SPARQL query for exploring the vicinity of an entity. This particular query returns the RDF graph surrounding, within a length of 4, the node pharmgkb:PA451906, which represents warfarin, an anticoagulant drug. (TXT 392 bytes)
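
The idea of "the graph surrounding a node within a length of 4" can be illustrated without a SPARQL endpoint. The sketch below is an in-memory analogue in Python, not the authors' query: it treats the RDF graph as a list of (subject, predicate, object) triples and collects every triple reachable within a given number of hops, following edges in either direction. Apart from the node identifier pharmgkb:PA451906, all node and predicate names are made up.

```python
from collections import deque
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def vicinity(triples: Iterable[Triple], start: str, max_hops: int) -> Set[Triple]:
    """Collect triples reachable from `start` within `max_hops` edges,
    following edges in either direction (breadth-first search)."""
    triples = list(triples)
    seen_nodes = {start}
    found: Set[Triple] = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand nodes that are already max_hops away
        for s, p, o in triples:
            if node == s or node == o:
                found.add((s, p, o))
                neighbour = o if node == s else s
                if neighbour not in seen_nodes:
                    seen_nodes.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return found


# Toy chain of six nodes; only the first four edges lie within 4 hops of the start.
toy_graph = [
    ("pharmgkb:PA451906", "ex:linked", "ex:n1"),
    ("ex:n1", "ex:linked", "ex:n2"),
    ("ex:n2", "ex:linked", "ex:n3"),
    ("ex:n3", "ex:linked", "ex:n4"),
    ("ex:n4", "ex:linked", "ex:n5"),
]
nearby = vicinity(toy_graph, "pharmgkb:PA451906", 4)
print(len(nearby))
```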

  14. KDD-99 Original dataset

    • kaggle.com
    zip
    Updated Aug 13, 2025
    Cite
    nagi (2025). KDD-99 Original dataset [Dataset]. https://www.kaggle.com/datasets/primus11/kdd-99-original-dataset
    Available download formats
    zip (19081776 bytes)
    Dataset updated
    Aug 13, 2025
    Authors
    nagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    KDD Cup 1999 Dataset

    The KDD Cup 1999 dataset is one of the earliest and most widely used benchmark datasets for network intrusion detection research.
    It was created for the Third International Knowledge Discovery and Data Mining Tools Competition, hosted by the UCI KDD Archive, using network traffic captured in a simulated military environment at the MIT Lincoln Laboratory. The dataset contains both normal and malicious traffic, with attacks grouped into four main categories: Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R).

    Key Characteristics

    • Simulated Traffic Environment: Network traffic was generated in a controlled environment to replicate a military network under attack.
    • Attack Categories:
      • DoS: e.g., smurf, neptune, teardrop
      • Probe: e.g., satan, nmap, ipsweep
      • R2L: e.g., guess_passwd, ftp_write, imap
      • U2R: e.g., buffer_overflow, rootkit, perl
    • Data Capture: Raw TCP dump data was processed into connection records.
    • Feature Extraction: Each record contains 41 features, including:
      • Basic features: Duration, protocol type, service, flag
      • Content features: Failed login counts, number of file creations
      • Traffic features: Connection counts within time windows, percentage of specific connections
    • Labeling: Each record is labeled as normal or as one of the specific attack types.
    • Data Volume: Around 4.9 million records in the full dataset; a 10% subset is also available.

    Advantages

    • Established as a historical benchmark in IDS research.
    • Covers multiple attack categories for classification tasks.
    • Suitable for binary classification (normal vs. attack) and multi-class classification (attack type identification).

    Limitations

    • Contains high redundancy (~78% repeated records) which can bias model performance.
    • Traffic patterns are outdated and may not reflect modern threats.
    • Imbalanced distribution of attack categories.

    Usage

    The KDD Cup 1999 dataset has been extensively used in academia for evaluating IDS algorithms due to its:

    • Large size and labeled structure
    • Multiple attack types
    • Historical significance in the development of intrusion detection systems
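    The points above about redundancy and the two classification settings can be sketched as follows. This is a minimal illustration, not a loader for the actual files: the records are hypothetical toy tuples, the mapping covers only the attack names listed in the description, and stripping a trailing period from labels is an assumption about the raw file format.

```python
# Attack names are the examples listed in the description above;
# the full dataset contains more attack types than these.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS", "teardrop": "DoS",
    "satan": "Probe", "nmap": "Probe", "ipsweep": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "perl": "U2R",
}

def clean(label):
    """Normalise a label; the trailing period is an assumed file quirk."""
    return label.strip().rstrip(".")

def deduplicate(records):
    """Drop exact repeats while keeping first-seen order."""
    return list(dict.fromkeys(records))

def binary_label(label):
    """Binary setting: normal vs. attack."""
    return "normal" if clean(label) == "normal" else "attack"

def category_label(label):
    """Multi-class setting: normal or one of the four attack categories."""
    name = clean(label)
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(name, "unknown")

# Hypothetical toy records: (duration, protocol_type, service, flag, label).
records = [
    (0, "icmp", "ecr_i", "SF", "smurf."),
    (0, "icmp", "ecr_i", "SF", "smurf."),
    (0, "tcp", "http", "SF", "normal."),
    (2, "tcp", "ftp", "SF", "ftp_write."),
]
unique = deduplicate(records)
labels = [(binary_label(r[-1]), category_label(r[-1])) for r in unique]
```

    Deduplicating before splitting into training and test sets is one common way to limit the bias that repeated records introduce.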

  15. DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    + more versions
    Cite
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi (2023). DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00934.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput ‘omics’ data has necessitated a need for tools and platforms that allow the researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements have included: an enhanced coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We have also described two usage examples on trans-omics data analysis and extraction of gene-disease associations using MeSH term descriptors. These examples have demonstrated how the newer enhancements in TargetMine have contributed to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.

  16. Knowledge Discovery Platform Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Cite
    Growth Market Reports (2025). Knowledge Discovery Platform Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-platform-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery Platform Market Outlook



    According to our latest research, the global Knowledge Discovery Platform market size in 2024 stands at USD 17.2 billion, reflecting robust adoption across industries. The market is experiencing a strong growth momentum, with a compound annual growth rate (CAGR) of 18.5% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 89.7 billion. This rapid expansion is primarily driven by escalating data volumes, the imperative for actionable business intelligence, and the proliferation of artificial intelligence and machine learning technologies. As organizations seek to harness the power of big data for competitive advantage, the demand for advanced Knowledge Discovery Platforms continues to surge globally.
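    The relationship between a base value, a compound annual growth rate, and a forecast can be sketched as below. The dollar figures and the 18.5% rate are the ones quoted above; the number of compounding periods is an assumption, since the description does not say which year the compounding starts from.

```python
def project(base, rate, periods):
    """Value after compounding `base` at `rate` for `periods` years."""
    return base * (1.0 + rate) ** periods

def implied_cagr(start, end, periods):
    """Annual growth rate that takes `start` to `end` in `periods` years."""
    return (end / start) ** (1.0 / periods) - 1.0

# Figures quoted in the description; `periods` is an assumption.
base_2024 = 17.2      # USD billion
forecast_2033 = 89.7  # USD billion
periods = 9           # 2024 -> 2033, assumed

projected = project(base_2024, 0.185, periods)
rate = implied_cagr(base_2024, forecast_2033, periods)
```

    Varying `periods` shows how sensitive the end value is to the choice of base year.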




    One of the principal growth factors propelling the Knowledge Discovery Platform market is the exponential increase in data generated by enterprises, governments, and consumers. The digital transformation wave has resulted in data being produced at an unprecedented rate, from social media interactions to IoT devices, transactional records, and digital documents. Organizations are under mounting pressure to extract meaningful insights from this sea of information to inform strategic decisions, optimize operations, and enhance customer experiences. Knowledge Discovery Platforms, equipped with sophisticated data mining, text analytics, and visualization tools, enable businesses to uncover hidden patterns, trends, and correlations within massive datasets. This capability is particularly vital in sectors such as BFSI, healthcare, and retail, where timely and accurate insights can directly impact profitability and risk management.




    Another significant driver is the growing integration of artificial intelligence and machine learning algorithms into Knowledge Discovery Platforms. These intelligent systems automate complex analytical processes, reducing the reliance on manual data exploration and accelerating time-to-insight. Predictive analytics functionalities, for example, empower organizations to anticipate market trends, customer behaviors, and operational risks with greater precision. As AI and ML technologies mature, their seamless incorporation into knowledge discovery workflows enhances the platforms' ability to handle unstructured data, perform sentiment analysis, and support real-time decision-making. The increasing availability of cloud-based solutions further democratizes access, enabling even small and medium enterprises to leverage advanced analytics without heavy upfront investments in infrastructure.




    The regulatory landscape and the need for compliance are also fueling the adoption of Knowledge Discovery Platforms. Industries such as banking, healthcare, and government face stringent requirements around data governance, privacy, and reporting. Advanced platforms help organizations maintain compliance by providing traceable, auditable insights and supporting data lineage tracking. Moreover, the rise of explainable AI and transparent analytics has become crucial for organizations seeking to build trust with regulators, partners, and customers. As regulations evolve to address new data privacy and security concerns, the role of robust knowledge discovery solutions in ensuring organizational resilience and accountability becomes even more pronounced.




    From a regional perspective, North America leads the market, driven by early technology adoption, a strong presence of leading vendors, and high enterprise IT spending. Europe follows closely, with substantial investments in digital transformation and data-driven initiatives across key sectors. The Asia Pacific region is witnessing the fastest growth, propelled by rapid industrialization, expanding digital infrastructure, and government-led smart initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, supported by increasing awareness of data-driven decision-making and the gradual modernization of business processes. Each region presents unique opportunities and challenges, shaped by local regulatory environments, technological readiness, and industry dynamics.



    Data Mining Tools are integral to the functionality of Knowledge Discovery Platforms, offering organizations the ability to process and analyze vast amoun

  17. Data from: Identification of patterns for increasing production with...

    • scielo.figshare.com
    jpeg
    Updated Jun 1, 2023
    Cite
    Paulo Rodrigues Peloia; Felipe Ferreira Bocca; Luiz Henrique Antunes Rodrigues (2023). Identification of patterns for increasing production with decision trees in sugarcane mill data [Dataset]. http://doi.org/10.6084/m9.figshare.7899809.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Paulo Rodrigues Peloia; Felipe Ferreira Bocca; Luiz Henrique Antunes Rodrigues
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    ABSTRACT: Sugarcane mills in Brazil collect a vast amount of data relating to production on an annual basis. The analysis of this type of database is complex, especially when factors relating to varieties, climate, detailed management techniques, and edaphic conditions are taken into account. The aim of this paper was to perform a decision tree analysis of a detailed database from a production unit and to evaluate the actionable patterns found in terms of their usefulness for increasing production. The decision tree revealed interpretable patterns relating to sugarcane yield (R² = 0.617), some of which were actionable and had been previously studied and reported in the literature. Based on two actionable patterns relating to soil chemistry, interventions that would increase production by almost 2% were suitable for recommendation. The method was successful in reproducing expert knowledge of the factors that influence sugarcane yield, and the decision trees can support the decision-making process in the context of production and the formulation of hypotheses for specific experiments.

  18. Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Explore at:
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Barbara
    Silvia
    Malvina
    Saskia
    Nilo
    Authors
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on: hel_ or hellenestic is added to the names of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
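    As a rough illustration of what a PPMI space with these hyperparameters contains, the sketch below computes shifted PPMI with context-distribution smoothing from raw co-occurrence counts. It is not the LSCDetection code: reading k as the shift and alpha as the smoothing exponent is an assumption based on the usual formulation, and the word pairs and counts are hypothetical.

```python
import math
from collections import Counter

def ppmi(pair_counts, k=1, alpha=0.75):
    """Shifted PPMI with context-distribution smoothing.

    pair_counts maps (word, context) -> co-occurrence count.
    Only strictly positive scores are kept, giving a sparse result.
    """
    word_totals = Counter()
    ctx_totals = Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        ctx_totals[c] += n
    total = sum(pair_counts.values())
    # Raise context counts to the power alpha, then renormalise.
    ctx_smoothed = {c: n ** alpha for c, n in ctx_totals.items()}
    ctx_norm = sum(ctx_smoothed.values())

    scores = {}
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        p_w = word_totals[w] / total
        p_c = ctx_smoothed[c] / ctx_norm
        value = math.log(p_wc / (p_w * p_c)) - math.log(k)
        if value > 0:
            scores[(w, c)] = value
    return scores

# Hypothetical counts, as if gathered with a window of 5 tokens.
counts = {
    ("theos", "hieros"): 8,
    ("theos", "kai"): 2,
    ("polis", "kai"): 10,
    ("polis", "hieros"): 1,
}
scores = ppmi(counts, k=1, alpha=0.75)
```

    With k=1 the shift term is zero, so only the smoothing exponent changes the scores relative to plain PPMI.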

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1), 55-65. https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  19. Replication Data for: "Unraveling spatial, structural, and social...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Oct 19, 2023
    Cite
    Agustin PÁJARO; Ignacio J. DURAN; Pablo RODRIGO (2023). Replication Data for: "Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS" [Dataset]. http://doi.org/10.7910/DVN/SFT3RT
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Agustin PÁJARO; Ignacio J. DURAN; Pablo RODRIGO
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data from the article "Unraveling spatial, structural, and social country-level conditions for the emergence of the foreign fighter phenomenon: an exploratory data mining approach to the case of ISIS", by Agustin Pájaro, Ignacio J. Duran and Pablo Rodrigo, published in Revista DADOS, v. 65, n. 3, 2022.

  20. Data from: Towards open data blockchain analytics: a Bitcoin perspective

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Jun 12, 2025
    Cite
    Dan McGinn; Douglas McIlwraith; Yike Guo (2025). Towards open data blockchain analytics: a Bitcoin perspective [Dataset]. http://doi.org/10.5061/dryad.h9r0p65
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Dan McGinn; Douglas McIlwraith; Yike Guo
    Time period covered
    Jul 9, 2018
    Description

    Bitcoin is the first implementation of a technology that has become known as a 'public permissionless' blockchain. Such systems allow public read/write access to an append-only blockchain database without the need for any mediating central authority. Instead they guarantee access, security and protocol conformity through an elegant combination of cryptographic assurances and game theoretic economic incentives. Not until the advent of the Bitcoin blockchain has such a trusted, transparent, comprehensive and granular data set of digital economic behaviours been available for public network analysis. In this article, by translating the cumbersome binary data structure of the Bitcoin blockchain into a high fidelity graph model, we demonstrate through various analyses the often overlooked social and econometric benefits of employing such a novel open data architecture. Specifically we show (a) how repeated patterns of transaction behaviours can be revealed to link user activity across t...
