32 datasets found
  1. Knowledge Discovery In Databases Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Knowledge Discovery In Databases Market Research Report 2033 [Dataset]. https://dataintelo.com/report/knowledge-discovery-in-databases-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery in Databases (KDD) Market Outlook




    According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecast to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. Organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.
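The headline figures above can be sanity-checked with the standard compound-growth formula. The sketch below is illustrative and not part of the report: the function names `project` and `implied_cagr` are hypothetical, and the number of compounding periods is an assumption, because the description does not say how the forecast window is counted.

```python
def project(start_value: float, cagr: float, periods: int) -> float:
    """Compound start_value at the annual rate cagr for the given number of periods."""
    return start_value * (1.0 + cagr) ** periods


def implied_cagr(start_value: float, end_value: float, periods: int) -> float:
    """Return the constant annual growth rate that links start_value to end_value."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


if __name__ == "__main__":
    # Figures quoted in the description (USD billion, CAGR as a fraction).
    start, end, rate = 9.6, 53.2, 0.187
    # Assumption: the 2024 to 2033 window may be counted as 9 or 10 periods.
    for periods in (9, 10):
        projected = project(start, rate, periods)
        needed = implied_cagr(start, end, periods)
        print(f"{periods} periods: projected {projected:.1f}, implied CAGR {needed:.3f}")
```

Under these assumptions, ten compounding periods come closer to the quoted USD 53.2 billion than nine do, which suggests how the report counts its forecast window.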




    A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, the Internet of Things (IoT), and cloud computing is amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.




    Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.




    The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.




    From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.



    Component Analysis




    The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the

  2. Knowledge Discovery in Databases Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). Knowledge Discovery in Databases Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-in-databases-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery in Databases Market Outlook



    According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.



    The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.



    Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.



    The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.



    From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.





    Component Analysis



    The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma

  3. Data from: KDD Cup 1999 Data

    • impactcybertrust.org
    • kaggle.com
    Updated Jan 19, 2019
    Cite
    External Data Source (2019). KDD Cup 1999 Data [Dataset]. http://doi.org/10.23721/100/1478801
    Explore at:
    Dataset updated
    Jan 19, 2019
    Authors
    External Data Source
    Description

    This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

    The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

    Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

    The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

  4. kdd cyberattack

    • kaggle.com
    zip
    Updated Jul 28, 2018
    Cite
    Ziyad Mestour (2018). kdd cyberattack [Dataset]. https://www.kaggle.com/slashtea/kdd-cyberattack
    Explore at:
    Available download formats: zip (2298343 bytes)
    Dataset updated
    Jul 28, 2018
    Authors
    Ziyad Mestour
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

    Content

    For more information about the contents refer to this link http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

    Acknowledgements

    The dataset is shared on Kaggle on behalf of KDD's work.

    Inspiration

    Build a classifier capable of distinguishing between attacks and normal connections.

  5. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which can be found in the repository. We manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    5. Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
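The dimensionality reduction described in the Transformation stage can be illustrated outside RapidMiner. The NumPy sketch below is not the authors' pipeline; it is an assumed minimal PCA via singular value decomposition that projects a feature matrix onto two components. The function name `pca_project` and the random 35-column matrix are hypothetical stand-ins, and the sketch assumes the nominal features have already been encoded numerically.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2):
    """Project the rows of `features` onto the top principal components.

    Returns the projected coordinates and the fraction of variance
    explained by each retained component.
    """
    # Center each column so the SVD describes variance around the mean.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 50 papers, each with 35 numerically encoded features.
    papers = rng.random((50, 35))
    coords, ratio = pca_project(papers)
    print(coords.shape, ratio)
```

The two-column output can be passed to any scatter plot for visualization; the explained-variance ratios indicate how much structure the two components retain.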

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement is true divided by the total number of statements.

    Confidence = the support of the statement divided by the support of the premise.
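The two measures above can be made concrete with a small sketch. It uses the standard association-rule definitions, under which confidence is the support of premise and conclusion together divided by the support of the premise. The function names and the four toy "papers" are hypothetical and are not taken from the DL4SE repository.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    hits = sum(1 for transaction in transactions if wanted <= transaction)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    premise: Iterable[str],
    conclusion: Iterable[str],
) -> float:
    """Support of premise and conclusion together, divided by support of the premise."""
    premise_set = frozenset(premise)
    both = support(transactions, premise_set | frozenset(conclusion))
    return both / support(transactions, premise_set)


if __name__ == "__main__":
    # Hypothetical attribute sets, one per paper.
    papers = [
        frozenset({"supervised", "irreproducible"}),
        frozenset({"supervised", "irreproducible"}),
        frozenset({"supervised", "reproducible"}),
        frozenset({"unsupervised", "reproducible"}),
    ]
    print(support(papers, {"supervised", "irreproducible"}))
    print(confidence(papers, {"supervised"}, {"irreproducible"}))
```

In this toy data the rule "supervised implies irreproducible" holds in two of four papers (support 0.5) and in two of the three papers that satisfy the premise (confidence of about 0.67).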

  6. Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jun 23, 2025
    Cite
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. http://doi.org/10.1021/acsmaterialsau.5c00030.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.

  7. KDD-99 Original dataset

    • kaggle.com
    zip
    Updated Aug 13, 2025
    Cite
    nagi (2025). KDD-99 Original dataset [Dataset]. https://www.kaggle.com/datasets/primus11/kdd-99-original-dataset
    Explore at:
    Available download formats: zip (19081776 bytes)
    Dataset updated
    Aug 13, 2025
    Authors
    nagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    KDD Cup 1999 Dataset

    The KDD Cup 1999 dataset is one of the earliest and most widely used benchmark datasets for network intrusion detection research.
    It was created for the Third International Knowledge Discovery and Data Mining Tools Competition, hosted by the UCI KDD Archive, using network traffic captured in a simulated military environment at the MIT Lincoln Laboratory. The dataset contains both normal and malicious traffic, with attacks grouped into four main categories: Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R).

    Key Characteristics

    • Simulated Traffic Environment: Network traffic was generated in a controlled environment to replicate a military network under attack.
    • Attack Categories:
      • DoS: e.g., smurf, neptune, teardrop
      • Probe: e.g., satan, nmap, ipsweep
      • R2L: e.g., guess_passwd, ftp_write, imap
      • U2R: e.g., buffer_overflow, rootkit, perl
    • Data Capture: Raw TCP dump data was processed into connection records.
    • Feature Extraction: Each record contains 41 features, including:
      • Basic features: Duration, protocol type, service, flag
      • Content features: Failed login counts, number of file creations
      • Traffic features: Connection counts within time windows, percentage of specific connections
    • Labeling: Each record is labeled as normal or as one of the specific attack types.
    • Data Volume: Around 4.9 million records in the full dataset; a 10% subset is also available.
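The labeling scheme listed above can be sketched as a small lookup that supports both binary and multi-class tasks. Only the example attacks named in the list are mapped, so the table is deliberately incomplete. The function names are hypothetical, and the stripping of a trailing period is an assumption about how labels appear in the raw files.

```python
# Example attacks taken from the category list above; not an exhaustive table.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS", "teardrop": "DoS",
    "satan": "Probe", "nmap": "Probe", "ipsweep": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "perl": "U2R",
}


def to_category(label: str) -> str:
    """Map a raw label to 'normal', one of the four categories, or 'unknown'."""
    # Assumption: raw labels may carry whitespace and a trailing period.
    name = label.strip().rstrip(".")
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(name, "unknown")


def to_binary(label: str) -> str:
    """Collapse every non-normal label into a single 'attack' class."""
    return "normal" if to_category(label) == "normal" else "attack"


if __name__ == "__main__":
    for raw in ["normal.", "smurf.", "rootkit", "some_new_attack."]:
        print(raw, to_category(raw), to_binary(raw))
```

Keeping the category map separate from the binary collapse lets the same labels feed either classification setup without re-reading the data.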

    Advantages

    • Established as a historical benchmark in IDS research.
    • Covers multiple attack categories for classification tasks.
    • Suitable for binary classification (normal vs. attack) and multi-class classification (attack type identification).

    Limitations

    • Contains high redundancy (~78% repeated records) which can bias model performance.
    • Traffic patterns are outdated and may not reflect modern threats.
    • Imbalanced distribution of attack categories.
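The redundancy limitation is usually handled by removing exact duplicate records before training. The following is a minimal sketch under the assumption that each record is a sequence of already-parsed field values; the function names and the toy rows are hypothetical.

```python
from typing import Hashable, List, Sequence


def drop_duplicates(records: List[Sequence[Hashable]]) -> List[Sequence[Hashable]]:
    """Return records with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def redundancy_ratio(records: List[Sequence[Hashable]]) -> float:
    """Fraction of records that are repeats of an earlier record."""
    if not records:
        return 0.0
    return 1.0 - len(drop_duplicates(records)) / len(records)


if __name__ == "__main__":
    # Hypothetical rows: (duration, protocol_type, service, label).
    rows = [
        (0, "icmp", "ecr_i", "smurf."),
        (0, "icmp", "ecr_i", "smurf."),
        (0, "icmp", "ecr_i", "smurf."),
        (2, "tcp", "http", "normal."),
    ]
    print(len(drop_duplicates(rows)), redundancy_ratio(rows))
```

Measuring the ratio before and after deduplication makes it easier to see how much a model's reported accuracy depends on repeated records.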

    Usage

    The KDD Cup 1999 dataset has been extensively used in academia for evaluating IDS algorithms due to its:

    • Large size and labeled structure
    • Multiple attack types
    • Historical significance in the development of intrusion detection systems

  8. Discovering Precursors to Aviation Safety Incidents: KDD 2010

    • data.nasa.gov
    • s.cnmilf.com
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Discovering Precursors to Aviation Safety Incidents: KDD 2010 [Dataset]. https://data.nasa.gov/dataset/discovering-precursors-to-aviation-safety-incidents-kdd-2010
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.

  9. Discovering Precursors to Aviation Safety Incidents: KDD 2010 | gimi9.com

    • gimi9.com
    Cite
    Discovering Precursors to Aviation Safety Incidents: KDD 2010 | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_discovering-precursors-to-aviation-safety-incidents-kdd-2010/
    Explore at:
    Description

    Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.

  10. Datasets from the KDD 2021 article "A Semi-Personalized System for User Cold...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2021
    Cite
    Léa Briand; Guillaume Salha-Galvan; Walid Bendada; Mathieu Morlon; Viet-Anh Tran (2021). Datasets from the KDD 2021 article "A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5121673
    Explore at:
    Dataset updated
    Jul 23, 2021
    Dataset provided by
    Deezer Research
    Authors
    Léa Briand; Guillaume Salha-Galvan; Walid Bendada; Mathieu Morlon; Viet-Anh Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We publicly release the anonymized song_embeddings.parquet, user_embeddings.parquet, user_features_test.parquet, user_features_train.parquet, and user_features_validation.parquet datasets, with each of the TT-SVD or UT-ALS versions of embeddings, from the music streaming platform Deezer, as described in the article "A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps" published in the proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021). The paper is available here.

    These datasets are used in the GitHub repository deezer/semi_perso_user_cold_start to reproduce experiments from the article.

    Please cite our paper if you use our code or data in your work.

  11. Gowalla Checkins

    • kaggle.com
    zip
    Updated Nov 15, 2017
    Cite
    bqlearner (2017). Gowalla Checkins [Dataset]. https://www.kaggle.com/bqlearner/gowalla-checkins
    Explore at:
    Available download formats: zip (105113346 bytes)
    Dataset updated
    Nov 15, 2017
    Authors
    bqlearner
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Gowalla is a location-based social networking website where users share their locations by checking-in.

    Content

    Time and location information of check-ins made by users.

    Acknowledgements

    This data set is available from https://snap.stanford.edu/data/loc-gowalla.html

    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.

  12. StreamSpot Dataset

    • dataverse.harvard.edu
    • search.dataone.org
    application/x-gzip
    Updated Oct 2, 2018
    Cite
    Harvard Dataverse (2018). StreamSpot Dataset [Dataset]. http://doi.org/10.7910/DVN/83KYJY
    Explore at:
    Available download formats: application/x-gzip (87860186), application/x-gzip (105412252), application/x-gzip (45293846), application/x-gzip (15951854), application/x-gzip (13425589), application/x-gzip (44778745), application/x-gzip (114450251)
    Dataset updated
    Oct 2, 2018
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The original edge-list data is credited to: Emaad Manzoor, Sadegh M. Milajerdi and Leman Akoglu. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2016.

  13. Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Explore at:
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Barbara
    Silvia
    Saskia
    Nilo
    Malvina
    Authors
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
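To make the count-based hyperparameters more concrete, here is a minimal pure-Python sketch of a PPMI space. It is not the LSCDetection implementation. It assumes the common reading of the parameters: `window` is a symmetric context window, `k` is a shift subtracted as `log(k)`, and `alpha` smooths the context distribution. The function name `ppmi_matrix` is hypothetical.

```python
import math
from collections import Counter


def ppmi_matrix(tokens, window=5, k=1, alpha=0.75):
    """Return {(word, context): PPMI score} for a list of tokens.

    Assumed formulation (illustrative, not taken from LSCDetection):
    PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P_alpha(c))) - log(k))
    """
    # Count co-occurrences inside a symmetric window around each token.
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(word, tokens[j])] += 1

    # Marginal counts for target words and for contexts.
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    total = sum(pair_counts.values())
    # Context distribution smoothing: raise context counts to the power alpha.
    smoothed_total = sum(n ** alpha for n in context_counts.values())

    scores = {}
    for (word, context), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[word] / total
        p_context = (context_counts[context] ** alpha) / smoothed_total
        value = math.log(p_joint / (p_word * p_context)) - math.log(k)
        if value > 0:  # keep only positive entries (the "P" in PPMI)
            scores[(word, context)] = value
    return scores


toy_scores = ppmi_matrix(["a", "b", "c", "d"], window=1, k=1, alpha=1.0)
```

Only positive entries are stored, so the result is sparse. A larger `k` removes weakly associated pairs, and `alpha` below 1 reduces the scores of rare contexts.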

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  14. HiDF: A Human-Indistinguishable Deepfake Dataset

    • zenodo.org
    csv, zip
    Updated Jul 30, 2025
    Cite
    Chaewon Kang; Seoyoon Jeong; Jonghyun Lee (2025). HiDF: A Human-Indistinguishable Deepfake Dataset [Dataset]. http://doi.org/10.1145/3711896.3737399
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chaewon Kang; Seoyoon Jeong; Jonghyun Lee
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    HiDF is a high-quality deepfake dataset designed to challenge the limits of current detection models. It contains over 62,000 images and 8,000 videos generated using commercial deepfake tools, all manually curated to be indistinguishable from real content by human evaluators.

    HiDF provides a new benchmark for evaluating the realism and detectability of AI-generated media, and is intended to support the development of more robust and generalizable deepfake detection systems.

    HiDF was introduced in our paper, "HiDF: A Human-Indistinguishable Deepfake Dataset", accepted to The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025).

  15. Autonomous System Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Autonomous System Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-as
    Explore at:
    zip, 94677378 bytes (available download formats)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autonomous systems - Oregon-1

    Dataset information

    9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
    route-views between March 31 2001 and May 26 2001.

    Dataset statistics are calculated for the graphs with the lowest (March 31 2001) and highest (May 26 2001) number of nodes.

    Dataset statistics for graph with lowest number of nodes - 3 31 2001

    Nodes 10670
    Edges 22002
    Nodes in largest WCC 10670 (1.000)
    Edges in largest WCC 22002 (1.000)
    Nodes in largest SCC 10670 (1.000)
    Edges in largest SCC 22002 (1.000)
    Average clustering coefficient 0.4559
    Number of triangles 17144
    Fraction of closed triangles 0.009306
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.5

    Dataset statistics for graph with highest number of nodes - 5 26 2001

    Nodes 11174
    Edges 23409
    Nodes in largest WCC 11174 (1.000)
    Edges in largest WCC 23409 (1.000)
    Nodes in largest SCC 11174 (1.000)
    Edges in largest SCC 23409 (1.000)
    Average clustering coefficient 0.4532
    Number of triangles 19894
    Fraction of closed triangles 0.009636
    Diameter (longest shortest path) 10
    90-percentile effective diameter 4.4

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    * AS peering information inferred from Oregon route-views ...
    oregon1_010331.txt.gz from March 31 2001
    oregon1_010407.txt.gz from April 7 2001
    oregon1_010414.txt.gz from April 14 2001
    oregon1_010421.txt.gz from April 21 2001
    oregon1_010428.txt.gz from April 28 2001
    oregon1_010505.txt.gz from May 05 2001
    oregon1_010512.txt.gz from May 12 2001
    oregon1_010519.txt.gz from May 19 2001
    oregon1_010526.txt.gz from May 26 2001

    NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
    set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26 2001.

    The nodes are uniform across all graphs in the sequence in the UF collection.
    That is, nodes do...

  16. i

    Kyoto 2006+ Dataset

    • impactcybertrust.org
    • kaggle.com
    Updated Nov 1, 2006
    Cite
    External Data Source (2006). Kyoto 2006+ Dataset [Dataset]. http://doi.org/10.23721/100/1478781
    Explore at:
    Dataset updated
    Nov 1, 2006
    Authors
    External Data Source
    Time period covered
    Nov 1, 2006 - Dec 31, 2015
    Area covered
    Kyoto
    Description

    With the rapid evolution and proliferation of botnets, large-scale cyber attacks such as DDoS and spam emails are becoming increasingly dangerous and serious cyber threats. Because of this, network-based security technologies such as Network-based Intrusion Detection Systems (NIDSs), Intrusion Prevention Systems (IPSs), and firewalls have received remarkable attention as a means of defending crucial computer systems, networks, and sensitive information from attackers on the Internet. In particular, there has been much effort towards high-performance NIDSs based on data mining and machine learning techniques. However, there is a fatal problem in that the existing evaluation dataset, called the KDD Cup '99 dataset, cannot reflect current network situations and the latest attack trends. This is because it was generated by simulation over a virtual network many years ago. To the best of our knowledge, there is no alternative evaluation dataset. In this paper, we present a new evaluation dataset, called Kyoto 2006+, built on 3 years of real traffic data (Nov. 2006 - Aug. 2009) obtained from diverse types of honeypots.

  17. Sequential Vote Results of Swiss Referenda

    • zenodo.org
    • data.niaid.nih.gov
    text/x-python, zip
    Updated Aug 28, 2020
    Cite
    Alexander Immer; Victor Kristof; Matthias Grossglauser; Patrick Thiran (2020). Sequential Vote Results of Swiss Referenda [Dataset]. http://doi.org/10.1145/3394486.3403277
    Explore at:
    zip, text/x-python (available download formats)
    Dataset updated
    Aug 28, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Immer; Victor Kristof; Matthias Grossglauser; Patrick Thiran
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Switzerland
    Description

    This repo contains the data introduced in

    Immer, A.*, Kristof, V.*, Grossglauser, M., Thiran, P., Sub-Matrix Factorization for Real-Time Vote Prediction, KDD 2020

    These data were collected from OpenData.Swiss every two minutes on two different referendum vote days: May 19, 2019, and February 9, 2020. We use these data to make real-time predictions of the referenda outcome on www.predikon.ch. We publish here the raw data, as retrieved in JSON format from the API. We also provide a Python script to help scrape the JSON files.

    After unzipping the datasets, you can scrape the data by referendum vote day by doing:

    from scraper import scrape_referenda
    
    # Scrape the data from February 9, 2020.
    data_dir = 'path/to/2020-02-09'
    data = scrape_referenda(data_dir)

    The data variable will be a list of datum dictionaries of the following structure:

    {
     "vote": 6290,
     "municipality": 1,
     "timestamp": "2020-02-09T15:23:10",
     "num_yes": 222,
     "num_no": 482,
     "num_valid": 704,
     "num_total": 709,
     "num_eligible": 1407,
     "yes_percent": 0.3153409090909091,
     "turnout": 0.503909026297086
    }

    The datum is as follows:

    • vote: vote ID as defined by OpenData.Swiss
    • municipality: municipality ID as defined by OpenData.Swiss
    • timestamp: date and time at which the JSON file was published on OpenData.Swiss
    • num_yes: number of "yes" in the municipality
    • num_no: number of "no" in the municipality
    • num_valid: number of valid ballots (the ones counting for the results)
    • num_total: total number of ballots (including invalid ones)
    • num_eligible: number of registered voters
    • yes_percent: percentage of "yes" (computed as `num_yes / num_valid`)
    • turnout: turnout to the vote (computed as `num_total / num_eligible`)
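As a quick sanity check on the field definitions above, the two derived fields can be recomputed from the raw counts. The sketch below uses the example datum shown earlier. The helper name `derived_fields` is hypothetical and is not part of the provided scraper.

```python
import math

# Example datum copied from the structure shown above.
datum = {
    "vote": 6290,
    "municipality": 1,
    "timestamp": "2020-02-09T15:23:10",
    "num_yes": 222,
    "num_no": 482,
    "num_valid": 704,
    "num_total": 709,
    "num_eligible": 1407,
    "yes_percent": 0.3153409090909091,
    "turnout": 0.503909026297086,
}


def derived_fields(d):
    """Recompute (yes_percent, turnout) from the raw ballot counts."""
    yes_percent = d["num_yes"] / d["num_valid"]
    turnout = d["num_total"] / d["num_eligible"]
    return yes_percent, turnout


yes_percent, turnout = derived_fields(datum)
consistent = (
    math.isclose(yes_percent, datum["yes_percent"], rel_tol=1e-9)
    and math.isclose(turnout, datum["turnout"], rel_tol=1e-9)
    and datum["num_yes"] + datum["num_no"] == datum["num_valid"]
)
```

Note that `yes_percent` and `turnout` are stored as fractions between 0 and 1, not as percentages.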

    Don't hesitate to reach out to us if you have any questions!

    To cite this dataset:

    @inproceedings{immer2020submatrix,
     author = {Immer, Alexander and Kristof, Victor and Grossglauser, Matthias and Thiran, Patrick},
     title = {Sub-Matrix Factorization for Real-Time Vote Prediction},
     year = {2020},
     booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
    }

  18. criteo-attribution-dataset

    • huggingface.co
    Updated Aug 14, 2017
    Cite
    CRITEO (2017). criteo-attribution-dataset [Dataset]. https://huggingface.co/datasets/criteo/criteo-attribution-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2017
    Dataset provided by
    Criteo (https://criteo.com/)
    Authors
    CRITEO
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Criteo Attribution Modeling for Bidding Dataset

    This dataset is released along with the paper:

    Attribution Modeling Increases Efficiency of Bidding in Display Advertising
    Eustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), Pierre Galland (Criteo)
    *authors contributed equally
    2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017)

    When using this dataset, please cite the paper… See the full description on the dataset page: https://huggingface.co/datasets/criteo/criteo-attribution-dataset.

  19. Location Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Location Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-loc
    Explore at:
    zip, 163822208 bytes (available download formats)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    loc-Brightkite

    https://snap.stanford.edu/data/loc-Brightkite.html

    Dataset information

    Brightkite (http://www.brightkite.com/) was once a location-based social
    networking service provider where users shared their locations by
    checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally
    directed but we have constructed a network with undirected edges when there is a friendship in both ways. We have also collected a total of 4,491,143
    checkins of these users over the period of Apr. 2008 - Oct. 2010.

    Dataset statistics
    Nodes 58,228
    Edges 214,078
    Nodes in largest WCC 56739 (0.974)
    Edges in largest WCC 212945 (0.995)
    Nodes in largest SCC 56739 (0.974)
    Edges in largest SCC 212945 (0.995)
    Average clustering coefficient 0.1723
    Number of triangles 494728
    Fraction of closed triangles 0.03979
    Diameter (longest shortest path) 16
    90-percentile effective diameter 6
    Checkins 4,491,143

    Source (citation)
    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
    in Location-Based Social Networks. ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining (KDD), 2011.
    http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

    Files
    File Description
    loc-brightkite_edges.txt.gz Friendship network of Brightkite users
    loc-brightkite_totalCheckins.txt.gz Time and location information of check-ins made by users

    Example of check-in information

    [user][check-in time]   [latitude] [longitude] [location id]    
    58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    
    58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411    
    58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411    
    58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411    
    58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8    
    58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8    
    58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e    
    58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e    
    58188 2010-04-06T06:45:19Z 46.521389  14.854444 ddaa40aaa22411    
    58188 2008-12-30T15:30:08Z 46.522621  14.849618 58e12bc0d67e11    
    58189 2009-04-08T07:36:46Z 46.554722  15.646667 ddaf9c4ea22411    
    58190 2009-04-08T07:01:28Z 46.421389  15.869722 dd793f96a22411    
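
The check-in rows above can be read with a small parser. This is a minimal sketch, assuming each row holds five whitespace-separated fields in the order shown in the header and that the trailing `Z` means UTC. The names `CheckIn` and `parse_checkin` are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in row; return None if the row lacks any field."""
    fields = line.split()
    if len(fields) != 5:
        return None  # e.g. a row holding a user id but no other data
    user, stamp, lat, lon, location_id = fields
    # Timestamps look like 2008-12-03T21:09:14Z; treat them as UTC (assumption).
    time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), time, float(lat), float(lon), location_id)


row = parse_checkin("58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    ")
```

Returning `None` for incomplete rows lets a caller filter them out with a simple `if row is not None` check.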
    

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The SNAP data set is 0-based, with nodes numbered 0 to 58,227.

    In the SuiteSparse Matrix Collection the graph is converted to 1-based.
    The Problem.A matrix is the undirected friendship network, where
    A(i,j)=1 if person 1+i and person 1+j are friends in the SNAP data set.

    There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
    file, but 6 lines are empty with a user id but no other data (those
    are discarded here). In the SuiteSparse Matrix Collection, the checkin
    data is held in 5 vectors of length 4,747,281. These are in the
    Problem.aux component of the MATLAB struct. The kth entry of each of
    these vectors holds the data in the kth line of the
    loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).

    userid: the SNAP user id is an integer in the range 0 to 58,227. It  
      has been incremented by one, here, to reflect the corresponding  
      row and column of the Problem.A matrix. It contains 51,406    
      unique user id's.                         
    checkin_time: a string of length 20                  
    latitude: a double precision number                  
    longitude: a double precision number                  
    location_id: a string of length 61.
    

    loc-Gowalla

    https://snap.stanford.edu/data/loc-Gowalla.html

    Dataset information

    Gowalla (http://www.gowalla.com/) is a location-based social networking
    website where users share their locations by checking-in. The friendship
    network is undirected and was collected using their public API, and
    consists of 196,591 nodes and 950,327 edges. We have collected a total of
    6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
    2010.

    Dataset statistics
    Nodes 196,591
    Edges 950,327
    Nodes in largest WCC 196591 (1.000)
    Edges in largest WCC 950327 (1.000)
    Nodes in largest SCC 196591 (1.000)
    Edges in largest SCC 950327 (1.000)
    Average clustering coefficient 0.2367
    Number of triangles 2273138
    Fraction of closed triangles 0.007952
    Diameter (longest shortest path) 14
    90-percentile effective diameter 5.7
    Check-ins 6,442,890

    Source (citation)
    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
    in Location-Based Social Networks. ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining (KDD), 2011.
    http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

    Files
    File Description
    loc-gowalla_edges.txt.gz Friendship network of Gowalla users
    loc-gowalla_totalCheckins.txt.gz Time and location information of check-ins made by users

    Example of check-in information

    [user] [check-in time]   [latitude]  [longitude] [location id]  
    196514 2010-07-24T13:45:06Z 53.3648119  -2.2723465833  145064   
    196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017  1275991   
    196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046  376497   
    196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333  98503    
    196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477  1043431   
    196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763  881734   
    196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689  207763   
    196514 2010-07-24T13:41:10Z 53.364905   -2.270824    1042822
    
  20. Citation Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Citation Networks (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-cit
    Explore at:
    zip, 95620457 bytes (available download formats)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-energy physics citation network

    Dataset information

    Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
    e-print arXiv and covers all the citations within a dataset of 34,546 papers
    with 421,578 edges. If a paper i cites paper j, the graph contains a directed
    edge from i to j. If a paper cites, or is cited by, a paper outside the
    dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus
    represents essentially the complete history of its HEP-PH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 34546
    Edges 421578
    Nodes in largest WCC 34401 (0.996)
    Edges in largest WCC 421485 (1.000)
    Nodes in largest SCC 12711 (0.368)
    Edges in largest SCC 139981 (0.332)
    Average clustering coefficient 0.2962
    Number of triangles 1276868
    Fraction of closed triangles 0.1457
    Diameter (longest shortest path) 12
    90-percentile effective diameter 5
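
The parenthesised values in the WCC rows are fractions of the whole graph, for example 34401 / 34546 ≈ 0.996 for nodes. As an illustration, the sketch below computes the node fraction of the largest weakly connected component from a directed edge list using union-find. The toy edge list and the function name `largest_wcc_fraction` are hypothetical and are not part of the SNAP tooling.

```python
def largest_wcc_fraction(edges):
    """Fraction of nodes in the largest weakly connected component.

    `edges` is a non-empty iterable of (source, target) pairs. Edge
    direction is ignored, which is what "weakly connected" means.
    """
    parent = {}

    def find(x):
        # Register unseen nodes, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    # Count how many nodes end up under each root.
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(parent)


# Toy citation graph: papers 1 -> 2 -> 3 form one component, 4 -> 5 another.
toy_fraction = largest_wcc_fraction([(1, 2), (2, 3), (4, 5)])
```

Nodes that appear in no edge are not seen by this sketch, so for real data the node count should come from the full node list.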

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
    Explorations 5(2): 149-151, 2003.

    Files
    File Description
    cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
    cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)

    High-energy physics theory citation network

    Dataset information

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
    arXiv and covers all the citations within a dataset of 27,770 papers with
    352,807 edges. If a paper i cites paper j, the graph contains a directed edge
    from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 27770
    Edges 352807
    Nodes in largest WCC 27400 (0.987) ...

Cite
Dataintelo (2025). Knowledge Discovery In Databases Market Research Report 2033 [Dataset]. https://dataintelo.com/report/knowledge-discovery-in-databases-market

Knowledge Discovery In Databases Market Research Report 2033

Explore at:
pptx, csv, pdf (available download formats)
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License

https://dataintelo.com/privacy-and-policy

Time period covered
2024 - 2032
Area covered
Global
Description

Knowledge Discovery in Databases (KDD) Market Outlook




According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecasted to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. As per our latest research, organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.




A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, Internet of Things (IoT), and cloud computing is further amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.




Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.




The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.




From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.



Component Analysis




The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the
