MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The KDD Cup 1999 dataset is one of the earliest and most widely used benchmark datasets for network intrusion detection research.
It was created for the Third International Knowledge Discovery and Data Mining Tools Competition, hosted by the UCI KDD Archive, using network traffic captured in a simulated military environment at the MIT Lincoln Laboratory. The dataset contains both normal and malicious traffic, with attacks grouped into four main categories: Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R).
The KDD Cup 1999 dataset has been extensively used in academia for evaluating IDS algorithms due to its:
- Large size and labeled structure
- Multiple attack types
- Historical significance in the development of intrusion detection systems
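The four attack categories described above can be sketched as a simple label lookup. This is a minimal illustration, assuming the attack label names commonly documented for the KDD Cup 1999 training data and assuming labels may carry a trailing period; the names ATTACK_CATEGORY and categorize are hypothetical helpers, not part of the dataset.

```python
# Assumed mapping from KDD Cup 1999 training labels to the four categories.
ATTACK_CATEGORY = {
    # Denial of Service
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Probe
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # Remote to Local
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # User to Root
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}


def categorize(label: str) -> str:
    """Map a raw connection label to 'normal', one of the four categories, or 'unknown'."""
    name = label.strip().rstrip(".")  # assumption: raw labels may end with a period
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(name, "unknown")
```

Labels that are not in the table fall back to "unknown" rather than being guessed, which keeps unseen attack names visible during evaluation.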
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
For more information about the contents, see http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
The dataset is shared on Kaggle on behalf of KDD's work.
Build a classifier capable of distinguishing between attacks and normal connections.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 9.6 billion in 2024, propelled by the growing demand for advanced data analytics and intelligent decision-making across industries. The market is expanding at a robust CAGR of 18.7% and is forecasted to reach USD 53.2 billion by 2033. This remarkable growth is driven primarily by the exponential rise in data generation, the adoption of artificial intelligence and machine learning, and the increasing need for actionable insights in real-time environments. As per our latest research, organizations worldwide are leveraging KDD solutions to extract valuable information from massive datasets, thereby fostering innovation, operational efficiency, and competitive advantage.
A significant growth factor for the Knowledge Discovery in Databases market is the rapid digital transformation witnessed across various sectors. Enterprises are increasingly migrating their core operations to digital platforms, resulting in the accumulation of vast amounts of structured and unstructured data. This surge in data volume necessitates advanced analytics tools capable of sifting through complex datasets to uncover hidden patterns, correlations, and anomalies. KDD solutions, encompassing data mining, machine learning algorithms, and visualization tools, are being widely deployed to convert raw data into strategic assets. Furthermore, the integration of KDD with emerging technologies such as big data analytics, Internet of Things (IoT), and cloud computing is further amplifying its adoption, enabling organizations to harness data-driven insights for enhanced decision-making and innovation.
Another major driver fueling the growth of the KDD market is the increasing emphasis on fraud detection, risk management, and regulatory compliance, particularly in sectors like BFSI, healthcare, and government. The proliferation of cyber threats, financial crimes, and regulatory mandates has compelled organizations to invest in sophisticated KDD platforms that can proactively identify suspicious activities and ensure compliance with evolving standards. These solutions leverage advanced algorithms to analyze transactional data in real-time, flagging anomalies and potential risks before they escalate. As a result, businesses are able to mitigate financial losses, safeguard sensitive information, and uphold their reputational integrity in an increasingly complex regulatory landscape.
The widespread adoption of KDD solutions is also being driven by the growing demand for personalized customer experiences and predictive analytics. In highly competitive markets such as retail, e-commerce, and telecommunications, organizations are leveraging KDD to analyze customer behavior, preferences, and purchasing patterns. This enables them to tailor their offerings, optimize marketing strategies, and enhance customer engagement. The ability to anticipate market trends, forecast demand, and identify emerging opportunities is proving invaluable for businesses seeking to maintain a competitive edge. Additionally, the shift towards cloud-based KDD solutions is making advanced analytics accessible to small and medium enterprises, democratizing the benefits of knowledge discovery and leveling the playing field.
From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, accounting for the largest share in 2024. This leadership can be attributed to the strong presence of technology giants, advanced IT infrastructure, and early adoption of analytics solutions across key industries. However, the Asia Pacific region is emerging as the fastest-growing market, driven by rapid digitization, government initiatives promoting data-driven innovation, and the proliferation of SMEs embracing cloud-based KDD platforms. Europe also represents a significant market, characterized by stringent data protection regulations and a focus on industrial automation. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in digital infrastructure and a growing recognition of the value of data analytics.
The component segment of the Knowledge Discovery in Databases market is categorized into software, services, and platforms, each playing a pivotal role in the
This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. ; gcounsel@ics.uci.edu
According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.
The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.
Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.
The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.
From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.
The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes that you can find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = number of occurrences in which this statement is true divided by the total number of statements
Confidence = number of occurrences in which this statement is true divided by the number of occurrences of the premise
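The two measures can be illustrated with a short sketch. It assumes each paper is represented as a set of nominal feature values; the toy records below are invented purely for illustration and are not taken from the DL4SE database, and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """Among transactions containing the premise, the fraction that also contain the conclusion."""
    premise = frozenset(premise)
    both = premise | frozenset(conclusion)
    premise_hits = sum(1 for t in transactions if premise <= t)
    if premise_hits == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both_hits = sum(1 for t in transactions if both <= t)
    return both_hits / premise_hits


# Invented toy records: one set of feature values per paper.
papers = [
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "reproducible"}),
    frozenset({"unsupervised", "reproducible"}),
]

rule_support = support(papers, {"supervised", "irreproducible"})
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})
```

In this toy data the rule "Supervised Learning implies irreproducible" holds in two of four papers and in two of the three papers that use supervised learning.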
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We publicly release the anonymized song_embeddings.parquet, user_embeddings.parquet, user_features_test.parquet, user_features_train.parquet, and user_features_validation.parquet datasets, with each of the TT-SVD or UT-ALS versions of embeddings, from the music streaming platform Deezer, as described in the article "A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps" published in the proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021). The paper is available here.
These datasets are used in the GitHub repository deezer/semi_perso_user_cold_start to reproduce experiments from the article.
Please cite our paper if you use our code or data in your work.
Modern aircraft are producing data at an unprecedented rate, with hundreds of parameters being recorded on a second-by-second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.
Credit for the original edge-list data goes to: Emaad Manzoor, Sadegh M. Milajerdi and Leman Akoglu. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2016.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
HiDF is a high-quality deepfake dataset designed to challenge the limits of current detection models. It contains over 62,000 images and 8,000 videos generated using commercial deepfake tools, all manually curated to be indistinguishable from real content by human evaluators.
HiDF provides a new benchmark for evaluating the realism and detectability of AI-generated media, and is intended to support the development of more robust and generalizable deepfake detection systems.
HiDF was introduced in our paper, "HiDF: A Human-Indistinguishable Deepfake Dataset", accepted to The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025).
With the rapid evolution and proliferation of botnets, large-scale cyber attacks such as DDoS and spam emails are becoming increasingly dangerous and serious cyber threats. Because of this, network-based security technologies such as Network-based Intrusion Detection Systems (NIDSs), Intrusion Prevention Systems (IPSs), and firewalls have received remarkable attention as a means to defend our crucial computer systems, networks and sensitive information from attackers on the Internet. In particular, there has been much effort towards high-performance NIDSs based on data mining and machine learning techniques. However, there is a fatal problem in that the existing evaluation dataset, called the KDD Cup '99 dataset, cannot reflect current network situations and the latest attack trends. This is because it was generated by simulation over a virtual network many years ago. To the best of our knowledge, there is no alternative evaluation dataset. In this paper, we present a new evaluation dataset, called Kyoto 2006+, built on 3 years of real traffic data (Nov. 2006 to Aug. 2009) obtained from diverse types of honeypots.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Criteo Attribution Modeling for Bidding Dataset
This dataset is released along with the paper:
Attribution Modeling Increases Efficiency of Bidding in Display Advertising
Eustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), Pierre Galland (Criteo)
*authors contributed equally
2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017)
When using this dataset, please cite the paper… See the full description on the dataset page: https://huggingface.co/datasets/criteo/criteo-attribution-dataset.
https://creativecommons.org/publicdomain/zero/1.0/
Gowalla is a location-based social networking website where users share their locations by checking-in.
Time and location information of check-ins made by users.
This data set is available from https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
e-print arXiv and covers all the citations within a dataset of 34,546 papers
with 421,578 edges. If a paper i cites paper j, the graph contains a directed
edge from i to j. If a paper cites, or is cited by, a paper outside the
dataset, the graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-PH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 34546
Edges 421578
Nodes in largest WCC 34401 (0.996)
Edges in largest WCC 421485 (1.000)
Nodes in largest SCC 12711 (0.368)
Edges in largest SCC 139981 (0.332)
Average clustering coefficient 0.2962
Number of triangles 1276868
Fraction of closed triangles 0.1457
Diameter (longest shortest path) 12
90-percentile effective diameter 5
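The node, edge, and largest-WCC figures above can be recomputed from an edge list. The following is a minimal sketch, assuming one whitespace-separated "source target" pair per line and comment lines starting with "#"; the function names and the small sample are illustrative only.

```python
from collections import Counter


def parse_edges(lines):
    """Parse 'src dst' pairs, skipping blank lines and '#' comments."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def largest_wcc_size(edges):
    """Size of the largest weakly connected component (edge direction ignored)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values()) if sizes else 0


# Tiny invented sample in the assumed format: paper 1 cites 2, 2 cites 3, 4 cites 5.
sample = ["# FromNodeId ToNodeId", "1\t2", "2\t3", "4\t5"]
edges = parse_edges(sample)
nodes = {n for edge in edges for n in edge}
wcc = largest_wcc_size(edges)
```

Union-find is used here because weak connectivity ignores edge direction, so a single pass over the edges is enough; strongly connected components would need a different algorithm.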
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
Explorations 5(2): 149-151, 2003.
Files
File Description
cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)
Dataset information
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
arXiv and covers all the citations within a dataset of 27,770 papers with
352,807 edges. If a paper i cites paper j, the graph contains a directed edge
from i to j. If a paper cites, or is cited by, a paper outside the dataset, the
graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-TH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 27770
Edges 352807
Nodes in largest WCC 27400 (0.987) ...
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Data Information: WISDM (Wireless Sensor Data Mining) smartphone-based sensor data, collected from 36 different users performing six different activities.
Number of examples: 1,098,207
Number of attributes: 6
Missing attribute values: None
Data processing:
1. Replace the nanoseconds with seconds in the timestamp column, and remove the user column, because each user performs the same actions.
2. Use the sliding window method to transform the data into sequences, and then split the sequences of each label into training and testing sets, so that each label keeps an 8:2 ratio between the training and testing sets.
3. Shuffle the sequences in both the training and testing sets and interleave them by label to prevent two sequences with the same label from being lined up consecutively.
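Step 2 above can be sketched as follows. This is only an illustration under stated assumptions: the window length and step are free parameters not given in the description, windows that span a label change are dropped (an assumption, since the description does not say how they are handled), and the function names are hypothetical. The shuffling and interleaving of step 3 is not shown.

```python
def sliding_windows(samples, labels, window, step):
    """Cut the stream into fixed-length windows; keep only windows with a single label."""
    sequences = []
    for start in range(0, len(samples) - window + 1, step):
        window_labels = labels[start:start + window]
        if len(set(window_labels)) == 1:  # assumption: drop mixed-label windows
            sequences.append((samples[start:start + window], window_labels[0]))
    return sequences


def split_per_label(sequences, train_ratio=0.8):
    """Split each label's sequences separately so every label keeps the same ratio."""
    by_label = {}
    for seq, label in sequences:
        by_label.setdefault(label, []).append((seq, label))
    train, test = [], []
    for items in by_label.values():
        cut = int(len(items) * train_ratio)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Invented toy stream: 10 readings of activity 0 followed by 10 of activity 1.
readings = list(range(20))
activity = [0] * 10 + [1] * 10
windows = sliding_windows(readings, activity, window=2, step=2)
train_set, test_set = split_per_label(windows)
```

Splitting per label rather than over the whole pool is what keeps rare activities such as Sitting and Standing represented in both sets.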
Activity:
0 = Downstairs 100,427 (9.1%)
1 = Jogging 342,177 (31.2%)
2 = Sitting 59,939 (5.5%)
3 = Standing 48,395 (4.4%)
4 = Upstairs 122,869 (11.2%)
5 = Walking 424,400 (38.6%)
Resource:
The dataset was collected by the WISDM Lab [https://www.cis.fordham.edu/wisdm/dataset.php]
Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal (2011). "Design Considerations for the WISDM Smart Phone-Based Sensor Mining Architecture," Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (at KDD-11), San Diego, CA. [https://www.cis.fordham.edu/wisdm/includes/files/Lockhart-Design-SensorKDD11.pdf]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
route-views between March 31 2001 and May 26 2001.
Dataset statistics are calculated for the graphs with the lowest (March 31 2001)
and highest (May 26 2001) number of nodes.
Dataset statistics for graph with lowest number of nodes - March 31 2001
Nodes 10670
Edges 22002
Nodes in largest WCC 10670 (1.000)
Edges in largest WCC 22002 (1.000)
Nodes in largest SCC 10670 (1.000)
Edges in largest SCC 22002 (1.000)
Average clustering coefficient 0.4559
Number of triangles 17144
Fraction of closed triangles 0.009306
Diameter (longest shortest path) 9
90-percentile effective diameter 4.5
Dataset statistics for graph with highest number of nodes - May 26 2001
Nodes 11174
Edges 23409
Nodes in largest WCC 11174 (1.000)
Edges in largest WCC 23409 (1.000)
Nodes in largest SCC 11174 (1.000)
Edges in largest SCC 23409 (1.000)
Average clustering coefficient 0.4532
Number of triangles 19894
Fraction of closed triangles 0.009636
Diameter (longest shortest path) 10
90-percentile effective diameter 4.4
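Two of the statistics listed above, the number of triangles and the average clustering coefficient, can be computed from an undirected edge list. The sketch below assumes that nodes with fewer than two neighbors contribute a coefficient of 0, which may differ from the convention used to produce the published figures; the function names and sample graph are illustrative.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_count(adj):
    """Count triangles; each one is seen once from each of its three corners."""
    corner_hits = 0
    for nbrs in adj.values():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                corner_hits += 1
    return corner_hits // 3


def average_clustering(adj):
    """Mean of local clustering coefficients over all nodes."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # assumption: low-degree nodes count as 0
            continue
        links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Invented sample: a triangle 1-2-3 with a pendant node 4 attached to node 3.
graph = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
triangles = triangle_count(graph)
avg_cc = average_clustering(graph)
```

This neighbor-pair enumeration is quadratic in node degree, which is fine for graphs of roughly ten thousand nodes like these but would need a degree-ordered variant for much larger ones.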
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
* AS peering information inferred from Oregon route-views ...
oregon1_010331.txt.gz from March 31 2001
oregon1_010407.txt.gz from April 7 2001
oregon1_010414.txt.gz from April 14 2001
oregon1_010421.txt.gz from April 21 2001
oregon1_010428.txt.gz from April 28 2001
oregon1_010505.txt.gz from May 05 2001
oregon1_010512.txt.gz from May 12 2001
oregon1_010519.txt.gz from May 19 2001
oregon1_010526.txt.gz from May 26 2001
NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26
2001.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The project introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves. To do so, we measured the VOCs from a movie theater over a whole month in intervals of thirty seconds, and annotated the screened films by a controlled vocabulary compiled from multiple sources.
The data set consists of two parts: first the measured VOCs, and second the information about the movies. The VOCs are given in the file TOF_CO2_data_30sec.arff, which contains the time of the measurement in the first column and all 400+ measured VOCs in the other columns. Roughly one measurement was carried out every 30 seconds. The information about which movies were shown is given in the file screenings.csv. It gives the start time, end time, movie title, and how many visitors were in the screening. Additionally, the folder labels_aggregated gives a consensus labelling of multiple annotators for the movies. The labels describe the scenes, with each label represented by a row and each column showing whether the label is active (1) or not (0). This is available for 6 movies in the data set.
The goal of our initial analysis was the identification of markers, that is, finding certain VOCs that have a relation to certain labels and therefore emotions. For example, given the scene label blood, is there any increase or decrease in the concentration of a specific VOC?
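As a rough illustration of the marker question, one could compare the mean concentration of a VOC while a label is active against the mean while it is inactive. This is a deliberately naive sketch, not the method used in the cited publications; it assumes the concentration series and the 0/1 label series have already been aligned to the same time steps, and the function name and numbers are hypothetical.

```python
from statistics import mean


def label_effect(concentrations, label_active):
    """Mean concentration with the label active minus the mean with it inactive."""
    active = [c for c, a in zip(concentrations, label_active) if a == 1]
    inactive = [c for c, a in zip(concentrations, label_active) if a == 0]
    if not active or not inactive:
        return None  # no contrast available for this label
    return mean(active) - mean(inactive)


# Invented values: one VOC over four time steps, label active at steps 2 and 4.
effect = label_effect([1.0, 3.0, 1.0, 3.0], [0, 1, 0, 1])
```

A positive value suggests the concentration tends to rise while the label is active; a real analysis would also need to account for audience size, delays, and statistical significance.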
Further information is available in our publications https://doi.org/10.1145/2783258.2783404, https://doi.org/10.1038/srep25464, and https://dx.doi.org/10.1371/journal.pone.0203044
If you use this data set, please cite:
Jörg Wicker, Nicolas Krauter, Bettina Derstorff, Christof Stönner, Efstratios Bourtsoukidis, Thomas Klüpfel, Jonathan Williams, and Stefan Kramer. 2015. Cinema Data Mining: The Smell of Fear. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1295-1304. DOI: https://doi.org/10.1145/2783258.2783404
While the first analysis already gave interesting results, we believe that this data set has high potential for further analysis. We are currently working on increasing the size of the data. Additionally, multiple follow-up publications are being prepared. There are many possible tasks; we focus mainly on the identification of markers in the VOC data, but the data set may hold many other interesting findings. Are movies related based on the VOCs? Could we identify similar scenes based on the VOCs?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/loc-Brightkite.html
Dataset information
Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API,
and consists of 58,228 nodes and 214,078 edges. The network is originally
directed but we have constructed a network with undirected edges when there
is a friendship in both ways. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.
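The construction of the undirected network can be sketched as follows: an undirected edge is kept only when the friendship exists in both directions. This is an illustrative sketch on a toy edge list, not the script used to build the published data; the function name is hypothetical, and dropping self-loops is an assumption of the sketch.

```python
def mutual_edges(directed_edges):
    """Keep one undirected edge (u, v) with u < v for every pair that is
    connected in both directions; self-loops are dropped (an assumption)."""
    edge_set = set(directed_edges)
    return sorted(
        {(min(u, v), max(u, v))
         for (u, v) in edge_set
         if u != v and (v, u) in edge_set}
    )


# Toy directed edge list: 0<->1 and 2<->3 are mutual, 1->2 is one-way.
directed = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
undirected = mutual_edges(directed)
print(undirected)  # [(0, 1), (2, 3)]
```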
Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File                                  Description
loc-brightkite_edges.txt.gz           Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz   Time and location information of
                                      check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
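A line in this format can be split into its five fields as in the sketch below. It assumes the fields are separated by whitespace and treats the trailing Z of the timestamp as UTC; the CheckIn type and the function name are hypothetical and only serve the illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line):
    """Parse one check-in line with five whitespace-separated fields."""
    user, stamp, lat, lon, loc = line.split()
    # The trailing "Z" marks UTC, so attach the UTC time zone explicitly.
    time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return CheckIn(int(user), time, float(lat), float(lon), loc)


c = parse_checkin(
    "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411"
)
print(c.user, c.time.isoformat(), c.latitude, c.longitude, c.location_id)
```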
The SNAP data set is 0-based, with nodes numbered 0 to 58,227.
In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(i+1,j+1)=1 if person i and person j are friends in the SNAP data set.
There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 of these lines hold only a user id and no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).
userid: the SNAP user id, an integer in the range 0 to 58,227,
incremented by one here to match the corresponding row and
column of the Problem.A matrix. The vector contains 51,406
unique user ids.
checkin_time: a string of length 20
latitude: a double precision number
longitude: a double precision number
location_id: a string of length 61.
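The layout described above can be mirrored in a short sketch: lines without all five fields are skipped, the user id is incremented by one, and each field goes into its own vector so that the kth entries belong together. This is an illustration in Python on a few sample lines, not the conversion code used by the SuiteSparse Matrix Collection; the function name is hypothetical.

```python
def build_aux(lines):
    """Build five parallel lists from check-in lines (hypothetical helper).

    Lines that do not have exactly five fields, such as lines holding
    only a user id, are discarded.
    """
    aux = {
        "userid": [],
        "checkin_time": [],
        "latitude": [],
        "longitude": [],
        "location_id": [],
    }
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # incomplete line, e.g. a user id with no other data
        user, stamp, lat, lon, loc = fields
        aux["userid"].append(int(user) + 1)  # 0-based SNAP id -> 1-based
        aux["checkin_time"].append(stamp)    # kept as a string
        aux["latitude"].append(float(lat))
        aux["longitude"].append(float(lon))
        aux["location_id"].append(loc)
    return aux


sample = [
    "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411",
    "58189",  # made-up incomplete line with a user id only
    "58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8",
]
aux = build_aux(sample)
print(aux["userid"])  # [58187, 58188]
```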
https://snap.stanford.edu/data/loc-Gowalla.html
Dataset information
Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.
Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File                               Description
loc-gowalla_edges.txt.gz           Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz   Time and location information
                                   of check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046 376497
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333 98503
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477 1043431
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763 881734
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689 207763
196514 2010-07-24T13:41:10Z 53.364905 -2.270824 1042822