This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('kddcup99', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Gowalla is a location-based social networking website where users share their locations by checking-in.
Time and location information of check-ins made by users.
This data set is available from https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
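As a sketch of how the raw check-in file might be parsed (the tab-separated field order — user, check-in time, latitude, longitude, location id — follows the SNAP page description and should be verified against the downloaded file):

```python
from datetime import datetime

def parse_checkin(line):
    """Parse one tab-separated Gowalla check-in line.

    Field order (user, check-in time, lat, lng, location id) is an
    assumption based on the SNAP description; verify against the file.
    """
    user, ts, lat, lng, loc = line.rstrip("\n").split("\t")
    return {
        "user": int(user),
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"),
        "lat": float(lat),
        "lng": float(lng),
        "location_id": int(loc),
    }

sample = "0\t2010-10-19T23:55:27Z\t30.2359091167\t-97.7951395833\t22847"
print(parse_checkin(sample)["location_id"])  # 22847
```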
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Imbalanced dataset for benchmarking
=======================
The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.
Characteristics
-------------------
|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
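The Ratio column above is the number of majority-class samples per minority-class sample. A minimal sketch of how it can be computed for any label vector (using a toy vector chosen to match the Ecoli row):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority:minority class size ratio, as in the Ratio column."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# Toy example: 86 majority vs 10 minority samples -> 8.6 (cf. Ecoli, 8.6:1)
y = [0] * 86 + [1] * 10
print(round(imbalance_ratio(y), 1))  # 8.6
```

(`imbalanced-learn` itself ships these benchmarks via `imblearn.datasets.fetch_datasets`.)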
References
----------
[1] Ding, Zejin. "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
The original edge-list data credits to: Emaad Manzoor, Sadegh M. Milajerdi and Leman Akoglu. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD). 2016.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
HiDF is a high-quality deepfake dataset designed to challenge the limits of current detection models. It contains over 62,000 images and 8,000 videos generated using commercial deepfake tools, all manually curated to be indistinguishable from real content by human evaluators.
HiDF provides a new benchmark for evaluating the realism and detectability of AI-generated media, and is intended to support the development of more robust and generalizable deepfake detection systems.
HiDF was introduced in our paper, "HiDF: A Human-Indistinguishable Deepfake Dataset", accepted to The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
route-views between March 31 2001 and May 26 2001.
Dataset statistics are calculated for the graphs with the lowest (March 31, 2001) and highest (May 26, 2001) number of nodes.
Dataset statistics for the graph with the lowest number of nodes (March 31, 2001):
Nodes 10670
Edges 22002
Nodes in largest WCC 10670 (1.000)
Edges in largest WCC 22002 (1.000)
Nodes in largest SCC 10670 (1.000)
Edges in largest SCC 22002 (1.000)
Average clustering coefficient 0.4559
Number of triangles 17144
Fraction of closed triangles 0.009306
Diameter (longest shortest path) 9
90-percentile effective diameter 4.5
Dataset statistics for the graph with the highest number of nodes (May 26, 2001):
Nodes 11174
Edges 23409
Nodes in largest WCC 11174 (1.000)
Edges in largest WCC 23409 (1.000)
Nodes in largest SCC 11174 (1.000)
Edges in largest SCC 23409 (1.000)
Average clustering coefficient 0.4532
Number of triangles 19894
Fraction of closed triangles 0.009636
Diameter (longest shortest path) 10
90-percentile effective diameter 4.4
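Statistics such as the average clustering coefficient above can be recomputed from an edge list like the Oregon files. A stdlib-only sketch on a toy undirected graph (a triangle plus one pendant node, not the Oregon data itself):

```python
from collections import defaultdict
from itertools import combinations

def load_edges(lines):
    """Build an undirected adjacency map from 'u v' edge-list lines."""
    adj = defaultdict(set)
    for line in lines:
        u, v = line.split()
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering coefficient is 0 for degree < 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

adj = load_edges(["1 2", "2 3", "1 3", "3 4"])
print(round(avg_clustering(adj), 3))  # 0.583
```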
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
* AS peering information inferred from Oregon route-views ...
oregon1_010331.txt.gz from March 31 2001
oregon1_010407.txt.gz from April 7 2001
oregon1_010414.txt.gz from April 14 2001
oregon1_010421.txt.gz from April 21 2001
oregon1_010428.txt.gz from April 28 2001
oregon1_010505.txt.gz from May 05 2001
oregon1_010512.txt.gz from May 12 2001
oregon1_010519.txt.gz from May 19 2001
oregon1_010526.txt.gz from May 26 2001
NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26
2001.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do...
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Pretrained transformer encoder model and its 128-dim z-representation vectors for the Google Brain Ventilator Pressure Prediction train/test sets, based on the time_step, u_in, u_out, R, and C columns.
Training & Feature Extraction Notebook:
https://www.kaggle.com/markpeng/transformer-based-ts-representation-learning
Please upvote this dataset if you like it, thanks!
Reference:
George Zerveas et al. (2021). "_A Transformer-based Framework for Multivariate Time Series Representation Learning_," Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21).
arXiv paper: https://arxiv.org/abs/2010.02803
GitHub repository: https://github.com/gzerveas/mvts_transformer
This dataset represents a sample of 30 days of Criteo live traffic data. Each line corresponds to one impression (a banner) that was displayed to a user. For each banner we have detailed information about the context, if it was clicked, if it led to a conversion and if it led to a conversion that was attributed to Criteo or not. Data has been sub-sampled and anonymized so as not to disclose proprietary elements.
Here is a detailed description of the fields (they are tab-separated in the file):
This dataset can be used in a large scope of applications related to Real-Time-Bidding, including but not limited to:
This dataset is released along with the following paper:
“Attribution Modeling Increases Efficiency of Bidding in Display Advertising” Eustache Diemert*, Julien Meynet* (Criteo AI Lab), Damien Lefortier (Facebook), Pierre Galland (Criteo) (*authors contributed equally) published in “2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017)”
When using this dataset, please cite the paper with the following BibTeX (final ACM BibTeX coming soon):
@inproceedings{DiemertMeynet2017,
author = {{Diemert Eustache, Meynet Julien} and Galland, Pierre and Lefortier, Damien},
title={Attribution Modeling Increases Efficiency of Bidding in Display Advertising},
publisher = {ACM},
pages={To appear},
booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August, 14, 2017},
year = {2017}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repo contains the data introduced in
Immer, A.*, Kristof, V.*, Grossglauser, M., Thiran, P., Sub-Matrix Factorization for Real-Time Vote Prediction, KDD 2020
These data have been collected from OpenData.Swiss every two minutes on two different referendum vote days: May 19, 2019, and February 9, 2020. We use these data to make real-time predictions of referendum outcomes on www.predikon.ch. We publish here the raw data, as retrieved in JSON format from the API. We also provide a Python script to help scrape the JSON files.
After unzipping the datasets, you can scrape the data by referendum vote day by doing:
from scraper import scrape_referenda
data_dir = 'path/to/2020-02-09'
data = scrape_referenda(data_dir)
The data variable will be a list of datum dictionaries of the following structure:
{
  "vote": 6290,
  "municipality": 1,
  "timestamp": "2020-02-09T15:23:10",
  "num_yes": 222,
  "num_no": 482,
  "num_valid": 704,
  "num_total": 709,
  "num_eligible": 1407,
  "yes_percent": 0.3153409090909091,
  "turnout": 0.503909026297086
}
The datum is as follows:
vote: vote ID as defined by OpenData.Swiss
municipality: municipality ID as defined by OpenData.Swiss
timestamp: date and time at which the JSON file was published on OpenData.Swiss
num_yes: number of "yes" in the municipality
num_no: number of "no" in the municipality
num_valid: number of valid ballots (the ones counting for the results)
num_total: total number of ballots (including invalid ones)
num_eligible: number of registered voters
yes_percent: percentage of "yes" (computed as num_yes / num_valid)
turnout: turnout to the vote (computed as num_total / num_eligible)
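The two derived fields can be recomputed from the raw counts; checking them against the example datum above:

```python
def derived_fields(datum):
    """Recompute yes_percent and turnout from the raw counts."""
    return {
        "yes_percent": datum["num_yes"] / datum["num_valid"],
        "turnout": datum["num_total"] / datum["num_eligible"],
    }

# Counts taken from the example datum above
datum = {"num_yes": 222, "num_no": 482, "num_valid": 704,
         "num_total": 709, "num_eligible": 1407}
d = derived_fields(datum)
print(round(d["yes_percent"], 4))  # 0.3153
print(round(d["turnout"], 4))      # 0.5039
```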
Don't hesitate to reach out to us if you have any questions!
To cite this dataset:
@inproceedings{immer2020submatrix, author = {Immer, Alexander and Kristof, Victor and Grossglauser, Matthias and Thiran, Patrick}, title = {Sub-Matrix Factorization for Real-Time Vote Prediction}, year = {2020}, booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining}, }
This dataset, divided into files by city, contains geotagged digital traces collected from the social media platforms detailed below.
• Tweets - Cheng et al. [1]
• Gowalla [2]
• Tweets - Lamsal [3]
• Yelp [4]
• Tweets - Kejriwal et al. [5]
• Geotagged Tweets [6]
• UrbanActivity [7]
• Brightkite [8]
• Weeplaces [8]
• Flickr [9]
• Foursquare [10]
Each file is named according to the city to which the digital traces were associated and contains the columns:
Source: name of the source platform
Event_date: date associated with the digital trace
Lat: latitude of the digital trace
Lng: longitude of the digital trace
The definition of city/town used is provided by Simplemaps [11], which considers a city/town any inhabited place as determined by U.S. government agencies. The locations of cities and their respective centers were obtained from the World Cities Database provided by the same company. A specific group of these cities was utilized for the research presented in the article submitted to Sensors Journal: Muñoz-Cancino, R., Rios, S. A., & Graña, M. (2023). Clustering cities over features extracted from multiple virtual sensors measuring micro-level activity patterns allows to discriminate large-scale city characteristics. Sensors, under review. Comprehensive guidelines and the selection criteria can be found in the abovementioned article.
References
[1] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 759-768, New York, NY, USA, 2010. Association for Computing Machinery.
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1082-1090, New York, NY, USA, 2011. Association for Computing Machinery.
[3] Yunhe Feng and Wenjun Zhou. Is working from home the new norm? An observational study based on a large geo-tagged COVID-19 Twitter dataset, 2020.
[4] Yelp Inc. Yelp Open Dataset, 2021. Retrieved from https://www.yelp.com/dataset. Accessed October 26, 2021.
[5] Mayank Kejriwal and Sara Melotte. A Geo-Tagged COVID-19 Twitter Dataset for 10 North American Metropolitan Areas, January 2021.
[6] Rabindra Lamsal. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51(5):2790-2804, 2021.
[7] Géraud Le Falher, Aristides Gionis, and Michael Mathioudakis. Where is the Soho of Rome? Measures and algorithms for finding similar neighborhoods in cities. In 9th AAAI Conference on Web and Social Media (ICWSM 2015), Oxford, United Kingdom, May 2015.
[8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. Exploiting geographical neighborhood characteristics for location recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 739-748, New York, NY, USA, 2014. Association for Computing Machinery.
[9] Hatem Mousselly-Sergieh, Daniel Watzinger, Bastian Huber, Mario Döller, Elöd Egyed-Zsigmond, and Harald Kosch. World-wide scale geotagged image dataset for automatic image annotation and reverse geotagging. In Proceedings of the 5th ACM Multimedia Systems Conference, MMSys '14, pages 47-52, New York, NY, USA, 2014. Association for Computing Machinery.
[10] Dingqi Yang, Daqing Zhang, Vincent W. Zheng, and Zhiyong Yu. Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129-142, 2015.
[11] Simple Maps. Basic World Cities Database, 2021. Retrieved from https://simplemaps.com/data/world-cities. Accessed September 3, 2021.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Criteo Attribution Modeling for Bidding Dataset
This dataset is released along with the paper: Attribution Modeling Increases Efficiency of Bidding in Display Advertising Eustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), Pierre Galland (Criteo) *authors contributed equally 2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) When using this dataset, please cite the paper… See the full description on the dataset page: https://huggingface.co/datasets/criteo/criteo-attribution-dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The first-ever large-scale generative modeling research competition, Generative Dog Images, was held on Kaggle in the summer of 2019. Over 900 teams participated and submitted a total of 10k+ generated samples, 1.6k of which were selected as the final submissions to rank on the private leaderboard. We are releasing the competition submissions as an effort to facilitate research on generative modeling metric design, particularly towards tackling the issue of detecting training sample memorization, intentional or not.
Each competition submission consists of 10k generated samples of dog images from a generative model trained on the Stanford Dogs dataset. As expected, participants were incentivized to optimize for the objective, and many exploited the insensitivity of current popular generative modeling metrics (e.g., IS, FID) to training sample memorization. We provided manual labels of the type of intentional memorization technique adopted (if any) for each submission. Details regarding the labels can be found in the description of the labels.csv file. We also provided human-assessed image quality annotations for individual images.
Huge thanks to all the participants in the Generative Dog Images research competition for providing all the well-tuned models as well as feedback during the competition. The competition result analysis is published as a conference paper, and if you find this dataset useful, please cite the following:
@inproceedings{bai2021genmem,
author = {Ching-Yuan Bai and Hsuan-Tien Lin and Colin Raffel and Wendy Chih-wen Kan},
title = {On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition},
booktitle = {Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)},
year = 2021,
month = aug
}
The Memorization-informed Fréchet Inception Distance (MiFID) was proposed and adopted as the benchmark metric during the competition to handle the training sample memorization issue. It works well in a competition setting, but obvious flaws make it less than ideal in a general research setting.
Are there any other alternatives?
The large amount and great diversity of models in this dataset can serve as a testing ground for newly developed benchmark metrics.