This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('kddcup99', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Gowalla is a location-based social networking website where users share their locations by checking-in.
Time and location information of check-ins made by users.
This data set is available from https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
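As a sketch of how the raw check-in file might be parsed (the tab-separated field order — user, check-in time, latitude, longitude, location id — follows the SNAP page description and should be verified against the downloaded file):

```python
from datetime import datetime

def parse_checkin(line):
    """Parse one tab-separated Gowalla check-in line.

    Field order (user, check-in time, lat, lng, location id) is an
    assumption based on the SNAP description; verify against the file.
    """
    user, ts, lat, lng, loc = line.rstrip("\n").split("\t")
    return {
        "user": int(user),
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"),
        "lat": float(lat),
        "lng": float(lng),
        "location_id": int(loc),
    }

sample = "0\t2010-10-19T23:55:27Z\t30.2359091167\t-97.7951395833\t22847"
print(parse_checkin(sample)["location_id"])  # 22847
```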
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Imbalanced dataset for benchmarking
=======================
The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.
Characteristics
-------------------
|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
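The Ratio column above is the number of majority-class samples per minority-class sample. A minimal sketch of how it can be computed for any label vector (using a toy vector chosen to match the Ecoli row):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority:minority class size ratio, as in the Ratio column."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# Toy example: 86 majority vs 10 minority samples -> 8.6 (cf. Ecoli, 8.6:1)
y = [0] * 86 + [1] * 10
print(round(imbalance_ratio(y), 1))  # 8.6
```

(`imbalanced-learn` itself ships these benchmarks via `imblearn.datasets.fetch_datasets`.)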
References
----------
[1] Ding, Zejin. "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
The original edge-list data credits to: Emaad Manzoor, Sadegh M. Milajerdi and Leman Akoglu. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD). 2016.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
HiDF is a high-quality deepfake dataset designed to challenge the limits of current detection models. It contains over 62,000 images and 8,000 videos generated using commercial deepfake tools, all manually curated to be indistinguishable from real content by human evaluators.
HiDF provides a new benchmark for evaluating the realism and detectability of AI-generated media, and is intended to support the development of more robust and generalizable deepfake detection systems.
HiDF was introduced in our paper, "HiDF: A Human-Indistinguishable Deepfake Dataset", accepted to The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
route-views between March 31 2001 and May 26 2001.
Dataset statistics are calculated for the graphs with the lowest (March 31, 2001) and highest (May 26, 2001) number of nodes.
Dataset statistics for the graph with the lowest number of nodes (March 31, 2001):
Nodes 10670
Edges 22002
Nodes in largest WCC 10670 (1.000)
Edges in largest WCC 22002 (1.000)
Nodes in largest SCC 10670 (1.000)
Edges in largest SCC 22002 (1.000)
Average clustering coefficient 0.4559
Number of triangles 17144
Fraction of closed triangles 0.009306
Diameter (longest shortest path) 9
90-percentile effective diameter 4.5
Dataset statistics for the graph with the highest number of nodes (May 26, 2001):
Nodes 11174
Edges 23409
Nodes in largest WCC 11174 (1.000)
Edges in largest WCC 23409 (1.000)
Nodes in largest SCC 11174 (1.000)
Edges in largest SCC 23409 (1.000)
Average clustering coefficient 0.4532
Number of triangles 19894
Fraction of closed triangles 0.009636
Diameter (longest shortest path) 10
90-percentile effective diameter 4.4
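Statistics such as the average clustering coefficient above can be recomputed from an edge list like the Oregon files. A stdlib-only sketch on a toy undirected graph (a triangle plus one pendant node, not the Oregon data itself):

```python
from collections import defaultdict
from itertools import combinations

def load_edges(lines):
    """Build an undirected adjacency map from 'u v' edge-list lines."""
    adj = defaultdict(set)
    for line in lines:
        u, v = line.split()
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering coefficient is 0 for degree < 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

adj = load_edges(["1 2", "2 3", "1 3", "3 4"])
print(round(avg_clustering(adj), 3))  # 0.583
```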
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
* AS peering information inferred from Oregon route-views ...
oregon1_010331.txt.gz from March 31 2001
oregon1_010407.txt.gz from April 7 2001
oregon1_010414.txt.gz from April 14 2001
oregon1_010421.txt.gz from April 21 2001
oregon1_010428.txt.gz from April 28 2001
oregon1_010505.txt.gz from May 05 2001
oregon1_010512.txt.gz from May 12 2001
oregon1_010519.txt.gz from May 19 2001
oregon1_010526.txt.gz from May 26 2001
NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26
2001.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do...
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Pretrained transformer encoder model and its 128-dim z-representation vectors for the Google Brain Ventilator Pressure Prediction train/test sets, based on the time_step, u_in, u_out, R, and C columns.
Training & Feature Extraction Notebook:
https://www.kaggle.com/markpeng/transformer-based-ts-representation-learning
Please upvote this dataset if you like it, thanks!
Reference:
George Zerveas et al. (2021). "_A Transformer-based Framework for Multivariate Time Series Representation Learning_," Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21).
arXiv paper: https://arxiv.org/abs/2010.02803
GitHub repository: https://github.com/gzerveas/mvts_transformer
This dataset represents a sample of 30 days of Criteo live traffic data. Each line corresponds to one impression (a banner) that was displayed to a user. For each banner we have detailed information about the context, if it was clicked, if it led to a conversion and if it led to a conversion that was attributed to Criteo or not. Data has been sub-sampled and anonymized so as not to disclose proprietary elements.
Here is a detailed description of the fields (they are tab-separated in the file):
This dataset can be used in a large scope of applications related to Real-Time-Bidding, including but not limited to:
This dataset is released along with the following paper:
“Attribution Modeling Increases Efficiency of Bidding in Display Advertising” Eustache Diemert*, Julien Meynet* (Criteo AI Lab), Damien Lefortier (Facebook), Pierre Galland (Criteo) (*authors contributed equally) published in “2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017)”
When using this dataset, please cite the paper with the following BibTeX (final ACM BibTeX coming soon):
@inproceedings{DiemertMeynet2017,
author = {{Diemert Eustache, Meynet Julien} and Galland, Pierre and Lefortier, Damien},
title={Attribution Modeling Increases Efficiency of Bidding in Display Advertising},
publisher = {ACM},
pages={To appear},
booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August, 14, 2017},
year = {2017}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repo contains the data introduced in
Immer, A.*, Kristof, V.*, Grossglauser, M., Thiran, P., Sub-Matrix Factorization for Real-Time Vote Prediction, KDD 2020
These data have been collected from OpenData.Swiss every two minutes on two different referendum vote days: May 19, 2019, and February 9, 2020. We use these data to make real-time predictions of referendum outcomes on www.predikon.ch. We publish here the raw data, as retrieved in JSON format from the API. We also provide a Python script to help scrape the JSON files.
After unzipping the datasets, you can scrape the data by referendum vote day by doing:
from scraper import scrape_referenda
data_dir = 'path/to/2020-02-09'
data = scrape_referenda(data_dir)
The data variable will be a list of datum dictionaries of the following structure:
{
  "vote": 6290,
  "municipality": 1,
  "timestamp": "2020-02-09T15:23:10",
  "num_yes": 222,
  "num_no": 482,
  "num_valid": 704,
  "num_total": 709,
  "num_eligible": 1407,
  "yes_percent": 0.3153409090909091,
  "turnout": 0.503909026297086
}
The datum is as follows:
vote: vote ID as defined by OpenData.Swiss
municipality: municipality ID as defined by OpenData.Swiss
timestamp: date and time at which the JSON file was published on OpenData.Swiss
num_yes: number of "yes" in the municipality
num_no: number of "no" in the municipality
num_valid: number of valid ballots (the ones counting for the results)
num_total: total number of ballots (including invalid ones)
num_eligible: number of registered voters
yes_percent: percentage of "yes" (computed as num_yes / num_valid)
turnout: turnout to the vote (computed as num_total / num_eligible)
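The two derived fields can be recomputed from the raw counts; checking them against the example datum above:

```python
def derived_fields(datum):
    """Recompute yes_percent and turnout from the raw counts."""
    return {
        "yes_percent": datum["num_yes"] / datum["num_valid"],
        "turnout": datum["num_total"] / datum["num_eligible"],
    }

# Counts taken from the example datum above
datum = {"num_yes": 222, "num_no": 482, "num_valid": 704,
         "num_total": 709, "num_eligible": 1407}
d = derived_fields(datum)
print(round(d["yes_percent"], 4))  # 0.3153
print(round(d["turnout"], 4))      # 0.5039
```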
Don't hesitate to reach out to us if you have any questions!
To cite this dataset:
@inproceedings{immer2020submatrix, author = {Immer, Alexander and Kristof, Victor and Grossglauser, Matthias and Thiran, Patrick}, title = {Sub-Matrix Factorization for Real-Time Vote Prediction}, year = {2020}, booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining}, }
This dataset, divided into files by city, contains geotagged digital traces collected from the social media platforms detailed below.
• Tweets - Cheng et al. [1]
• Gowalla [2]
• Tweets - Lamsal [3]
• Yelp [4]
• Tweets - Kejriwal et al. [5]
• Geotagged Tweets [6]
• UrbanActivity [7]
• Brightkite [8]
• Weeplaces [8]
• Flickr [9]
• Foursquare [10]
Each file is named according to the city to which the digital traces were associated and contains the columns:
Source: name of the source platform
Event_date: date associated with the digital trace
Lat: latitude of the digital trace
Lng: longitude of the digital trace
The definition of city/town used is provided by Simplemaps [11], which considers a city/town any inhabited place as determined by U.S. government agencies. The locations of cities and their respective centers were obtained from the World Cities Database provided by the same company. A specific group of these cities was utilized for the research presented in the article submitted to Sensors Journal: Muñoz-Cancino, R., Rios, S. A., & Graña, M. (2023). Clustering cities over features extracted from multiple virtual sensors measuring micro-level activity patterns allows to discriminate large-scale city characteristics. Sensors, under review. Comprehensive guidelines and the selection criteria can be found in the abovementioned article.
References
[1] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 759-768, New York, NY, USA, 2010. Association for Computing Machinery.
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1082-1090, New York, NY, USA, 2011. Association for Computing Machinery.
[3] Yunhe Feng and Wenjun Zhou. Is working from home the new norm? An observational study based on a large geo-tagged COVID-19 Twitter dataset, 2020.
[4] Yelp Inc. Yelp Open Dataset, 2021. Retrieved from https://www.yelp.com/dataset. Accessed October 26, 2021.
[5] Mayank Kejriwal and Sara Melotte. A Geo-Tagged COVID-19 Twitter Dataset for 10 North American Metropolitan Areas, January 2021.
[6] Rabindra Lamsal. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51(5):2790-2804, 2021.
[7] Géraud Le Falher, Aristides Gionis, and Michael Mathioudakis. Where is the Soho of Rome? Measures and algorithms for finding similar neighborhoods in cities. In 9th AAAI Conference on Web and Social Media (ICWSM 2015), Oxford, United Kingdom, May 2015.
[8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. Exploiting geographical neighborhood characteristics for location recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 739-748, New York, NY, USA, 2014. Association for Computing Machinery.
[9] Hatem Mousselly-Sergieh, Daniel Watzinger, Bastian Huber, Mario Döller, Elöd Egyed-Zsigmond, and Harald Kosch. World-wide scale geotagged image dataset for automatic image annotation and reverse geotagging. In Proceedings of the 5th ACM Multimedia Systems Conference, MMSys '14, pages 47-52, New York, NY, USA, 2014. Association for Computing Machinery.
[10] Dingqi Yang, Daqing Zhang, Vincent W. Zheng, and Zhiyong Yu. Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129-142, 2015.
[11] Simple Maps. Basic World Cities Database, 2021. Retrieved from https://simplemaps.com/data/world-cities. Accessed September 3, 2021.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Criteo Attribution Modeling for Bidding Dataset
This dataset is released along with the paper: Attribution Modeling Increases Efficiency of Bidding in Display Advertising Eustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), Pierre Galland (Criteo) *authors contributed equally 2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) When using this dataset, please cite the paper… See the full description on the dataset page: https://huggingface.co/datasets/criteo/criteo-attribution-dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The first-ever large-scale generative modeling research competition, Generative Dog Images, was held on Kaggle in the summer of 2019. Over 900 teams participated and submitted a total of 10k+ generated samples, 1.6k of which were selected as the final submissions to rank on the private leaderboard. We are releasing the competition submissions as an effort to facilitate research on generative modeling metric design, particularly towards tackling the issue of detecting training sample memorization, intentional or not.
Each competition submission consists of 10k generated samples of dog images from a generative model trained on the Stanford Dogs dataset. As expected, participants were incentivized to optimize for the objective, and many exploited the insensitivity of current popular generative modeling metrics (e.g., IS, FID) to training sample memorization. We provided manual labels of the type of intentional memorization technique adopted (if any) for each submission. Details regarding the labels can be found in the description of the labels.csv file. We also provided human-assessed image quality annotations for individual images.
Huge thanks to all the participants in the Generative Dog Images research competition for providing all the well-tuned models as well as feedback during the competition. The competition result analysis is published as a conference paper, and if you find this dataset useful, please cite the following:
@inproceedings{bai2021genmem,
author = {Ching-Yuan Bai and Hsuan-Tien Lin and Colin Raffel and Wendy Chih-wen Kan},
title = {On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition},
booktitle = {Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)},
year = 2021,
month = aug
}
The Memorization-informed Fréchet Inception Distance (MiFID) was proposed and adopted as the benchmark metric during the competition to handle the training sample memorization issue. It works well in a competition setting, but obvious flaws make it less than ideal in a general research setting.
Are there any other alternatives?
The large amount and great diversity of models in this dataset can serve as a testing ground for newly developed benchmark metrics.