18 datasets found
  1. Data from: KDD Cup 1999 Data

    • impactcybertrust.org
    Updated Jan 19, 2019
    Cite
    External Data Source (2019). KDD Cup 1999 Data [Dataset]. http://doi.org/10.23721/100/1478801
    Explore at:
    Dataset updated
    Jan 19, 2019
    Authors
    External Data Source
    Description

    This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

    The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

    Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

    The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. ; gcounsel@ics.uci.edu

  2. kddcup99

    • tensorflow.org
    Updated Jan 4, 2023
    Cite
    (2023). kddcup99 [Dataset]. https://www.tensorflow.org/datasets/catalog/kddcup99
    Explore at:
    Dataset updated
    Jan 4, 2023
    Description

    This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('kddcup99', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. Internet Traffic (KDD Cup 99)

    • ieee-dataport.org
    Updated Apr 23, 2025
    Cite
    Ginel Dorleon (2025). Internet Traffic (KDD Cup 99) [Dataset]. https://ieee-dataport.org/documents/internet-traffic-kdd-cup-99
    Explore at:
    Dataset updated
    Apr 23, 2025
    Authors
    Ginel Dorleon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    called intrusions or attacks

  4. Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jun 23, 2025
    Cite
    Chavalekvirat, Panwad; Chuang, Ho-Chiao; Deepaisarn, Somrudee; Deshsorn, Krittapong; Iamprasertkun, Pawin (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002054578
    Explore at:
    Dataset updated
    Jun 23, 2025
    Authors
    Chavalekvirat, Panwad; Chuang, Ho-Chiao; Deepaisarn, Somrudee; Deshsorn, Krittapong; Iamprasertkun, Pawin
    Description

    Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.

  5. KDDCup99

    • ieee-dataport.org
    Updated Jan 12, 2025
    Cite
    Santhosh B J (2025). KDDCup99 [Dataset]. https://ieee-dataport.org/documents/kddcup99
    Explore at:
    Dataset updated
    Jan 12, 2025
    Authors
    Santhosh B J
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    called intrusions or attacks

  6. KDD Cup 1999

    • opendatalab.com
    zip
    Cite
    North Carolina State University, KDD Cup 1999 [Dataset]. https://opendatalab.com/OpenDataLab/KDD_Cup_1999
    Available download formats: zip
    Dataset provided by
    Florida Institute of Technology
    North Carolina State University
    Georgia Institute of Technology
    Columbia University
    Description

    This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

  7. Discovering Precursors to Aviation Safety Incidents: KDD 2010

    • s.cnmilf.com
    • data.nasa.gov
    • +2 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Discovering Precursors to Aviation Safety Incidents: KDD 2010 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/discovering-precursors-to-aviation-safety-incidents-kdd-2010
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.

  8. Gowalla Checkins

    • kaggle.com
    Updated Nov 15, 2017
    Cite
    bqlearner (2017). Gowalla Checkins [Dataset]. https://www.kaggle.com/bqlearner/gowalla-checkins/notebooks
    Dataset updated
    Nov 15, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    bqlearner
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Gowalla is a location-based social networking website where users share their locations by checking-in.

    Content

    Time and location information of check-ins made by users.

    Acknowledgements

    This data set is available from https://snap.stanford.edu/data/loc-gowalla.html

    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.

  9. Imbalanced dataset for benchmarking

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2020). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.5281/zenodo.61452
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Imbalanced dataset for benchmarking
    =======================

    The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.

    Characteristics
    -------------------

    |ID |Name |Repository & Target |Ratio |# samples| # features |
    |:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
    |1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
    |2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
    |3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
    |4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
    |5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
    |6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
    |7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
    |8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
    |9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
    |10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
    |11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
    |12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
    |13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
    |14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
    |15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
    |16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
    |17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
    |18 |OIL |UCI, target: minority |22:1 |937 |49 |
    |19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
    |20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
    |21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
    |22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
    |23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
    |24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
    |25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
    |26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
    |27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
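
    The Ratio column can be read as the number of majority samples per minority sample. A minimal sketch of that computation follows, assuming every label other than the target counts as majority; the function name and the toy labels are hypothetical and are not part of the `imbalanced-learn` toolbox.

```python
from collections import Counter


def imbalance_ratio(labels, minority_label):
    """Return majority/minority, treating every non-target label as majority."""
    counts = Counter(label == minority_label for label in labels)
    minority = counts[True]
    majority = counts[False]
    if minority == 0:
        raise ValueError("minority_label does not occur in labels")
    return majority / minority


# Toy labels (hypothetical, not taken from any dataset in the table).
toy_labels = ["imU"] * 10 + ["other"] * 86
ratio = imbalance_ratio(toy_labels, "imU")
print(f"{ratio:.1f}:1")
```

    The same helper can be applied to any label column once a dataset from the table has been loaded.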

    References
    ----------
    [1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

    [2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

    [3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

    [4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

  10. StreamSpot Dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Han, Xueyuan (2023). StreamSpot Dataset [Dataset]. http://doi.org/10.7910/DVN/83KYJY
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Han, Xueyuan
    Description

    Credit for the original edge-list data goes to: Emaad Manzoor, Sadegh M. Milajerdi and Leman Akoglu. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2016.

  11. HiDF: A Human-Indistinguishable Deepfake Dataset

    • zenodo.org
    csv, zip
    Updated Jul 30, 2025
    Cite
    Chaewon Kang; Seoyoon Jeong; Jonghyun Lee (2025). HiDF: A Human-Indistinguishable Deepfake Dataset [Dataset]. http://doi.org/10.1145/3711896.3737399
    Available download formats: zip, csv
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chaewon Kang; Seoyoon Jeong; Jonghyun Lee
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    HiDF is a high-quality deepfake dataset designed to challenge the limits of current detection models. It contains over 62,000 images and 8,000 videos generated using commercial deepfake tools, all manually curated to be indistinguishable from real content by human evaluators.

    HiDF provides a new benchmark for evaluating the realism and detectability of AI-generated media, and is intended to support the development of more robust and generalizable deepfake detection systems.

    HiDF was introduced in our paper, "HiDF: A Human-Indistinguishable Deepfake Dataset", accepted to The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025).

  12. Autonomous System Graphs (SNAP)

    • kaggle.com
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Autonomous System Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as
    Dataset updated
    Dec 16, 2021
    Dataset provided by
    Kaggle
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autonomous systems - Oregon-1

    Dataset information

    9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
    route-views between March 31 2001 and May 26 2001.

    Dataset statistics are calculated for the graphs with the lowest (March 31 2001) and highest (May 26 2001) number of nodes.

    Dataset statistics for graph with lowest number of nodes - 3 31 2001

    Nodes 10670
    Edges 22002
    Nodes in largest WCC 10670 (1.000)
    Edges in largest WCC 22002 (1.000)
    Nodes in largest SCC 10670 (1.000)
    Edges in largest SCC 22002 (1.000)
    Average clustering coefficient 0.4559
    Number of triangles 17144
    Fraction of closed triangles 0.009306
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.5

    Dataset statistics for graph with highest number of nodes - 5 26 2001

    Nodes 11174
    Edges 23409
    Nodes in largest WCC 11174 (1.000)
    Edges in largest WCC 23409 (1.000)
    Nodes in largest SCC 11174 (1.000)
    Edges in largest SCC 23409 (1.000)
    Average clustering coefficient 0.4532
    Number of triangles 19894
    Fraction of closed triangles 0.009636
    Diameter (longest shortest path) 10
    90-percentile effective diameter 4.4
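
    Several of the statistics listed above (nodes, edges, number of triangles, average clustering coefficient) can be recomputed from an undirected edge list. The sketch below is only an illustration on a toy graph: the function name is hypothetical, the real files may need their own parsing, and the convention of counting nodes with fewer than two neighbours as clustering 0 is an assumption that may differ from the one behind the published figures.

```python
from itertools import combinations


def graph_stats(edges):
    """Compute basic statistics for an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    triangle_corners = 0
    local_clustering = []
    for node, nbrs in adj.items():
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        triangle_corners += links
        k = len(nbrs)
        local_clustering.append(2 * links / (k * (k - 1)) if k > 1 else 0.0)

    return {
        "nodes": len(adj),
        "edges": n_edges,
        # Each triangle is seen once from each of its three corners.
        "triangles": triangle_corners // 3,
        "avg_clustering": (
            sum(local_clustering) / len(local_clustering) if local_clustering else 0.0
        ),
    }


# Toy graph (hypothetical): one triangle 1-2-3 plus a pendant node 4.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
stats = graph_stats(toy_edges)
print(stats)
```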

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    * AS peering information inferred from Oregon route-views ...
    oregon1_010331.txt.gz from March 31 2001
    oregon1_010407.txt.gz from April 7 2001
    oregon1_010414.txt.gz from April 14 2001
    oregon1_010421.txt.gz from April 21 2001
    oregon1_010428.txt.gz from April 28 2001
    oregon1_010505.txt.gz from May 05 2001
    oregon1_010512.txt.gz from May 12 2001
    oregon1_010519.txt.gz from May 19 2001
    oregon1_010526.txt.gz from May 26 2001

    NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
    set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26 2001.

    The nodes are uniform across all graphs in the sequence in the UF collection.
    That is, nodes do...

  13. ventilator-lightning-unsupervised-tst

    • kaggle.com
    Updated Oct 25, 2021
    Cite
    Mark Peng (2021). ventilator-lightning-unsupervised-tst [Dataset]. https://www.kaggle.com/datasets/markpeng/lightning-unsupervised-tst/suggestions
    Dataset updated
    Oct 25, 2021
    Dataset provided by
    Kaggle
    Authors
    Mark Peng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Pretrained transformer encoder model and its 128-dim z-representation vectors for Google Brain Ventilator Pressure Prediction train/test sets based on time_step, u_in, u_out, R and C columns.

    Training & Feature Extraction Notebook:

    https://www.kaggle.com/markpeng/transformer-based-ts-representation-learning

    Please upvote this dataset if you like it, thanks!

    Reference:

    George Zerveas et al. (2021). "_A Transformer-based Framework for Multivariate Time Series Representation Learning_," Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21).

    ArXiV paper: https://arxiv.org/abs/2010.02803

    Github Repository: https://github.com/gzerveas/mvts_transformer

  14. Criteo Attribution Modeling for Bidding Dataset

    • kaggle.com
    Updated May 18, 2022
    Cite
    Sharat Sachin (2022). Criteo Attribution Modeling for Bidding Dataset [Dataset]. https://www.kaggle.com/datasets/sharatsachin/criteo-attribution-modeling
    Dataset updated
    May 18, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sharat Sachin
    Description

    Data description

    This dataset represents a sample of 30 days of Criteo live traffic data. Each line corresponds to one impression (a banner) that was displayed to a user. For each banner we have detailed information about the context, if it was clicked, if it led to a conversion and if it led to a conversion that was attributed to Criteo or not. Data has been sub-sampled and anonymized so as not to disclose proprietary elements.

    Here is a detailed description of the fields (they are tab-separated in the file):

    1. timestamp: timestamp of the impression (starting from 0 for the first impression). The dataset is sorted according to timestamp.
    2. uid a unique user identifier
    3. campaign a unique identifier for the campaign
    4. conversion 1 if there was a conversion in the 30 days after the impression (independently of whether this impression was last click or not)
    5. conversion_timestamp the timestamp of the conversion or -1 if no conversion was observed
    6. conversion_id a unique identifier for each conversion (so that timelines can be reconstructed if needed). -1 if there was no conversion
    7. attribution 1 if the conversion was attributed to Criteo, 0 otherwise
    8. click 1 if the impression was clicked, 0 otherwise
    9. click_pos the position of the click before a conversion (0 for first-click)
    10. click_nb number of clicks. More than 1 if there were several clicks before a conversion
    11. cost the price paid by Criteo for this display (disclaimer: not the real price, only a transformed version of it)
    12. cpo the cost-per-order in case of attributed conversion (disclaimer: not the real price, only a transformed version of it)
    13. time_since_last_click the time since the last click (in s) for the given impression
    14. cat[1-9] contextual features associated to the display. Can be used to learn the click/conversion models. We do not disclose the meaning of these features but it is not relevant for this study. Each column is a categorical variable. In the experiments, they are mapped to a fixed dimensionality space using the Hashing Trick (see paper for reference).
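
    The field list above can be turned into a small parser, and the Hashing Trick mentioned for the cat[1-9] columns can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the column order is assumed to follow the list, the toy record is made up, and the bucket count and the use of SHA-256 as the hash function are arbitrary choices.

```python
import hashlib

# Column names follow the field list above; the order is an assumption.
COLUMNS = [
    "timestamp", "uid", "campaign", "conversion", "conversion_timestamp",
    "conversion_id", "attribution", "click", "click_pos", "click_nb",
    "cost", "cpo", "time_since_last_click",
] + [f"cat{i}" for i in range(1, 10)]


def parse_line(line):
    """Split one tab-separated record into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def hash_features(record, n_buckets=16):
    """Map the nine categorical columns into a fixed-size count vector."""
    vector = [0] * n_buckets
    for i in range(1, 10):
        name = f"cat{i}"
        key = f"{name}={record[name]}".encode("utf-8")
        # A stable hash keeps the mapping reproducible across runs.
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % n_buckets
        vector[bucket] += 1
    return vector


# Toy record (hypothetical values, not taken from the dataset).
toy_fields = ["0", "u1", "c1", "0", "-1", "-1", "0", "1", "-1", "-1",
              "0.001", "0.1", "-1"] + [str(100 + i) for i in range(9)]
record = parse_line("\t".join(toy_fields))
vector = hash_features(record)
print(vector)
```

    Collisions between categories are expected with so few buckets; a larger `n_buckets` trades memory for fewer collisions.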

    Key figures

    • 2.4 GB uncompressed
    • 16.5M impressions
    • 45K conversions
    • 700 campaigns

    Tasks

    This dataset can be used in a large scope of applications related to Real-Time-Bidding, including but not limited to:

    1. Attribution modeling: rule based, model based, etc…
    2. Conversion modeling in display advertising: the data includes cost and value used for computing Utility metrics.
    3. Offline metrics for real-time bidding

    Citation

    This dataset is released along with the following paper:

    “Attribution Modeling Increases Efficiency of Bidding in Display Advertising” Eustache Diemert*, Julien Meynet* (Criteo AI Lab), Damien Lefortier (Facebook), Pierre Galland (Criteo) (*authors contributed equally) published in “2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017)”

    When using this dataset, please cite the paper with following bibtex (final ACM bibtex coming soon):

    @inproceedings{DiemertMeynet2017,
      author = {{Diemert Eustache, Meynet Julien} and Galland, Pierre and Lefortier, Damien},
      title={Attribution Modeling Increases Efficiency of Bidding in Display Advertising},
      publisher = {ACM},
      pages={To appear},
      booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August, 14, 2017},
      year = {2017}
    }
    
  15. Sequential Vote Results of Swiss Referenda

    • data.niaid.nih.gov
    Updated Aug 28, 2020
    Cite
    Thiran, Patrick (2020). Sequential Vote Results of Swiss Referenda [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3984924
    Explore at:
    Dataset updated
    Aug 28, 2020
    Dataset provided by
    Thiran, Patrick
    Kristof, Victor
    Grossglauser, Matthias
    Immer, Alexander
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Switzerland
    Description

    This repo contains the data introduced in

    Immer, A.*, Kristof, V.*, Grossglauser, M., Thiran, P., Sub-Matrix Factorization for Real-Time Vote Prediction, KDD 2020

    These data have been collected from OpenData.Swiss every two minutes on two different referendum vote days: May 19, 2019, and February 9, 2020. We use these data to make real-time predictions of the referenda outcome on www.predikon.ch. We publish here the raw data, as retrieved in JSON format from the API. We also provide a python script to help scraping the JSON files.

    After unzipping the datasets, you can scrape the data by referendum vote day by doing:

    from scraper import scrape_referenda

    # Scrape the data from February 9, 2020.
    data_dir = 'path/to/2020-02-09'
    data = scrape_referenda(data_dir)

    The data variable will be a list of datum dictionaries of the following structure:

    { "vote": 6290, "municipality": 1, "timestamp": "2020-02-09T15:23:10", "num_yes": 222, "num_no": 482, "num_valid": 704, "num_total": 709, "num_eligible": 1407, "yes_percent": 0.3153409090909091, "turnout": 0.503909026297086 }

    The datum is as follows:

    vote: vote ID as defined by OpenData.Swiss

    municipality: municipality ID as defined by OpenData.Swiss

    timestamp: date and time at which the JSON file was published on OpenData.Swiss

    num_yes: number of "yes" in the municipality

    num_no: number of "no" in the municipality

    num_valid: number of valid ballots (the ones counting for the results)

    num_total: total number of ballots (including invalid ones)

    num_eligible: number of registered voters

    yes_percent: percentage of "yes" (computed as num_yes / num_valid)

    turnout: turnout to the vote (computed as num_total / num_eligible)
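
    The two derived fields can be checked against the raw counts. The sketch below recomputes them for the example datum shown above, using the formulas given in the field descriptions; the helper name `recompute` is hypothetical and is not part of the provided scraper script.

```python
import math

# Example datum copied from the description above.
datum = {
    "vote": 6290,
    "municipality": 1,
    "timestamp": "2020-02-09T15:23:10",
    "num_yes": 222,
    "num_no": 482,
    "num_valid": 704,
    "num_total": 709,
    "num_eligible": 1407,
    "yes_percent": 0.3153409090909091,
    "turnout": 0.503909026297086,
}


def recompute(d):
    """Recompute the derived fields from the raw ballot counts."""
    return {
        "yes_percent": d["num_yes"] / d["num_valid"],
        "turnout": d["num_total"] / d["num_eligible"],
    }


derived = recompute(datum)
for field, value in derived.items():
    matches = math.isclose(value, datum[field])
    print(f"{field}: recomputed {value}, matches published value: {matches}")
```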

    Don't hesitate to reach out to us if you have any questions!

    To cite this dataset:

    @inproceedings{immer2020submatrix, author = {Immer, Alexander and Kristof, Victor and Grossglauser, Matthias and Thiran, Patrick}, title = {Sub-Matrix Factorization for Real-Time Vote Prediction}, year = {2020}, booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining}, }

  16. Geotagged Digital Traces

    • explore.openaire.eu
    • ekoizpen-zientifikoa.ehu.eus
    Updated May 18, 2023
    Cite
    Ricardo Munoz-Cancino; Sebastián A. Ríos; Manuel Graña (2023). Geotagged Digital Traces [Dataset]. http://doi.org/10.5281/zenodo.7949306
    Explore at:
    Dataset updated
    May 18, 2023
    Authors
    Ricardo Munoz-Cancino; Sebastián A. Ríos; Manuel Graña
    Description

    This dataset, divided into files by city, contains geotagged digital traces collected from different social media platforms, detailed below.

    • Tweets - Cheng et al. [1]
    • Gowalla [2]
    • Tweets - Lamsal [3]
    • YELP [4]
    • Tweets - Kejriwal et al. [5]
    • Geotagged Tweets [6]
    • UrbanActivity [7]
    • Brightkite [8]
    • Weeplaces [8]
    • Flickr [9]
    • Foursquare [10]

    Each file is named according to the city to which the digital traces were associated and contains the columns:

    Source: name of the source platform

    Event_date: date associated with the digital trace

    Lat: latitude of the digital trace

    Lng: longitude of the digital trace

    The definition of city/town used is provided by Simplemaps [11], which considers a city/town any inhabited place as determined by U.S. government agencies. The location of cities and their respective centers were obtained from the World Cities Database provided by the same company.

    A specific group of these cities was utilized for the research presented in the article submitted to Sensors Journal: Muñoz-Cancino, R., Rios, S. A., & Graña, M. (2023). Clustering cities over features extracted from multiple virtual sensors measuring micro-level activity patterns allows to discriminate large-scale city characteristics. Sensors, Under Review. Comprehensive guidelines and the selection criteria can be found in the abovementioned article.

    References

    [1] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: A content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 759-768, New York, NY, USA, 2010. Association for Computing Machinery.

    [2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1082-1090, New York, NY, USA, 2011. Association for Computing Machinery.

    [3] Yunhe Feng and Wenjun Zhou. Is working from home the new norm? an observational study based on a large geo-tagged covid-19 twitter dataset, 2020.

    [4] Yelp Inc. Yelp Open Dataset, 2021. Retrieved from https://www.yelp.com/dataset. Accessed October 26, 2021.

    [5] Mayank Kejriwal and Sara Melotte. A Geo-Tagged COVID-19 Twitter Dataset for 10 North American Metropolitan Areas, January 2021.

    [6] Rabindra Lamsal. Design and analysis of a large-scale covid-19 tweets dataset. Applied Intelligence, 51(5):2790-2804, 2021.

    [7] Geraud Le Falher, Aristides Gionis, and Michael Mathioudakis. Where is the Soho of Rome? Measures and algorithms for finding similar neighborhoods in cities. In 9th AAAI Conference on Web and Social Media - ICWSM 2015, Oxford, United Kingdom, May 2015.

    [8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. Exploiting geographical neighborhood characteristics for location recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 739-748, New York, NY, USA, 2014. Association for Computing Machinery.

    [9] Hatem Mousselly-Sergieh, Daniel Watzinger, Bastian Huber, Mario Doller, Elood Egyed-Zsigmond, and Harald Kosch. World-wide scale geotagged image dataset for automatic image annotation and reverse geotagging. In Proceedings of the 5th ACM Multimedia Systems Conference, MMSys '14, pages 47-52, New York, NY, USA, 2014. Association for Computing Machinery.

    [10] Dingqi Yang, Daqing Zhang, Vincent W. Zheng, and Zhiyong Yu. Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129-142, 2015.

    [11] Simple Maps. Basic World Cities Database, 2021. Retrieved from https://simplemaps.com/data/world-cities. Accessed September 3, 2021.

  17. criteo-attribution-dataset

    • huggingface.co
    Updated Aug 14, 2017
    Cite
    CRITEO (2017). criteo-attribution-dataset [Dataset]. https://huggingface.co/datasets/criteo/criteo-attribution-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2017
    Dataset provided by
    Criteo (https://criteo.com/)
    Authors
    CRITEO
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Criteo Attribution Modeling for Bidding Dataset

    This dataset is released along with the paper "Attribution Modeling Increases Efficiency of Bidding in Display Advertising" by Eustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), and Pierre Galland (Criteo) (*authors contributed equally). 2017 AdKDD & TargetAd Workshop, in conjunction with the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017). When using this dataset, please cite the paper… See the full description on the dataset page: https://huggingface.co/datasets/criteo/criteo-attribution-dataset.

  18. Generative Dog Images

    • kaggle.com
    zip
    Updated Jun 17, 2021
    Cite
    Ching-Yuan Bai (2021). Generative Dog Images [Dataset]. https://www.kaggle.com/andrewcybai/generative-dog-images
    Explore at:
    Available download formats: zip (93845035418 bytes)
    Dataset updated
    Jun 17, 2021
    Authors
    Ching-Yuan Bai
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Background

    The first-ever large-scale generative modeling research competition, Generative Dog Images, was held on Kaggle in the summer of 2019. More than 900 teams participated and submitted a total of 10k+ generated samples, 1.6k of which were selected as the final submissions to rank on the private leaderboard. We are releasing the competition submissions to facilitate research on generative modeling metric design, particularly on detecting training sample memorization, intentional or not.

    Content

    Each competition submission consists of 10k generated samples of dog images from a generative model trained on the Stanford Dogs dataset. As expected, participants were incentivized to optimize for the objective, and many exploited the insensitivity of current popular generative modeling metrics (e.g. IS, FID) to training sample memorization. We provide manual labels of the type of intentional memorization technique adopted (if any) for each submission. Details regarding the labels can be found in the description of the labels.csv file. We also provide human-assessed image quality annotations for individual images.

    Acknowledgements

    Huge thanks to all the participants in the Generative Dog Images research competition for providing all the well-tuned models as well as feedback during the competition. The competition result analysis is published as a conference paper. If you find this dataset useful, please cite the following:

    @inproceedings{bai2021genmem,
      author = {Ching-Yuan Bai and Hsuan-Tien Lin and Colin Raffel and Wendy Chih-wen Kan},
      title = {On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition},
      booktitle = {Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)},
      year = 2021,
      month = aug
    }

    Inspiration

    The Memorization-informed Fréchet Inception Distance (MiFID) was proposed and adopted as the benchmark metric during the competition to handle the training sample memorization issue. It works well in a competition setting, but obvious flaws make it less than ideal in a general research setting.

    Are there any other alternatives?

    The large number and great diversity of models in this dataset can serve as a testing ground for newly developed benchmark metrics.
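For readers unfamiliar with the idea behind a memorization-informed Fréchet distance, the following is a minimal sketch of the general shape of such a metric, not the competition's actual implementation. It makes several simplifying assumptions that are not taken from the dataset description: feature covariances are treated as diagonal (so the Fréchet distance has a closed form without a matrix square root), memorization is measured as the mean minimum cosine distance from each generated feature vector to the training feature vectors, and the threshold `tau` and constant `eps` are hypothetical values chosen only for illustration. The exact definition should be checked against the cited paper.

```python
import math


def frechet_distance_diag(mu_a, var_a, mu_b, var_b):
    """Fréchet distance between two Gaussians, ASSUMING diagonal covariances.

    With diagonal covariances the trace term reduces to a sum of squared
    differences of per-dimension standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


def cosine_distance(u, v):
    """Standard cosine distance, 1 - cos(u, v), for non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def memorization_distance(generated, training):
    """Mean, over generated samples, of the distance to the nearest training sample."""
    nearest = [min(cosine_distance(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)


def penalized_score(fid, mem_dist, tau=0.1, eps=1e-6):
    """Inflate the Fréchet score when generated samples sit too close to training data.

    tau and eps are hypothetical values for this sketch.
    """
    if mem_dist < tau:
        return fid / (mem_dist + eps)
    return fid


# Toy feature vectors: the generated sample is an exact copy of a training sample.
training_features = [[1.0, 0.0], [0.0, 1.0]]
generated_features = [[1.0, 0.0]]

base = frechet_distance_diag([0.0], [1.0], [3.0], [4.0])
dist = memorization_distance(generated_features, training_features)
print(base, dist, penalized_score(base, dist))
```

In this toy example the memorization distance is zero, so the penalized score is far larger than the base distance, which is the behavior a memorization-aware metric is meant to have.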


Cite
External Data Source (2019). KDD Cup 1999 Data [Dataset]. http://doi.org/10.23721/100/1478801

Data from: KDD Cup 1999 Data

DS-0937

78 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 19, 2019
Authors
External Data Source
Description

This is the data set used for the intrusion detector learning task in the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. ; gcounsel@ics.uci.edu
