9 datasets found
  1. Data from: Login Data Set for Risk-Based Authentication

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2022
    Cite
    Thunem, Sigurd (2022). Login Data Set for Risk-Based Authentication [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6782155
    Dataset provided by
    Jørgensen, Paul René
    Lo Iacono, Luigi
    Wiefling, Stephan
    Thunem, Sigurd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    | Feature | Data Type | Description | Range or Example |
    |---|---|---|---|
    | IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
    | Country | String | Country derived from the IP address | US |
    | Region | String | Region derived from the IP address | New York |
    | City | String | City derived from the IP address | Rochester |
    | ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
    | User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
    | OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
    | Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
    | Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
    | User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
    | Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
    | Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
    | Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
    | Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
    | Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and temporal order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    The device types are identical to those in the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.

    The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    ASNs with values >= 500,000

    IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
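A minimal sketch of how these placeholder entries could be flagged before analysis. It assumes each login record is available as a dict with the hypothetical keys `ip` and `asn`; the column names in the published file may differ, and the sample records are made up.

```python
import ipaddress

# Placeholder markers described above: addresses in 10.0.0.0/8
# and ASNs at or above 500,000.
PLACEHOLDER_NET = ipaddress.ip_network("10.0.0.0/8")
PLACEHOLDER_ASN_MIN = 500_000


def is_placeholder(record):
    """Return True if the record carries an unrealistic IP address or ASN.

    `record` is assumed to be a dict with the hypothetical keys
    "ip" (string) and "asn" (integer).
    """
    ip = ipaddress.ip_address(record["ip"])
    return ip in PLACEHOLDER_NET or record["asn"] >= PLACEHOLDER_ASN_MIN


# Made-up records for illustration only.
records = [
    {"ip": "10.4.1.9", "asn": 29695},        # placeholder IP
    {"ip": "81.167.144.58", "asn": 500123},  # placeholder ASN
    {"ip": "81.167.144.58", "asn": 29695},   # neither marker present
]
flags = [is_placeholder(r) for r in records]
```

Whether to drop or merely tag such rows depends on the analysis; since the description states they do not affect the Freeman et al. risk scores, tagging may be enough.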

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
      author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
      title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
      journal = {{ACM} {Transactions} on {Privacy} and {Security}},
      doi = {10.1145/3546069},
      publisher = {ACM},
      year = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    ¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.

  2. Transparency in Keyword Faceted Search: a dataset of Google Shopping html...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    Updated Jan 24, 2020
    Cite
    Hoang Van Tien (2020). Transparency in Keyword Faceted Search: a dataset of Google Shopping html pages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1491556
    Dataset provided by
    Petrocchi Marinella
    Cozza Vittoria
    Hoang Van Tien
    De Nicola Rocco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of around 2,000 HTML pages. These web pages contain the search results returned for queries for different products, issued by a set of synthetic users surfing Google Shopping (US version) from different locations in July 2016.

    The name of each file in the collection indicates the location from which the search was run, the userID, and the searched product: no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html

    The locations are Philippines (PHI), United States (US), and India (IN). The userIDs are 26 to 30 for users searching from the Philippines, 1 to 5 from the US, and 11 to 15 from India.
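The naming scheme above could be parsed roughly as follows. This is a sketch under assumptions: the trailing `#` is taken to be a running number, the product keyword is taken to appear verbatim in the name, and the sample file name is hypothetical rather than copied from the dataset.

```python
import re

# Pattern derived from the documented naming scheme. "#" is assumed to be
# a running number; the location codes are the three listed above.
FILENAME_RE = re.compile(
    r"^no_email_(?P<location>PHI|US|IN)_(?P<user_id>\d+)"
    r"\.(?P<product>.+)\.shopping_testing\.(?P<index>\d+)\.html$"
)


def parse_filename(name):
    """Split a result file name into location, user ID, product, and index.

    Returns None if the name does not follow the assumed pattern.
    """
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["user_id"] = int(fields["user_id"])
    fields["index"] = int(fields["index"])
    return fields


# Hypothetical file name built from the documented pattern.
parsed = parse_filename("no_email_IN_12.MP3 player.shopping_testing.1.html")
```

Returning `None` for non-matching names makes it easy to spot files that deviate from the scheme instead of silently misparsing them.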

    Products were chosen according to 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television, etc.).

    In the following, we describe how the search results have been collected.

    Each user has a fresh profile. Creating a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page.

    To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links.

    A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites could be designed to behave according to user agents, as witnessed by the differences between the mobile and desktop versions of the same website.

    The prices are the retail ones displayed by Google Shopping in US dollars (thus, excluding shipping fees).

    Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM. OpenWPM is automatised with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each of them with their own associated cookies.

    The experiments run, on average, 24 hours. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (i.e., to India), via tunneling in SOCKS proxies. This way, all commands are simultaneously distributed over all proxies. The experiments adopt the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. Also, for each query, we consider the first page of results, counting 40 products. Among them, the focus of the experiments is mostly on the top 10 and top 3 results.

    Due to connection errors, one of the Philippine profiles has no associated results. Also, for the Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for the US, no results were returned for totes and umbrellas.

    The search results have been analyzed to check whether there was evidence of price steering based on users' location.

    One term of usage applies:

    In any research product whose findings are based on this dataset, please cite

    @inproceedings{DBLP:conf/ircdl/CozzaHPN19,
      author = {Vittoria Cozza and Van Tien Hoang and Marinella Petrocchi and Rocco {De Nicola}},
      title = {Transparency in Keyword Faceted Search: An Investigation on Google Shopping},
      booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings},
      pages = {29--43},
      year = {2019},
      crossref = {DBLP:conf/ircdl/2019},
      url = {https://doi.org/10.1007/978-3-030-11226-4_3},
      doi = {10.1007/978-3-030-11226-4_3},
      timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
      biburl = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

  3. VPN-nonVPN dataset

    • impactcybertrust.org
    Updated Jan 19, 2019
    Cite
    External Data Source (2019). VPN-nonVPN dataset [Dataset]. http://doi.org/10.23721/100/1478793
    Authors
    External Data Source
    Description

    To generate a representative dataset of real-world traffic in ISCX we defined a set of tasks, assuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.)

    We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:

    Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice-calls using hangouts, even though browsing is not the main activity, we captured several browsing flows.

    Email: The traffic samples generated using a Thunderbird client, and Alice and Bob Gmail accounts. The clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one client and IMAP/SSL in the other.

    Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called Pidgin [14].

    Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo services using Chrome and Firefox.

    File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.

    VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.

    P2P: This label is used to identify file-sharing protocols like Bittorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.

    The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.

    To facilitate the labeling process, when capturing the traffic all unnecessary services and applications were closed. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets with source or destination IP, the address of the local client (Alice or Bob).
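The capture filter described above (keep only packets whose source or destination is the local client) can be sketched as follows. The address `192.0.2.10` is a documentation placeholder, not the address the authors used. The generated string relies on the `host` primitive of the pcap filter syntax, which matches a packet when either its source or its destination equals the given address.

```python
import ipaddress


def capture_filter(client_ip):
    """Build a pcap filter expression matching traffic to or from client_ip."""
    ipaddress.ip_address(client_ip)  # raises ValueError on a malformed address
    return f"host {client_ip}"


def belongs_to_client(src, dst, client_ip):
    """Pure-Python equivalent of the filter for already-parsed packets."""
    return client_ip in (src, dst)


# Placeholder addresses from the documentation ranges.
expr = capture_filter("192.0.2.10")
from_client = belongs_to_client("192.0.2.10", "198.51.100.7", "192.0.2.10")
unrelated = belongs_to_client("203.0.113.5", "198.51.100.7", "192.0.2.10")
```

The expression in `expr` is the kind of string that would be passed to a capture tool such as tcpdump; the second function is only useful when post-filtering packets that were already captured.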

    The full research paper outlining the details of the dataset and its underlying principles:

    Gerard Draper Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", in Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.
    ISCXFlowMeter is written in Java; it reads the pcap files and creates the csv file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic; the full packets in pcap format and the csv files (flows generated by ISCXFlowMeter) are publicly available for researchers.

    For more information contact cic@unb.ca.

    The UNB ISCX Network Traffic Dataset content
    Traffic: Content
    Web Browsing: Firefox and Chrome
    Email: SMTPS, POP3S and IMAPS
    Chat: ICQ, AIM, Skype, Facebook and Hangouts
    Streaming: Vimeo and Youtube
    File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
    VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
    P2P: uTorrent and Transmission (Bittorrent)

  4. Supplementary files for Collection of Datasets with DNS over HTTPS Traffic

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kamil Jeřábek; Karel Hynek; Tomáš Čejka; Ondřej Ryšavý (2022). Supplementary files for Collection of Datasets with DNS over HTTPS Traffic [Dataset]. http://doi.org/10.5281/zenodo.6024914
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kamil Jeřábek; Karel Hynek; Tomáš Čejka; Ondřej Ryšavý
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DNS over HTTPS (DoH) is becoming a default option for domain resolution in modern privacy-aware software. Therefore, research has already focused on various aspects; however, a comprehensive dataset from an actual production network is still missing. In this paper, we present a novel dataset, which comprises multiple PCAP files of DoH traffic. The captured traffic is generated towards various DoH providers to cover differences between DoH server implementations and configurations. In addition to generated traffic, we also provide real network traffic captured on high-speed backbone lines of a large Internet Service Provider with around half a million users. Network identifiers (excluding network identifiers of DoH resolvers) in the real network traffic (e.g., IP addresses and transmitted content) were anonymized, but the important characteristics of the traffic can still be obtained from the data and used, e.g., for network traffic classification research. The real network traffic dataset contains DoH and also non-DoH HTTPS traffic as observed at the collection points in the network.

    This repository provides supplementary files for the "Collection of Datasets with DNS over HTTPS Traffic" :

    ─── supplementary_files  | - Directory with supplementary files (scripts, DoH resolver list) used for dataset creation
      ├── chrome       | - Generation scripts for Chrome browser and visited websites during generation
      ├── doh_resolvers   | - The list of DoH resolvers used for filter creation during ISP backbone capture
      ├── firefox      | - Generation scripts for Firefox browser and visited websites during generation
      └── pcap-anonymizer  | - Anonymization script of real backbone captures

    Collection of datasets:

  5. Google Patents Public Data

    • kaggle.com
    zip
    Updated Sep 19, 2018
    Cite
    Google BigQuery (2018). Google Patents Public Data [Dataset]. https://www.kaggle.com/datasets/bigquery/patents
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

    Context

    Google Patents Public Data, provided by IFI CLAIMS Patent Services, is a worldwide bibliographic and US full-text dataset of patent publications. Patent information accessibility is critical for examining new patents, informing public policy decisions, managing corporate investment in intellectual property, and promoting future scientific innovation. The growing number of available patent data sources means researchers often spend more time downloading, parsing, loading, syncing and managing local databases than conducting analysis. With these new datasets, researchers and companies can access the data they need from multiple sources in one place, thus spending more time on analysis than data preparation.

    Content

    The Google Patents Public Data dataset contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.

    Acknowledgements

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:patents

    For more info, see the documentation at https://developers.google.com/web/tools/chrome-user-experience-report/

    “Google Patents Public Data” by IFI CLAIMS Patent Services and Google is licensed under a Creative Commons Attribution 4.0 International License.

    Banner photo by Helloquence on Unsplash

  6. Data from: Detecting Degradation of Web Browsing Quality of Experience

    • figshare.com
    txt
    Updated Nov 2, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alexis Huet; Zied Ben Houidi; Bertrand Mathieu; Dario Rossi (2020). Detecting Degradation of Web Browsing Quality of Experience [Dataset]. http://doi.org/10.6084/m9.figshare.13089854.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexis Huet; Zied Ben Houidi; Bertrand Mathieu; Dario Rossi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset represents 222k samples of web browsing session measurements collected during 2.5 months using the Web View platform (https://webview.orange.com) [1]. Web View allows different probes to automatically execute multiple web sessions in a real end-user environment. In our test campaign, we use 17 machines, spread in three different locations worldwide (Lannion, Paris and Mauritius islands), different ISPs and access technologies (ADSL, WiFi and fiber) for a total of 9 combinations, and up to 12 browser versions, which include various versions of Chrome and Firefox. Each machine can request a different browser viewport, can enable or disable the AdBlock plugin to emulate different user preferences and can request a specific network protocol (HTTP/1, HTTP/2 or QUIC).

    We leverage this dataset to phrase the QoE degradation detection issue as a change point detection problem in [2]. Our results, beyond showing feasibility, warn about the exclusive use of QoE indicators that are very close to content, as changes in the content space can lead to false alarms that are not tied to network-related problems.

    If you use these datasets in your research, you can reference the appropriate papers:

    [1] A. Saverimoutou, B. Mathieu, and S. Vaton, “Web View: A measurement platform for depicting web browsing performance and delivery,” IEEE Communications Magazine, vol. 58, no. 3, pp. 33–39, 2020.
    [2] A. Huet, Z. Ben Houidi, B. Mathieu, D. Rossi, “Detecting degradation of web browsing quality of experience,” 16th International Conference on Network and Service Management (CNSM), 2020.

    Each row represents one experiment, and the columns are as follows:

    - wwwName: Target page
    - timestamp: Timestamp with format YYYY-MM-DD hh:mm:ss
    - browserUsed: Internet browser and version
    - requestedProtocol: Requested L7 protocol
    - adBlocker: Whether adBlocker is used or not
    - networkIface: Network interface
    - winSize: Window size
    - visiblePortion: Visible portion of the page that is above the fold in percents
    - h1Share: Share of the traffic coming from HTTP/1 in percents
    - h2Share: Share of the traffic coming from HTTP/2 in percents
    - hqShare: Share of the traffic coming from QUIC in percents
    - pushShare: Share of the traffic coming from HTTP/2 Server Push in percents
    - nbRes: Number of objects of the page
    - nbResNA: Number of objects coming from North America
    - nbResSA: Number of objects coming from South America
    - nbResEU: Number of objects coming from Europe
    - nbResAS: Number of objects coming from Asia
    - nbResAF: Number of objects coming from Africa
    - nbResOC: Number of objects coming from Oceania
    - nbResUKN: Number of objects coming from unknown provenance
    - nbHTTPS: Number of objects coming from an HTTPS connection
    - nbHTTP: Number of objects coming from an HTTP connection
    - nbDomNA: Number of different domain names coming from North America
    - nbDomSA: Number of different domain names coming from South America
    - nbDomEU: Number of different domain names coming from Europe
    - nbDomAS: Number of different domain names coming from Asia
    - nbDomAF: Number of different domain names coming from Africa
    - nbDomOC: Number of different domain names coming from Oceania
    - firstPaint: First paint time (ms)
    - tfvr: Time for Full Visual Rendering (ms)
    - dom: DOM time (ms)
    - plt: Page Load Time (ms)
    - machine: Machine name (containing location information)
    - categoryType: Category of the web page
    - pageSize: Total web page size (bytes)
    - receiveTime: Total receive time from HAR (ms)
    - transferRate: Transfer rate (bps)
    - id: Unique identification of the current experiment
    - config: Identification for the tuple (browserUsed, requestedProtocol, adBlocker, networkIface, winSize, machine, wwwName), i.e. the probe configuration with target wwwName
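To illustrate the change point framing mentioned in the description, here is a minimal sketch that locates a single mean shift in a series of `plt` values by least squares. It is a generic textbook-style detector, not the method used in [2], and the sample values are made up rather than taken from the dataset.

```python
def single_change_point(values):
    """Return the index k that best splits values into two constant-mean segments.

    The split minimises the summed squared deviation of values[:k] and
    values[k:] from their respective means. Returns None when there are
    fewer than two samples.
    """
    n = len(values)
    if n < 2:
        return None

    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_k, best_cost = None, float("inf")
    for k in range(1, n):  # both segments must be non-empty
        cost = sse(values[:k]) + sse(values[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Made-up page load times (ms) for one probe configuration:
# the level shifts upward starting at index 5.
plt_ms = [810, 790, 805, 800, 795, 1400, 1390, 1410, 1405, 1395]
change_at = single_change_point(plt_ms)
```

In practice the series would be grouped per `config` value and ordered by `timestamp` first, so that a detected shift refers to one probe configuration and one target page.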

  7. Dataset used for fingerprinting of DNS over HTTPS responses.

    • zenodo.org
    • data.niaid.nih.gov
    Updated Nov 29, 2021
    Cite
    Karel Hynek; Tomas Cejka (2021). Dataset used for fingerprinting of DNS over HTTPS responses. [Dataset]. http://doi.org/10.5281/zenodo.4039588
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Karel Hynek; Tomas Cejka
    Description

    The dataset consists of multiple different data sources:

    1. DoH enabled Firefox on Linux OS
    2. DoH enabled Firefox on Windows 10 OS
    3. DoH enabled Chrome on Windows 10 OS

    We captured the traffic from the DoH-enabled web browsers using tcpdump. To automate the process of traffic generation, we installed Google Chrome and Mozilla Firefox into separate virtual machines and controlled them with the Selenium framework ([...] shows detailed information about the browsers and environments used). Selenium simulates a user's browsing according to a predefined script and a list of domain names (i.e., URLs from Alexa's top websites list in our case). Selenium was configured to visit pages in random order multiple times. For capturing the traffic, we used the default settings of each browser. We did not disable the DNS cache of the browser, and the random order of visiting webpages ensures that the dataset contains traces influenced by DNS caching mechanisms. Each virtual machine was configured to export TLS cryptographic keys, which were used for decrypting the traffic using the Wireshark application.
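The "random order, multiple times" visiting strategy could be prepared along these lines. This is only a sketch of the scheduling step under assumptions: the function name, the repeat count, and the placeholder URLs are hypothetical, and the actual browser control (for example passing each entry to Selenium's `driver.get(url)`) is left out because it needs a running browser.

```python
import random


def build_visit_schedule(urls, repeats, seed=None):
    """Return every URL `repeats` times, shuffled into random order."""
    rng = random.Random(seed)  # seeded generator keeps a run reproducible
    schedule = [url for url in urls for _ in range(repeats)]
    rng.shuffle(schedule)
    return schedule


# Placeholder domains; the authors used URLs from Alexa's top websites list.
urls = ["https://example.com", "https://example.org", "https://example.net"]
schedule = build_visit_schedule(urls, repeats=3, seed=42)
```

Because repeated visits to the same page end up at random distances from each other, some of them hit a warm DNS cache and some do not, which is the effect the description relies on.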

    The Wireshark text output of the decrypted traffic is provided in the dataset files. Detailed information about each file is provided in the dataset README.

    Acknowledgment

    This work was supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 833418 and also by the Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/18 funded by the MEYS of the Czech Republic and the project Reg. No. CZ.02.1.01/0.0/0.0/16_013/0001797 co-funded by the MEYS and ERDF

  8. Dataset used for detecting DNS over HTTPS by Machine Learning.

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 28, 2020
    Cite
    Vekshin, Dmitrii (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818004
    Dataset provided by
    Cejka, Tomas
    Hynek, Karel
    Vekshin, Dmitrii
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset consists of three different data sources:

    DoH enabled Firefox

    DoH enabled Google Chrome

    Cloudflared DoH proxy

    The capture of web browser data was made using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening to the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 pages visited by Chrome.

    The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

    The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV file with flow data, and a CSV file with extracted features.

    The CSV with extracted features has the following data fields:

    • Label (1 - DoH, 0 - regular HTTPS)
    • Data source
    • Duration
    • Minimal Inter-Packet Delay
    • Maximal Inter-Packet Delay
    • Average Inter-Packet Delay
    • Variance of incoming packet sizes
    • Variance of outgoing packet sizes
    • Ratio of the number of incoming and outgoing bytes
    • Ratio of the number of incoming and outgoing packets
    • Average of Incoming Packet sizes
    • Average of Outgoing Packet sizes
    • The median value of Incoming Packet sizes
    • The median value of outgoing Packet sizes
    • The ratio of bursts and pauses
    • Number of bursts
    • Number of pauses
    • Autocorrelation
    • Transmission symmetry in the 1st third of connection
    • Transmission symmetry in the 2nd third of connection
    • Transmission symmetry in the last third of connection

    The observed network traffic does not contain privacy-sensitive information.
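As an illustration of what several of the listed fields measure, the following minimal Python sketch computes a subset of them from a list of packets. The packet representation (timestamp in seconds, size in bytes, direction `"in"` or `"out"`), the function name `flow_features`, and the dictionary keys are assumptions made for this sketch; the dataset description does not state the authors' exact definitions (for example, whether the variance is a population or a sample variance), so the results need not match the values in the CSV.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute a few per-flow statistics.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    where direction is "in" or "out" (a hypothetical representation).
    Requires at least two packets and at least one packet per direction.
    """
    packets = sorted(packets, key=lambda p: p[0])  # order by timestamp
    times = [t for t, _, _ in packets]
    # Inter-packet delays between consecutive packets of the flow.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, d in packets if d == "in"]
    outgoing = [size for _, size, d in packets if d == "out"]
    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; the source does not say which.
        "var_in_size": pvariance(incoming),
        "var_out_size": pvariance(outgoing),
        "byte_ratio_in_out": sum(incoming) / sum(outgoing),
        "packet_ratio_in_out": len(incoming) / len(outgoing),
        "avg_in_size": mean(incoming),
        "avg_out_size": mean(outgoing),
        "median_in_size": median(incoming),
        "median_out_size": median(outgoing),
    }


# Invented example flow, not taken from the dataset.
example = [(0.0, 100, "out"), (0.5, 300, "in"), (1.0, 100, "out"), (2.0, 500, "in")]
features = flow_features(example)
```

The burst, pause, autocorrelation, and transmission-symmetry fields are left out because the description does not define them.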

    The zip file structure is:

    |-- data
    |   |-- extracted-features ... extracted features used in ML for DoH recognition
    |   |   |-- chrome
    |   |   |-- cloudflared
    |   |   `-- firefox
    |   |-- flows ................ exported flow data
    |   |   |-- chrome
    |   |   |-- cloudflared
    |   |   `-- firefox
    |   `-- pcaps ................ raw PCAP data
    |       |-- chrome
    |       |-- cloudflared
    |       `-- firefox
    |-- LICENSE
    `-- README.md

    When using this dataset, please cite the original work as follows:

    @inproceedings{vekshin2020,
      author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
      title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
      year = {2020},
      isbn = {9781450388337},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3407023.3409192},
      doi = {10.1145/3407023.3409192},
      booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
      articleno = {87},
      numpages = {8},
      keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
      location = {Virtual Event, Ireland},
      series = {ARES '20}
    }

  9. Dataset used for detecting DNS over HTTPS by Machine Learning.

    • zenodo.org
    Updated Oct 28, 2020
    Cite
    Dmitrii Vekshin; Karel Hynek; Tomas Cejka (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. http://doi.org/10.5281/zenodo.3818005
    Dataset updated
    Oct 28, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dmitrii Vekshin; Karel Hynek; Tomas Cejka
    Description

    The dataset consists of three different data sources:

    1. DoH enabled Firefox
    2. DoH enabled Google Chrome
    3. Cloudflared DoH proxy

    The web browser data was captured using the Selenium framework, which simulated typical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Firefox and 1,000 by Chrome.

    The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

    The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide it as a CSV file with the following data fields:

    • Label (1 - DoH, 0 - regular HTTPS)
    • Data source
    • Duration
    • Minimal Inter-Packet Delay
    • Maximal Inter-Packet Delay
    • Average Inter-Packet Delay
    • Variance of incoming packet sizes
    • Variance of outgoing packet sizes
    • Ratio of the number of incoming and outgoing bytes
    • Ratio of the number of incoming and outgoing packets
    • Average of incoming packet sizes
    • Average of outgoing packet sizes
    • Median of incoming packet sizes
    • Median of outgoing packet sizes
    • The ratio of bursts and pauses
    • Number of bursts
    • Number of pauses
    • Autocorrelation
    • Transmission symmetry in the 1st third of connection
    • Transmission symmetry in the 2nd third of connection
    • Transmission symmetry in the last third of connection

    The observed network traffic does not contain privacy-sensitive information.
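The flow counts quoted above imply a strongly imbalanced label distribution. The short Python sketch below makes the arithmetic explicit; it uses only the two numbers from the description, and because the DoH count is given as "around 33,000", the result is approximate. The names `class_share` and `majority_baseline_accuracy` are introduced here for illustration.

```python
TOTAL_FLOWS = 1_128_904    # total flows, from the dataset description
DOH_FLOWS_APPROX = 33_000  # "around 33,000" DoH flows, so only approximate


def class_share(positives, total):
    """Fraction of flows that carry the positive (DoH) label."""
    return positives / total


doh_share = class_share(DOH_FLOWS_APPROX, TOTAL_FLOWS)
# A classifier that always answers "regular HTTPS" would reach this accuracy,
# so plain accuracy says little about DoH detection quality on this data.
majority_baseline_accuracy = 1 - doh_share

print(f"DoH share of all flows: {doh_share:.2%}")
print(f"Always-predict-HTTPS accuracy: {majority_baseline_accuracy:.2%}")
```

With roughly 3% of flows labeled DoH, per-class metrics such as precision and recall are likely to be more informative than overall accuracy.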


Cite
Thunem, Sigurd (2022). Login Data Set for Risk-Based Authentication [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6782155

Data from: Login Data Set for Risk-Based Authentication

Dataset updated
Jun 30, 2022
Dataset provided by
Jørgensen, Paul René
Lo Iacono, Luigi
Wiefling, Stephan
Thunem, Sigurd
License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

Login Data Set for Risk-Based Authentication

Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.

Overview

The data set contains the following features related to each login attempt on the SSO:

| Feature | Data Type | Description | Range or Example |
|---|---|---|---|
| IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
| Country | String | Country derived from the IP address | US |
| Region | String | Region derived from the IP address | New York |
| City | String | City derived from the IP address | Rochester |
| ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
| User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
| OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
| Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
| Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1] |
| User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
| Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
| Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
| Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
| Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
| Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |
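The feature table maps naturally onto a typed record. The following Python sketch parses a subset of the columns from CSV text into a dataclass. The column header names, the `True`/`False` spelling of booleans, the handling of an empty RTT field, and the two sample rows are all assumptions made for this sketch and are not confirmed by the description; check the header of the actual file before relying on them. Treating an empty device type as missing follows the note that some user agent strings could not be parsed.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoginAttempt:
    ip_address: str
    country: str
    asn: int
    device_type: Optional[str]  # empty when the user agent could not be parsed
    user_id: int
    login_timestamp: int
    rtt_ms: Optional[int]       # None for an empty field (defensive assumption)
    login_successful: bool
    is_attack_ip: bool
    is_account_takeover: bool


def parse_bool(text):
    """Accept 'true'/'True'/'TRUE'; everything else counts as False."""
    return text.strip().lower() == "true"


def parse_row(row):
    """Convert one csv.DictReader row; header names are assumed, not verified."""
    rtt = row["Round-Trip Time [ms]"].strip()
    device = row["Device Type"].strip()
    return LoginAttempt(
        ip_address=row["IP Address"],
        country=row["Country"],
        asn=int(row["ASN"]),
        device_type=device or None,
        user_id=int(row["User ID"]),
        login_timestamp=int(row["Login Timestamp"]),
        rtt_ms=int(rtt) if rtt else None,
        login_successful=parse_bool(row["Login Successful"]),
        is_attack_ip=parse_bool(row["Is Attack IP"]),
        is_account_takeover=parse_bool(row["Is Account Takeover"]),
    )


def load_attempts(stream):
    return [parse_row(row) for row in csv.DictReader(stream)]


# Invented sample rows for illustration only; not taken from the data set.
SAMPLE_CSV = (
    "IP Address,Country,ASN,Device Type,User ID,Login Timestamp,"
    "Round-Trip Time [ms],Login Successful,Is Attack IP,Is Account Takeover\n"
    "10.0.65.171,NO,29695,mobile,12345,1580515200,,False,False,False\n"
    "203.0.113.7,US,64500,,67890,1580515260,142,True,False,False\n"
)

attempts = load_attempts(io.StringIO(SAMPLE_CSV))
```

For the full data set of more than 33M rows, streaming the file row by row rather than building a list is likely the safer choice.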

Data Creation

As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining the logical relations and temporal order between the features.

The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

The device types are identical to those in the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.

The RTT was drawn at random based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.

Regarding the Data Values

Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effect on the risk scores generated by the Freeman et al. (2016) model.

You can recognize them by the following values:

ASNs with values >= 500000

IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
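The two rules above can be sketched as a small filter, for example to flag or exclude the unrealistic entries before an analysis that depends on real-world addressing. This is a minimal Python sketch; the function name `is_placeholder` and the constant names are introduced here for illustration and are not part of the data set or its tooling.

```python
import ipaddress

# Values taken from the description: ASNs >= 500000 and the 10.0.0.0/8 range.
PLACEHOLDER_ASN_MIN = 500_000
PLACEHOLDER_NET = ipaddress.ip_network("10.0.0.0/8")


def is_placeholder(ip, asn):
    """Return True if either the ASN or the IP address matches the
    'unrealistic value' rules stated in the description."""
    if asn >= PLACEHOLDER_ASN_MIN:
        return True
    return ipaddress.ip_address(ip) in PLACEHOLDER_NET


# Invented (ip, asn) pairs for illustration only.
rows = [("10.1.2.3", 29695), ("198.51.100.9", 500000), ("198.51.100.9", 64500)]
realistic = [row for row in rows if not is_placeholder(*row)]
```

Since the description says these values do not affect the risk scores of the Freeman et al. (2016) model, filtering is only needed for analyses that interpret the IP addresses or ASNs themselves.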

Study Reproduction

Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the median RTTs per country. This is because the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
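A median RTT per country, as mentioned above, can be computed with a simple grouping step. The Python sketch below assumes the records have already been reduced to `(country, rtt_ms)` pairs and that a missing RTT is represented as `None`; both are assumptions of this sketch, and the authors' actual evaluation code is not shown in the description.

```python
from collections import defaultdict
from statistics import median


def median_rtt_by_country(records):
    """records: iterable of (country, rtt_ms) pairs; pairs whose RTT is
    None are skipped. Returns a dict mapping country to median RTT."""
    groups = defaultdict(list)
    for country, rtt in records:
        if rtt is not None:
            groups[country].append(rtt)
    return {country: median(values) for country, values in groups.items()}


# Invented values for illustration only; not taken from the data set.
sample = [("NO", 20), ("NO", 40), ("NO", 30), ("US", 120), ("US", 180), ("DE", None)]
medians = median_rtt_by_country(sample)
```

The same grouping works for regions or cities by swapping the key.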

See RESULTS.md for more details.

Ethics

By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

Publication

You can find more details on our conducted study in the following journal article:

Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security

Bibtex

@article{Wiefling_Pump_2022,
  author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
  title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
  journal = {{ACM} {Transactions} on {Privacy} and {Security}},
  doi = {10.1145/3546069},
  publisher = {ACM},
  year = {2022}
}

License

This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

[1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. This parse error may be useful information for your studies, so we kept these 1526 entries.
