3 datasets found
  1. Network Traffic Analysis: Data and Code

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 12, 2024
    Honig, Joshua (2024). Network Traffic Analysis: Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11479410
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Ferrell, Nathan
    Soni, Shreena
    Moran, Madeline
    Honig, Joshua
    Chan-Tin, Eric
    Homan, Sophia
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code:

    Packet_Features_Generator.py & Features.py

    To run this code:

    pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j

    -h, --help   show this help message and exit
    -i TXTFILE   input text file
    -x X         Add first X number of total packets as features.
    -y Y         Add first Y number of negative packets as features.
    -z Z         Add first Z number of positive packets as features.
    -ml          Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
    -s S         Generate samples using size s.
    -j

    Purpose:

    Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.

    Uses Features.py to calculate the features.
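
As a rough illustration of what the -x, -y and -z options describe, the sketch below builds a feature vector from the first X packets overall, the first Y negative packets and the first Z positive packets. It is not the code from Features.py: the function names, the zero-padding of short captures and the sample values are all assumptions made for this example.

```python
from typing import List


def first_n(values: List[int], n: int, pad: int = 0) -> List[int]:
    """Return the first n values, right-padded with `pad` so the length is always n."""
    head = values[:n]
    return head + [pad] * (n - len(head))


def packet_features(packets: List[int], x: int = 0, y: int = 0, z: int = 0) -> List[int]:
    """Build one feature vector from signed packet sizes.

    x: first x packets overall, y: first y negative packets,
    z: first z positive packets (mirroring the -x, -y, -z options).
    Zero-padding short captures is an assumption of this sketch.
    """
    negative = [p for p in packets if p < 0]
    positive = [p for p in packets if p > 0]
    return first_n(packets, x) + first_n(negative, y) + first_n(positive, z)


# Made-up packet sizes, not values from the dataset.
sample = [565, -1448, -1448, 120, -310]
print(packet_features(sample, x=3, y=2, z=3))
```

Padding keeps every vector the same length, which most classifiers require; whether the original code pads, truncates or skips short captures is not stated in the description.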

    startMachineLearning.sh & machineLearning.py

    To run this code:

    bash startMachineLearning.sh

    This script runs machineLearning.py in a tmux session with the necessary file paths and flags.

    Options (to be edited within this file):

    --evaluate-only to test 5-fold cross-validation accuracy

    --test-scaling-normalization to test 6 different combinations of scalers and normalizers

    Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use

    --grid-search to find the best hyperparameters by grid search. Note: the candidate hyperparameters must be added to train_model under 'if not evaluateOnly:'. Once the best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'

    Purpose:

    Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest classifier on the provided data and reports results using cross-validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.
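
The workflow described above (cross-validated accuracy, then a grid search over hyperparameters) could be sketched with scikit-learn roughly as follows. This is not machineLearning.py: the synthetic data stands in for features loaded from an .ml file, and the scaler choice and hyperparameter ranges are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and labels loaded from an .ml file.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Scaling followed by a random forest; the scaler choice here is illustrative.
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# 5-fold cross-validation accuracy (comparable in spirit to --evaluate-only).
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean())

# Small grid search over hypothetical hyperparameter ranges
# (comparable in spirit to --grid-search).
param_grid = {
    "randomforestclassifier__n_estimators": [50, 100],
    "randomforestclassifier__max_depth": [None, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold never influences the scaling statistics.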

    Data

    Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.

    Data for each experiment was stored and analyzed as a txt file in which each line contains:

    A first number, the classification number denoting which website, query, or VR action is taking place.

    The remaining numbers, each of which denotes:

    the size of a packet,

    and the direction it is traveling:

    negative numbers denote incoming packets,

    positive numbers denote outgoing packets.
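
A minimal sketch of reading one such line is shown below. The field separator is not stated in the description, so the parser accepts whitespace and commas as an assumption; the function name and the sample line are hypothetical.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[int, List[int], List[int]]:
    """Split one record into (label, incoming sizes, outgoing sizes).

    Assumes fields are separated by whitespace and/or commas.
    """
    fields = [int(tok) for tok in line.replace(",", " ").split()]
    label, packets = fields[0], fields[1:]
    incoming = [-p for p in packets if p < 0]  # incoming sizes are stored as negatives
    outgoing = [p for p in packets if p > 0]
    return label, incoming, outgoing


# Made-up record: class 7 followed by five signed packet sizes.
label, incoming, outgoing = parse_line("7 565 -1448 -1448 120 -310")
print(label, incoming, outgoing)
```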

    Figure 4 Data

    This data uses specific lines from the Virtual Reality.txt file.

    The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.

    The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.

    The .xlsx and .csv files are identical.

    Each file includes (from right to left):

    The original packet data,

    each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,

    and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
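
The steps listed above (sort the packet sizes, compute mean and standard deviation, derive a CDF) can be illustrated with a short sketch. It assumes an empirical CDF over the sorted sizes and a sample standard deviation; the description does not say which variants the spreadsheet uses, and the packet sizes below are made up.

```python
import statistics
from typing import List, Tuple


def empirical_cdf(sizes: List[int]) -> List[Tuple[int, float]]:
    """Return (size, cumulative fraction) pairs for sizes sorted ascending."""
    ordered = sorted(sizes)
    n = len(ordered)
    return [(size, (i + 1) / n) for i, size in enumerate(ordered)]


capture = [120, 565, 1448, 310, 1448]  # made-up packet sizes, not dataset values
print("mean:", statistics.mean(capture))
print("std dev:", statistics.stdev(capture))  # sample standard deviation
for size, fraction in empirical_cdf(capture):
    print(size, fraction)
```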

  2. Common languages used for web content 2025, by share of websites

    • statista.com
    • ai-chatbox.pro
    Updated Feb 11, 2025
    Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statista http://statista.com/
    Time period covered
    Feb 2025
    Area covered
    Worldwide
    Description

    As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in German followed, with 5.6 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  3. Share of consumers who restrict the use of cookies in Western Europe 2022,...

    • statista.com
    Updated Oct 18, 2023
    Statista (2023). Share of consumers who restrict the use of cookies in Western Europe 2022, by country [Dataset]. https://www.statista.com/statistics/1367654/cookie-use-restrict-europe/
    Dataset updated
    Oct 18, 2023
    Dataset authored and provided by
    Statista http://statista.com/
    Time period covered
    May 2022
    Area covered
    Europe
    Description

    During a 2022 survey, 49 percent of responding internet users from Germany said that they denied the storage of cookies or adjusted the scope of cookie storage, depending on the website. Roughly a third stated that they generally set their internet browser so that cookies were not stored permanently, or so that their use was at least restricted.
