28 datasets found

Data from: CESNET-QUIC22: A large one-month QUIC network traffic dataset...
data.niaid.nih.gov
explore.openaire.eu
Updated Feb 29, 2024
Cite
Hynek, Karel (2024). CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7409923
Dataset updated
Feb 29, 2024
Dataset provided by
Šiška, Pavel
Lukačovič, Andrej
Čejka, Tomáš
Hynek, Karel
Luxemburk, Jan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.

We recommend using the CESNET DataZoo Python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

Data capture

The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:

W-2022-44
Uncompressed Size: 19 GB
Capture Period: 31.10.2022 - 6.11.2022
Number of flows: 32.6M

W-2022-45
Uncompressed Size: 25 GB
Capture Period: 7.11.2022 - 13.11.2022
Number of flows: 42.6M

W-2022-46
Uncompressed Size: 20 GB
Capture Period: 14.11.2022 - 20.11.2022
Number of flows: 33.7M

W-2022-47
Uncompressed Size: 25 GB
Capture Period: 21.11.2022 - 27.11.2022
Number of flows: 44.1M

CESNET-QUIC22 (whole dataset)
Uncompressed Size: 89 GB
Capture Period: 31.10.2022 - 27.11.2022
Number of flows: 153M

Data description

The dataset consists of network flows describing encrypted QUIC communications. Flows were created using ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.

Packet Sequences

Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

Flow statistics

Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of flow, and packet histograms.
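The three fields derived from each packet sequence (length, duration, and roundtrips) can be sketched as follows. This is a minimal illustration and not the exporter's actual code: the function name is hypothetical, inter-packet times are assumed to be in milliseconds, and a roundtrip is read here as one client-to-server packet followed by a server-to-client packet (one request and response pair).

```python
from typing import List, Tuple


def ppi_derived_fields(ppi: List[List[float]]) -> Tuple[int, float, int]:
    """Return (length, duration in seconds, roundtrips) for one packet sequence.

    Assumes ppi = [inter_packet_times, directions, sizes], with inter-packet
    times in milliseconds (an assumption) and directions encoded as +1
    (client to server) or -1 (server to client).
    """
    inter_packet_times, directions, sizes = ppi
    length = len(sizes)
    duration = sum(inter_packet_times) / 1000.0
    # Count each switch from a client packet (+1) to a server packet (-1),
    # i.e. one request/response pair per roundtrip (one possible reading).
    roundtrips = sum(
        1
        for prev, cur in zip(directions, directions[1:])
        if prev == 1 and cur == -1
    )
    return length, duration, roundtrips


example = [[0, 20, 30, 50], [1, -1, 1, -1], [1200, 1200, 80, 500]]
print(ppi_derived_fields(example))
```

Counting every direction change instead (both +1 to -1 and -1 to +1) would be the other literal reading of the description; the values provided with the dataset remain the reference.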
Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

Dataset structure

The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling.

The following list describes flow data fields in CSV files:

ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
QUIC_VERSION: QUIC protocol version
QUIC_SNI: Server Name Indication domain
QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
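To make the PHIST_* fields more concrete, the eight-bin logarithmic histogram can be sketched as below. This is an illustration only: the function and constant names are hypothetical, and the bin boundaries are taken literally from the intervals quoted in the description (the exporter's own PHISTS implementation may treat the boundary values differently).

```python
from bisect import bisect_left
from typing import Iterable, List

# Inclusive upper bounds of the first seven bins as listed in the description;
# anything above 1024 falls into the eighth bin.
BIN_UPPER_BOUNDS = [15, 31, 63, 127, 255, 511, 1024]


def packet_histogram(values: Iterable[float]) -> List[int]:
    """Count values (packet sizes in bytes or inter-packet times in ms) per bin."""
    counts = [0] * (len(BIN_UPPER_BOUNDS) + 1)
    for value in values:
        # bisect_left returns the index of the first bound >= value,
        # or 7 when the value exceeds every bound.
        counts[bisect_left(BIN_UPPER_BOUNDS, value)] += 1
    return counts


print(packet_histogram([0, 15, 16, 100, 1024, 1025, 1500]))
```

The same function serves both the size and the inter-packet-time histograms, since the description uses identical bin edges for bytes and milliseconds.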

Link to other CESNET datasets

https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo

Please cite the original data article:

@article{CESNETQUIC22,
  author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
  title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
  journal = {Data in Brief},
  pages = {108888},
  year = {2023},
  issn = {2352-3409},
  doi = {https://doi.org/10.1016/j.dib.2023.108888},
  url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Dataset updated
Feb 20, 2020
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
Bulk data transfer: applications that transfer large files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository, among others.
Web browsing: all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university's Moodle.
Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university online classroom has also been used.
Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube, and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.

The amount of data is stated as follows:

Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
Video: 23 traces, 4496 s, 1405 MBytes
Web: 23 traces, 4203 s, 148 MBytes
Interactive: 42 traces, 8934 s, 30.5 MBytes
Idle: 52 traces, 6341 s, 0.69 MBytes
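As a quick sanity check on how different the five classes are, the totals above can be turned into an average pcap volume per second of capture. This is only arithmetic on the published figures; the variable and function names are illustrative, and the pcap size includes headers, so the result is not a payload throughput.

```python
# (number of traces, total duration in seconds, total pcap size in MBytes),
# copied from the amounts listed above.
TRACES = {
    "Bulk": (19, 3599, 8704.0),
    "Video": (23, 4496, 1405.0),
    "Web": (23, 4203, 148.0),
    "Interactive": (42, 8934, 30.5),
    "Idle": (52, 6341, 0.69),
}


def mean_rate_mbytes_per_s(activity: str) -> float:
    """Average pcap volume per second of capture for one activity."""
    _, seconds, mbytes = TRACES[activity]
    return mbytes / seconds


rates = {activity: mean_rate_mbytes_per_s(activity) for activity in TRACES}
for activity, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{activity:<12}{rate:.5f} MBytes/s")
```

The classes span several orders of magnitude in volume per second, which suggests that even simple volume-based features may separate some of them.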

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
Daily website visitors (time series regression)
kaggle.com
Updated Aug 20, 2020
Cite
Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
Dataset updated
Aug 20, 2020
Dataset provided by
Kaggle http://kaggle.com/
Authors
Bob Nau
Description
Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day-ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.
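The 6-hour rule for "unique" visits can be illustrated with a small sketch. It is not StatCounter's implementation: the function name and the input format (pairs of IP address and timestamp) are assumptions, and the boundary case of exactly 6 hours is treated as a new unique visit.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

WINDOW = timedelta(hours=6)


def count_unique_visits(hits: Iterable[Tuple[str, datetime]]) -> int:
    """Count hits that had no earlier hit from the same IP within the last 6 hours."""
    last_hit: Dict[str, datetime] = {}
    unique = 0
    for ip, timestamp in sorted(hits, key=lambda hit: hit[1]):
        previous = last_hit.get(ip)
        if previous is None or timestamp - previous >= WINDOW:
            unique += 1
        # Every hit, unique or not, resets the 6-hour window for its IP.
        last_hit[ip] = timestamp
    return unique


day = datetime(2020, 8, 19)
hits = [
    ("10.0.0.1", day.replace(hour=8)),
    ("10.0.0.1", day.replace(hour=9)),   # within 6 hours of the 08:00 hit
    ("10.0.0.1", day.replace(hour=16)),  # 7 hours after the 09:00 hit
    ("10.0.0.2", day.replace(hour=8, minute=30)),
]
print(count_unique_visits(hits))
```

Because users are identified by IP address only, several people behind one address collapse into a single entry in `last_hit`, which is the undercounting effect the description mentions.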

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.
Data from: Analysis of the Quantitative Impact of Social Networks General...
figshare.com
produccioncientifica.ucm.es
doc
Updated Oct 14, 2022
Cite
David Parra; Santiago Martínez Arias; Sergio Mena Muñoz (2022). Analysis of the Quantitative Impact of Social Networks General Data.doc [Dataset]. http://doi.org/10.6084/m9.figshare.21329421.v1
Available download formats: doc
Unique identifier
https://doi.org/10.6084/m9.figshare.21329421.v1
Dataset updated
Oct 14, 2022
Dataset provided by
Figshare http://figshare.com/
Authors
David Parra; Santiago Martínez Arias; Sergio Mena Muñoz
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union". Four research questions are posed: What percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is this percentage higher or lower than that provided through direct traffic and through search engines via SEO positioning? Which social networks have a greater impact? And is there any relationship between the specific weight of social networks in the web traffic of a cybermedia outlet and circumstances such as the average duration of the user's visit, the number of page views, or the bounce rate, understood in its formal sense of not performing any interaction on the visited page beyond reading its content?

To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We did not use local metrics by country, since the results obtained with these two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic. In all cases, we selected cybermedia owned by a journalistic company, ruling out those belonging to telecommunications portals or service providers; some are classic news companies (both newspapers and television channels) while others are digital natives, and this distinction does not affect the nature of the proposed research.

Next, we examined the web traffic data of these cybermedia for the period covering October, November, and December 2021 and January, February, and March 2022. We believe that this six-month stretch smooths out possible one-off variations in a single month, reinforcing the precision of the data obtained. To obtain this data, we used the SimilarWeb tool, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic coming from desktops and laptops and does not take into account traffic from mobile devices, which is currently impossible to determine with the measurement tools on the market. It includes:

Web traffic general data: average visit duration, pages per visit and bounce rate
Web traffic origin by country
Percentage of traffic generated from social media over total web traffic
Distribution of web traffic generated from social networks
Comparison of web traffic generated from social networks with direct and search procedures
Mill Road Project: Traffic Sensor Data
findtransportdata.dft.gov.uk
Updated Oct 7, 2020
Cite
Smart Cambridge (2020). Mill Road Project: Traffic Sensor Data [Dataset]. https://findtransportdata.dft.gov.uk/dataset/mill-road-project:-traffic-sensor-data-177f76b38b2
Dataset updated
Oct 7, 2020
Dataset authored and provided by
Smart Cambridge
License
http://reference.data.gov.uk/id/open-government-licence
Description
15 smart sensors were installed on Mill Road and surrounding streets to record numbers of pedestrians, bicycles, cars and other vehicles. The data being collated and analysed by the Smart Cambridge programme will help the Greater Cambridge Partnership understand how people use the road network.

Data will be released monthly for these locations until the end of 2020. Please note that due to the level of insight that can be gained from these sensors, additional sensors in more locations have been installed in Cambridge since the summer of 2019. Some sensors will remain beyond 2020 in strategic locations and the network is expected to grow. Data for those more permanent sites, outside of the Mill Road project will be published here: https://data.cambridgeshireinsight.org.uk/dataset/cambridge-city-smart-s...

Mill Road Bridge was closed for eight weeks from 1 July 2019 for crucial work being carried out to improve rail services. Pedestrians and cyclists will still be able to cross the railway for most of the working time.

A high concentration of sensors was installed for approximately 18 months to gather data before the closure, during the time when there is no vehicle traffic coming over Mill Road Bridge and then after the bridge is re-opened. This has allowed engineers to see the impact of the closure on surrounding roads, including on air quality. Keeping the sensors in place for this long has also allowed teams to make greater comparisons, by taking into account daily, weekly, monthly and annual variations in traffic levels.

The data release below offers counts for each sensor over 1-hour periods. The current data covers the period 03/06/2019 to 13/12/2020.

Hourly counts are broken down by inbound and outbound journeys.

Counts are also broken down by vehicle type. This includes:

Pedestrians
Cyclists
Buses
LGV
OGV 1
OGV 2

The release also includes a full list of sensor sites with geographic point location data.
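Since the release provides hourly counts split by direction and vehicle type, a common first step is rolling them up to daily totals. The sketch below assumes a hypothetical record layout (the keys `sensor`, `hour_start`, `vehicle_type`, `in`, and `out` are not taken from the published files) and the day/month/year date style used in the description.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def daily_totals(rows: Iterable[dict]) -> Dict[Tuple, int]:
    """Sum inbound and outbound hourly counts per (sensor, day, vehicle type)."""
    totals: Dict[Tuple, int] = defaultdict(int)
    for row in rows:
        # Hypothetical timestamp format, e.g. "03/06/2019 08:00".
        day = datetime.strptime(row["hour_start"], "%d/%m/%Y %H:%M").date()
        totals[(row["sensor"], day, row["vehicle_type"])] += row["in"] + row["out"]
    return dict(totals)


rows = [
    {"sensor": "S1", "hour_start": "03/06/2019 08:00", "vehicle_type": "Cyclists", "in": 5, "out": 3},
    {"sensor": "S1", "hour_start": "03/06/2019 09:00", "vehicle_type": "Cyclists", "in": 2, "out": 1},
    {"sensor": "S1", "hour_start": "03/06/2019 08:00", "vehicle_type": "Buses", "in": 1, "out": 0},
]
for key, count in daily_totals(rows).items():
    print(key, count)
```

Keeping direction as a separate key instead of summing `in` and `out` would preserve the inbound and outbound breakdown if that is what the analysis needs.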
Wiki Dataset
paperswithcode.com
Updated Jan 20, 2023
Cite
(2023). Wiki Dataset [Dataset]. https://paperswithcode.com/dataset/wiki
Dataset updated
Jan 20, 2023
Description
Context There's a story behind every dataset and here's your opportunity to share yours.

Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Passive Operating System Fingerprinting Revisited - Network Flows Dataset
zenodo.org
data.niaid.nih.gov
zip
Updated Feb 14, 2023
Cite
Martin Laštovička; Martin Husák; Petr Velan; Tomáš Jirsík; Pavel Čeleda (2023). Passive Operating System Fingerprinting Revisited - Network Flows Dataset [Dataset]. http://doi.org/10.5281/zenodo.7635138
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7635138
Dataset updated
Feb 14, 2023
Dataset provided by
Zenodo http://zenodo.org/
Authors
Martin Laštovička; Martin Husák; Petr Velan; Tomáš Jirsík; Pavel Čeleda
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For the evaluation of OS fingerprinting methods, we need a dataset with the following requirements:

First, the dataset needs to be big enough to capture the variability of the data. In this case, we need many connections from different operating systems.

Second, the dataset needs to be annotated, which means that the corresponding operating system needs to be known for each network connection captured in the dataset. Therefore, we cannot just capture any network traffic for our dataset; we need to be able to determine the OS reliably.

To overcome these issues, we have decided to create the dataset from the traffic of several web servers at our university. This allows us to address the first issue by collecting traces from thousands of devices ranging from user computers and mobile phones to web crawlers and other servers. The ground truth values are obtained from the HTTP User-Agent, which resolves the second of the presented issues. Even though most traffic is encrypted, the User-Agent can be recovered from the web server logs that record every connection’s details. By correlating the IP address and timestamp of each log record to the captured traffic, we can add the ground truth to the dataset.

For this dataset, we have selected a cluster of five web servers that host 475 unique university domains for public websites. The monitoring point recording the traffic was placed at the backbone network connecting the university to the Internet.

The dataset used in this paper was collected from approximately 8 hours of university web traffic throughout a single workday. The logs were collected from Microsoft IIS web servers and converted from W3C extended logging format to JSON. The logs are referred to as web logs and are used to annotate the records generated from packet capture obtained by using a network probe tapped into the link to the Internet.

The entire dataset creation process consists of seven steps:

The packet capture was processed by the Flowmon flow exporter (https://www.flowmon.com) to obtain primary flow data containing information from TLS and HTTP protocols.

Additional statistical features were extracted using GoFlows flow exporter (https://github.com/CN-TU/go-flows).

The primary flows were filtered to remove incomplete records and network scans.

The flows from both exporters were merged together into records containing fields from both sources.

Web logs were filtered to cover the same time frame as the flow records.

Web logs were paired with the flow records based on shared properties (IP address, port, time).

The last step was to convert the User-Agent values into the operating system using a Python version of the open-source tool ua-parser (https://github.com/ua-parser/uap-python). We replaced the unstructured User-Agent string in the records with the resulting OS.
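The pairing step, in which web logs are matched to flow records on IP address, port, and time, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the record keys (`client_ip`, `client_port`, `time`, `src_ip`, `src_port`, `start`, `end`), the function name, and the one-second tolerance are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pair_logs_with_flows(
    logs: List[dict], flows: List[dict], tolerance: float = 1.0
) -> List[Tuple[dict, dict]]:
    """Pair each web log record with the first flow sharing its client
    endpoint whose time span (widened by `tolerance` seconds) covers the log time."""
    flows_by_endpoint: Dict[Tuple[str, int], List[dict]] = defaultdict(list)
    for flow in flows:
        flows_by_endpoint[(flow["src_ip"], flow["src_port"])].append(flow)

    pairs = []
    for log in logs:
        candidates = flows_by_endpoint.get((log["client_ip"], log["client_port"]), [])
        for flow in candidates:
            if flow["start"] - tolerance <= log["time"] <= flow["end"] + tolerance:
                pairs.append((log, flow))
                break  # keep at most one flow per log record
    return pairs


flows = [
    {"src_ip": "10.0.0.1", "src_port": 50000, "start": 100.0, "end": 110.0},
    {"src_ip": "10.0.0.2", "src_port": 50001, "start": 100.0, "end": 110.0},
]
logs = [
    {"client_ip": "10.0.0.1", "client_port": 50000, "time": 105.0, "os": "Windows"},
    {"client_ip": "10.0.0.3", "client_port": 50002, "time": 105.0, "os": "Linux"},
]
print(len(pair_logs_with_flows(logs, flows)))
```

Indexing flows by client endpoint first keeps the matching close to linear in the number of records, which matters when many log lines share a short time window.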

The collected and enriched flows contain 111 data fields that can be used as features for OS fingerprinting or any other data analyses. The fields grouped by their area are listed below:

basic flow properties - flow_ID;start;end;L3 PROTO;L4 PROTO;BYTES A;PACKETS A;SRC IP;DST IP;TCP flags A;SRC port;DST port;packetTotalCountforward;packetTotalCountbackward;flowDirection;flowEndReason;

IP parameters - IP ToS;maximumTTLforward;maximumTTLbackward;IPv4DontFragmentforward;IPv4DontFragmentbackward;

TCP parameters - TCP SYN Size;TCP Win Size;TCP SYN TTL;tcpTimestampFirstPacketbackward;tcpOptionWindowScaleforward;tcpOptionWindowScalebackward;tcpOptionSelectiveAckPermittedforward;tcpOptionSelectiveAckPermittedbackward;tcpOptionMaximumSegmentSizeforward;tcpOptionMaximumSegmentSizebackward;tcpOptionNoOperationforward;tcpOptionNoOperationbackward;synAckFlag;tcpTimestampFirstPacketforward;

HTTP - HTTP Request Host;URL;

User-agent - UA OS family;UA OS major;UA OS minor;UA OS patch;UA OS patch minor;

TLS - TLS_CONTENT_TYPE;TLS_HANDSHAKE_TYPE;TLS_SETUP_TIME;TLS_SERVER_VERSION;TLS_SERVER_RANDOM;TLS_SERVER_SESSION_ID;TLS_CIPHER_SUITE;TLS_ALPN;TLS_SNI;TLS_SNI_LENGTH;TLS_CLIENT_VERSION;TLS_CIPHER_SUITES;TLS_CLIENT_RANDOM;TLS_CLIENT_SESSION_ID;TLS_EXTENSION_TYPES;TLS_EXTENSION_LENGTHS;TLS_ELLIPTIC_CURVES;TLS_EC_POINT_FORMATS;TLS_CLIENT_KEY_LENGTH;TLS_ISSUER_CN;TLS_SUBJECT_CN;TLS_SUBJECT_ON;TLS_VALIDITY_NOT_BEFORE;TLS_VALIDITY_NOT_AFTER;TLS_SIGNATURE_ALG;TLS_PUBLIC_KEY_ALG;TLS_PUBLIC_KEY_LENGTH;TLS_JA3_FINGERPRINT;

Packet timings - NPM_CLIENT_NETWORK_TIME;NPM_SERVER_NETWORK_TIME;NPM_SERVER_RESPONSE_TIME;NPM_ROUND_TRIP_TIME;NPM_RESPONSE_TIMEOUTS_A;NPM_RESPONSE_TIMEOUTS_B;NPM_TCP_RETRANSMISSION_A;NPM_TCP_RETRANSMISSION_B;NPM_TCP_OUT_OF_ORDER_A;NPM_TCP_OUT_OF_ORDER_B;NPM_JITTER_DEV_A;NPM_JITTER_AVG_A;NPM_JITTER_MIN_A;NPM_JITTER_MAX_A;NPM_DELAY_DEV_A;NPM_DELAY_AVG_A;NPM_DELAY_MIN_A;NPM_DELAY_MAX_A;NPM_DELAY_HISTOGRAM_1_A;NPM_DELAY_HISTOGRAM_2_A;NPM_DELAY_HISTOGRAM_3_A;NPM_DELAY_HISTOGRAM_4_A;NPM_DELAY_HISTOGRAM_5_A;NPM_DELAY_HISTOGRAM_6_A;NPM_DELAY_HISTOGRAM_7_A;NPM_JITTER_DEV_B;NPM_JITTER_AVG_B;NPM_JITTER_MIN_B;NPM_JITTER_MAX_B;NPM_DELAY_DEV_B;NPM_DELAY_AVG_B;NPM_DELAY_MIN_B;NPM_DELAY_MAX_B;NPM_DELAY_HISTOGRAM_1_B;NPM_DELAY_HISTOGRAM_2_B;NPM_DELAY_HISTOGRAM_3_B;NPM_DELAY_HISTOGRAM_4_B;NPM_DELAY_HISTOGRAM_5_B;NPM_DELAY_HISTOGRAM_6_B;NPM_DELAY_HISTOGRAM_7_B;

ICMP - ICMP TYPE;

The details of OS distribution grouped by the OS family are summarized in the table below. The Other OS family contains records generated by web crawling bots that do not include OS information in the User-Agent.

OS Family      Number of flows
Other          42474
Windows        40349
Android        10290
iOS            8840
Mac OS X       5324
Linux          1589
Ubuntu         653
Fedora         88
Chrome OS      53
Symbian OS     1
Slackware      1
Linux Mint     1
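To see how imbalanced the OS families are, the counts in the table above can be converted to shares of all flows. The snippet only does arithmetic on the published numbers; the variable names are illustrative.

```python
# Flow counts per OS family, copied from the table above.
OS_FLOW_COUNTS = {
    "Other": 42474,
    "Windows": 40349,
    "Android": 10290,
    "iOS": 8840,
    "Mac OS X": 5324,
    "Linux": 1589,
    "Ubuntu": 653,
    "Fedora": 88,
    "Chrome OS": 53,
    "Symbian OS": 1,
    "Slackware": 1,
    "Linux Mint": 1,
}

total = sum(OS_FLOW_COUNTS.values())
shares = {family: 100.0 * count / total for family, count in OS_FLOW_COUNTS.items()}

print(f"Total flows: {total}")
for family, share in shares.items():
    print(f"{family:<12}{share:7.3f} %")
```

Three families appear with a single flow each, so any evaluation on this data probably needs either family grouping or a metric that is robust to rare classes.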
Google Analytics Sample
console.cloud.google.com
Updated Jul 15, 2017
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=de&inv=1&invt=Ab2fng (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=de
Dataset updated
Jul 15, 2017
Dataset provided by
Google http://google.com/
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It's a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data.

The data is typical of what an ecommerce website would see and includes the following information:

Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic.
Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions on the Google Merchandise Store website.

Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated, such as fullVisitorId, or removed, such as clientId, adWordsClickInfo and geoNetwork. "Not available in demo dataset" will be returned for STRING values and "null" will be returned for INTEGER values when querying the fields containing no data.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
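Because removed STRING fields come back as a placeholder string rather than as null, downstream code may want to normalize both cases to a single missing value. The helper below is a hypothetical sketch operating on plain dictionaries (for example, rows already fetched from a query); it does not use any BigQuery client API, and the `visits` key in the example is made up.

```python
from typing import Any, Dict

# Placeholder returned for STRING fields with no data, as quoted above.
PLACEHOLDER = "Not available in demo dataset"


def clean_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map the demo placeholder string to None; existing None values stay None."""
    return {key: (None if value == PLACEHOLDER else value) for key, value in row.items()}


row = {"fullVisitorId": "123", "clientId": PLACEHOLDER, "visits": None}
print(clean_row(row))
```

After this step a single `is None` check covers both removed STRING fields and empty INTEGER fields.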
UK Truck Brands Dataset
kaggle.com
Updated Jun 29, 2023
Cite
Gerrit Hoekstra (2023). UK Truck Brands Dataset [Dataset]. https://www.kaggle.com/bignosethethird/uk-truck-brands-dataset/discussion
Dataset updated
Jun 29, 2023
Dataset provided by
Kaggle http://kaggle.com/
Authors
Gerrit Hoekstra
License
http://www.gnu.org/licenses/lgpl-3.0.html
Area covered
United Kingdom
Description
Context

What would we use this dataset for? Firstly, crime prevention and detection. Following on, so much more else too, like finer traffic control management. Want to make a case to your local council for building that bypass around your village, already?

Content

With a comprehensive set of images of each of the popular vehicle brands, it should be possible to "teach" a machine-learning application to recognize car brands in real time from live video and camera observations. This gives an added attribute to a vehicle-of-interest's registration / license plate, or at least provides some backup information where the license plate could not be read.

These images have all been manually curated to prevent any ambiguities in the ML process, and advertisements and other useless vehicle views (like vehicle interiors and photos of your smiling salesman) have been removed.

Acknowledgements

The data collection process is described in https://github.com/gerritonagoodday/VehicleBrandDatasetScraping and used web scraping from popular car deal websites. I used ScrapingBee to do the web scraping where websites had put up obstacles to prevent web scraping.

If you want to enhance the dataset further or create datasets for other countries besides the UK, you can make a few configuration changes in the Python scripts. You will also need your own API key, at $29 per month. Sign up through this link and get your own API key:

https://www.scrapingbee.com?fpr=nobnose-inc27

Inspiration

Southern Africa has a huge problem with crime, corruption, and wildlife poaching. Much of this crime is committed by government officials, directly or indirectly. The difficulty in securing convictions has always been pinpointing the culprits with irrefutable evidence within an already corrupt judicial system. This dataset could go some way towards achieving that.
Swash Web Browsing Clickstream Data - 1.5M Worldwide Users - GDPR Compliant
datarade.ai
.csv, .xls
Updated Jun 27, 2023
+ more versions
Cite
Swash (2023). Swash Web Browsing Clickstream Data - 1.5M Worldwide Users - GDPR Compliant [Dataset]. https://datarade.ai/data-products/swash-blockchain-bitcoin-and-web3-enthusiasts-swash
Explore at:
Available download formats: .csv, .xls
Dataset updated
Jun 27, 2023
Dataset authored and provided by
Swash
Area covered
Latvia, Monaco, Jordan, Belarus, Jamaica, India, Saint Vincent and the Grenadines, Uzbekistan, Liechtenstein, Russian Federation
Description
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.

Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.

User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.

Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.

GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.

Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.

High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.

Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.

Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
Data from: 3DHD CityScenes: High-Definition Maps in High-Density Point...
zenodo.org
data.niaid.nih.gov
bin, pdf
Updated Jul 16, 2024
Cite
Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt (2024). 3DHD CityScenes: High-Definition Maps in High-Density Point Clouds [Dataset]. http://doi.org/10.5281/zenodo.7085090
Explore at:
Available download formats: bin, pdf
Unique identifier
https://doi.org/10.5281/zenodo.7085090
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Overview

3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany including 467 km of individual lanes. In total, our map comprises 266,762 individual items.

Our corresponding paper (published at ITSC 2022) is available here.
Further, we have applied 3DHD CityScenes to map deviation detection here.

Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:

Python tools to read, generate, and visualize the dataset,

3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.

The DevKit is available here:

https://github.com/volkswagen/3DHD_devkit.

The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.

When using our dataset, you are welcome to cite:

@INPROCEEDINGS{9921866,
  author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
  booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
  title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
  year={2022},
  pages={627-634}}

Acknowledgements

We thank the following interns for their exceptional contributions to our work.

Benjamin Sertolli: Major contributions to our DevKit during his master thesis

Niels Maier: Measurement campaign for data collection and data preparation

The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

The Dataset

After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

1. Dataset

This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.

During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.

To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

import json

json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)

2. HD_Map

Map items are stored as lists of items in JSON format. In particular, we provide:

traffic signs,

traffic lights,

pole-like objects,

construction site locations,

construction site obstacles (point-like such as cones, and line-like such as fences),

line-shaped markings (solid, dashed, etc.),

polygon-shaped markings (arrows, stop lines, symbols, etc.),

lanes (ordinary and temporary),

relations between elements (only for construction sites, e.g., sign to lane association).

3. HD_Map_MetaData

Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation for each tile as a polygon on the map. You can view the respective tile definition using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.

Files with the ending .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as “shape file” (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.

4. HD_PointCloud_Tiles

The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.

x-coordinates: 4 byte integer

y-coordinates: 4 byte integer

z-coordinates: 4 byte integer

intensity of reflected beams: 2 byte unsigned integer

ground classification flag: 1 byte unsigned integer

After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

import numpy as np
import pptk

file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
type_list = ['
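As a rough illustration of the binary layout described above, the following sketch reads the raw arrays of one tile using only the Python standard library. Little-endian byte order, signed 32-bit coordinates, and the helper names read_tile_raw and write_tile_raw are assumptions made for this example and are not part of the DevKit. The unnormalization step is left out because its constants are not given here, and a tiny synthetic file stands in for a real tile.

```python
import os
import struct
import tempfile

# Field order and sizes follow the description above.
# struct codes: i = 4-byte signed int, H = 2-byte unsigned, B = 1-byte unsigned.
# Assumed: little-endian byte order ("<") and signed coordinates.
FIELDS = [("x", "i"), ("y", "i"), ("z", "i"), ("intensity", "H"), ("is_ground", "B")]


def write_tile_raw(path, arrays):
    """Hypothetical helper: write arrays in the described layout (demo only)."""
    num_points = len(arrays["x"])
    with open(path, "wb") as f:
        f.write(struct.pack("<i", num_points))
        for name, code in FIELDS:
            f.write(struct.pack(f"<{num_points}{code}", *arrays[name]))


def read_tile_raw(path):
    """Hypothetical helper: return the raw (still normalized) values of one tile."""
    with open(path, "rb") as f:
        # Header: the number of points as a 4-byte integer.
        (num_points,) = struct.unpack("<i", f.read(4))
        tile = {}
        # Each attribute is stored as one contiguous array, not per point.
        for name, code in FIELDS:
            fmt = f"<{num_points}{code}"
            tile[name] = list(struct.unpack(fmt, f.read(struct.calcsize(fmt))))
        return tile


# Round trip on a synthetic three-point tile instead of a real .bin file.
demo = {
    "x": [10, 20, 30],
    "y": [1, 2, 3],
    "z": [-5, 0, 5],
    "intensity": [100, 200, 300],
    "is_ground": [1, 0, 1],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    tile_path = os.path.join(tmp_dir, "synthetic_tile.bin")
    write_tile_raw(tile_path, demo)
    tile = read_tile_raw(tile_path)

print(tile)
```

For real tiles with many points, numpy.fromfile with matching dtypes would likely be a faster way to read the same layout.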
5. Trajectories

We provide 15 real-world trajectories recorded during a measurement campaign covering the whole HD map. Trajectory samples are provided at approx. 30 Hz and are encoded in JSON. These trajectories were used to provide the samples in train.json, val.json, and test.json with realistic geolocations and orientations of the ego vehicle.

OP1 – OP5 cover the majority of the map with 5 trajectories. RH1 – RH10 cover the majority of the map with 10 trajectories. Note that OP5 is split into three separate parts, a-c. RH9 is split into two parts, a-b. Moreover, OP4 mostly equals OP1 (thus, we speak of 14 trajectories in our paper). For completeness, however, we provide all recorded trajectories here.
Number of internet users worldwide 2014-2029
statista.com
Updated Apr 11, 2025
Cite
Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Area covered
World
Description
The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion, a new peak, in 2029. Notably, the number of internet users has increased continuously over the past years.

Depicted is the estimated number of individuals in the country or region at hand who use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects that are not taken into account here.

The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).

Find more key insights for the number of internet users in regions like the Americas and Asia.
World Traffic Map
hub.arcgis.com
data-bgky.hub.arcgis.com
+1more
Updated Dec 13, 2012
+ more versions
Cite
Esri (2012). World Traffic Map [Dataset]. https://hub.arcgis.com/maps/esri::world-traffic-map/about
Explore at:
Dataset updated
Dec 13, 2012
Dataset authored and provided by
Esri (http://esri.com/)
Area covered

Description
This map contains a dynamic traffic map service with capabilities for visualizing traffic speeds relative to free-flow speeds, as well as traffic incidents that can be visualized and identified. The traffic data is updated every five minutes. Traffic speeds are displayed as a percentage of free-flow speeds, which is frequently the speed limit or how fast cars tend to travel when unencumbered by other vehicles. The streets are color coded as follows:

Green (fast): 85 - 100% of free-flow speeds

Yellow (moderate): 65 - 85%

Orange (slow): 45 - 65%

Red (stop and go): 0 - 45%

Esri's historical, live, and predictive traffic feeds come directly from TomTom (www.tomtom.com). Historical traffic is based on the average of observed speeds over the past year. The live and predictive traffic data is updated every five minutes through traffic feeds.

The color coded traffic map layer can be used to represent relative traffic speeds; this is a common type of map for online services and is used to provide context for routing, navigation and field operations. The traffic map layer contains two sublayers: Traffic and Live Traffic. The Traffic sublayer (shown by default) leverages historical, live and predictive traffic data, while the Live Traffic sublayer is calculated from the live and predictive traffic data only. A color coded traffic map can be requested for the current time and any time in the future. A map for a future request might be used for planning purposes.

The map also includes dynamic traffic incidents showing the location of accidents, construction, closures and other issues that could potentially impact the flow of traffic. Traffic incidents are commonly used to provide context for routing, navigation and field operations. Incidents are not features; they cannot be exported and stored for later use or additional analysis.

The service works globally and can be used to visualize traffic speeds and incidents in many countries. Check the service coverage web map to determine availability in your area of interest. In the coverage map, the countries color coded in dark green support visualizing live traffic. The support for traffic incidents can be determined by identifying a country. For detailed information on this service, including a data coverage map, visit the directions and routing documentation and ArcGIS Help.
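To make the color bands described above concrete, here is a minimal sketch that maps a measured speed to a band. The function name traffic_color is hypothetical and not part of any Esri API, and the handling of values that fall exactly on a shared boundary (85, 65, 45) is an assumption, since the listed ranges overlap at their edges.

```python
def traffic_color(speed, free_flow_speed):
    """Map a speed to the color band for its percentage of free-flow speed."""
    if free_flow_speed <= 0:
        raise ValueError("free_flow_speed must be positive")
    # Percentage of free-flow speed, clamped to the 0 - 100 range of the bands.
    percent = min(100.0, max(0.0, 100.0 * speed / free_flow_speed))
    # Assumption: a value on a shared boundary is assigned to the faster band.
    if percent >= 85:
        return "green"
    if percent >= 65:
        return "yellow"
    if percent >= 45:
        return "orange"
    return "red"


# Example: 35 km/h on a road whose free-flow speed is 50 km/h (70 percent).
print(traffic_color(35, 50))
```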
ASAYAR: A Dataset for Arabic-Latin Text Detection
kaggle.com
Updated Feb 4, 2022
Cite
Mohammed AKALLOUCH (2022). ASAYAR: A Dataset for Arabic-Latin Text Detection [Dataset]. https://www.kaggle.com/datasets/akallouch/asayar
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 4, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mohammed AKALLOUCH
Description
ASAYAR

This is a description for the paper:
ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels
Mohammed Akallouch; Kaoutar Sefrioui Boujemaa; Afaf Bouhoute; Khalid Fardousse; Ismail Berrada

Overview

ASAYAR is the first public dataset dedicated to Latin (French) and Arabic scene text detection in highway panels. It comprises more than 1800 well-annotated images. The dataset was collected from Moroccan highways.

Annotation format

In the dataset, each instance's location is annotated by a rectangular bounding box, denoted as {XMIN, YMIN, XMAX, YMAX}. An object has a class name denoted as CLASS. The global image information is defined as follows: FOLDER, PATH, NAME, and SIZE.

Dataset structure

Train or Test/
├── ASAYAR_SIGN/
│   ├── Annotations/
│   │   ├── image_1.xml
│   │   └── ...
│   └── Images/
│       ├── image_1.png
│       └── ...
│
├── ASAYAR_TXT/
│   ├── Annotations/
│   │   ├── Line-Level/
│   │   │   ├── image_1.xml
│   │   │   └── ...
│   │   └── Word-Level/
│   │       ├── image_1.xml
│   │       └── ...
│   └── Images/
│       ├── image_1.png
│       └── ...
└── ASAYAR_SYM/
    ├── Annotations/
    │   ├── image_1.xml
    │   └── ...
    └── Images/
        ├── image_1.png
        └── ...

Import data

We provide a Jupyter Notebook with an example to import images and their annotations.

Convert to text format

To convert annotations from Pascal VOC to txt format (xmin,ymin,xmax,ymax,class), use convert2txt.py.
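The kind of conversion described above can be sketched as follows. This is not the dataset's convert2txt.py; it assumes the annotation files follow the standard Pascal VOC XML layout (object, name, bndbox), and the sample annotation, its class names, and the function name voc_to_txt_lines are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the standard Pascal VOC layout; real ASAYAR
# files may contain additional or differently named fields.
SAMPLE_XML = """<annotation>
  <folder>Images</folder>
  <filename>image_1.png</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>arabic</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>60</ymax></bndbox>
  </object>
  <object>
    <name>latin</name>
    <bndbox><xmin>15</xmin><ymin>80</ymin><xmax>140</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>"""


def voc_to_txt_lines(xml_text):
    """Return one 'xmin,ymin,xmax,ymax,class' line per annotated object."""
    root = ET.fromstring(xml_text)
    lines = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [box.findtext(tag).strip() for tag in ("xmin", "ymin", "xmax", "ymax")]
        lines.append(",".join(coords + [obj.findtext("name").strip()]))
    return lines


txt_lines = voc_to_txt_lines(SAMPLE_XML)
print("\n".join(txt_lines))
```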

Examples of Annotated Images

https://vcar.github.io/ASAYAR/images/image_895.png

Website

The data website: ASAYAR

Citation

Our paper introducing the dataset and the evaluation methods is published in the IEEE Transactions on Intelligent Transportation Systems 2020 and is available here. If you make use of the ASAYAR dataset, please cite the following paper:

@ARTICLE{9233923,
  author={M. {Akallouch} and K. S. {Boujemaa} and A. {Bouhoute} and K. {Fardousse} and I. {Berrada}},
  journal={IEEE Transactions on Intelligent Transportation Systems},
  title={ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels},
  year={2020},
  pages={1-11},
  doi={10.1109/TITS.2020.3029451}}
Coresignal | Web Data | Company Data | Global / 71M+ Records / Largest...
datarade.ai
.json, .csv
Updated Feb 21, 2024
Cite
Coresignal (2024). Coresignal | Web Data | Company Data | Global / 71M+ Records / Largest Professional Network / Updated Daily [Dataset]. https://datarade.ai/data-products/coresignal-web-data-company-data-global-69m-records-coresignal
Explore at:
Available download formats: .json, .csv
Dataset updated
Feb 21, 2024
Dataset authored and provided by
Coresignal
Area covered
Sweden, United Kingdom, State of, Nauru, Yemen, Finland, Trinidad and Tobago, Hong Kong, New Zealand, Libya
Description
Our Web Data dataset includes such data points as company name, location, headcount, industry, and size, among others. It offers extensive fresh and historical data, including even companies that operate in stealth mode.

For lead generation

With millions of companies worldwide, Web Company Database helps you filter potential clients based on custom criteria and speed up the conversion process.

Use cases

Filter potential clients according to location, size, and other criteria

Enrich your existing database

Improve conversion rates

Use predictive models to identify potential leads

Group your leads in segments for more accurate targeting

For market and business analysis

Our Web Company Data provides information about millions of companies, allowing you to find your competitors and see their weaknesses and strengths.

Use cases

Pinpoint your competitors

Learn about your competitors' size, headcount, and revenue

Prepare a data-driven plan for the next quarter

For Investors

We recommend B2B Web Data for investors to discover and evaluate businesses with the highest potential.

Gain strategic business insights, enhance decision-making, and maintain algorithms that signal investment opportunities with Coresignal’s global B2B Web Dataset.

Use cases

Screen startups and industries showing early signs of growth

Identify companies hungry for the next investment

Check if a startup is about to reach the next maturity phase

Identify and predict a startup's potential at the founding moment

Choose companies that fit you in terms of size and headcount

For sales prospecting

B2B Web Database saves time your employees would otherwise use to search for potential clients manually.

Use cases

Make a short list of the top prospects

Define which companies are large or small enough to buy your product

Based on the revenue, determine which companies are ready to convert

Sort the companies by their distance from your warehouse to draw a line where selling won't result in satisfactory profit
Dataset Alerts - Open and Monitoring
datasf.org
data.sfgov.org
+1more
application/rdfxml +5
Updated Jun 20, 2025
+ more versions
Cite
(2025). Dataset Alerts - Open and Monitoring [Dataset]. https://datasf.org/opendata/
Explore at:
Available download formats: json, application/rssxml, csv, tsv, xml, application/rdfxml
Dataset updated
Jun 20, 2025
License
ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
Description
A log of dataset alerts open, monitored or resolved on the open data portal. Alerts can include issues as well as deprecation or discontinuation notices.
Resources of IncRML: Incremental Knowledge Graph Construction from...
explore.openaire.eu
zenodo.org
Updated Mar 18, 2024
Cite
Dylan Van Assche; Julian Andres Rojas Melendez; Ben De Meester; Pieter Colpaert (2024). Resources of IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources [Dataset]. http://doi.org/10.5281/zenodo.10171157
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10171157
Dataset updated
Mar 18, 2024
Authors
Dylan Van Assche; Julian Andres Rojas Melendez; Ben De Meester; Pieter Colpaert
Description
IncRML resources

This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources' submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper experiments fully reproducible through our experiment tool written in Python, which was already used before in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) that was used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times and the median values are reported together with the standard deviation of the measurements.

Datasets

We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium. GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset which does not advertise changes at all.

Benchmarks

GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)

GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)

GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)

Real-life use cases

Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)

Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)

Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)

Public transport agency De Lijn (Belgium): bus schedule data (1 week, 7 versions)

Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)

Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)

OpenStreetMap (World): geographical map data (1 day, 1440 versions)

Remarks

The first version of each dataset is always used as a baseline. All subsequent versions are applied as an update on the existing version. The reported results focus only on the updates, since these are the actual incremental generation.

GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded as GTFS-Madrid-Benchmark scale 100 because both share the same parameters (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz for GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz

All datasets are compressed with XZ and provided as a TAR archive. Be aware that you need sufficient space to decompress these archives! 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive; decompressing these requires even more space (4 TB).

Reproducing

Using our experiment tool, you can reproduce the experiments as follows:

Download one of the TAR.XZ archives and unpack it.

Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.

Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.

Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.

Once executed, you can generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.

Testcases

Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394
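The reporting scheme mentioned above (median over 5 runs together with the standard deviation) can be illustrated with a short sketch. The run times below are invented placeholder values, not results from the paper, and the use of the sample standard deviation is an assumption; the experiment tool may compute its statistics differently.

```python
import statistics

# Invented wall-clock times in seconds for five runs of one experiment.
run_times_s = [12.4, 11.9, 12.1, 13.0, 12.2]

# The median is robust against a single unusually slow or fast run.
median_s = statistics.median(run_times_s)
# Assumption: sample standard deviation (n - 1 in the denominator).
stdev_s = statistics.stdev(run_times_s)

print(f"median = {median_s:.2f} s, stdev = {stdev_s:.2f} s")
```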

OS Family	Number of flows
Other	42474
Windows	40349
Android	10290
iOS	8840
Mac OS X	5324
Linux	1589
Ubuntu	653
Fedora	88
Chrome OS	53
Symbian OS	1
Slackware	1
Linux Mint	1

Instagram accounts with the most followers worldwide 2024

statista.com
davegsmith.com

Updated Jun 17, 2025

Cite

Stacy Jo Dixon (2025). Instagram accounts with the most followers worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/

Explore at:

Dataset updated

Jun 17, 2025

Dataset provided by

Statista (http://statista.com/)

Authors

Stacy Jo Dixon

Description

Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.

The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.

How popular is Instagram?

Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.

Who uses Instagram?

Instagram audiences are predominantly young: recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media platforms for teens and one of the social networks with the biggest reach among teens in the United States.

Celebrity influencers on Instagram

Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.

Countries with the most Facebook users 2024

statista.com
ai-chatbox.pro
+1more

Updated Jun 17, 2025

Cite

Stacy Jo Dixon (2025). Countries with the most Facebook users 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/

Explore at:

Dataset updated

Jun 17, 2025

Dataset provided by

Statista (http://statista.com/)

Authors

Stacy Jo Dixon

Description

Which country has the most Facebook users?

There are more than 378 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India’s Facebook audience were a country, it would be ranked third in terms of largest population worldwide. Apart from India, there are several other markets with more than 100 million Facebook users each: the United States, Indonesia, and Brazil with 193.8 million, 119.05 million, and 112.55 million Facebook users respectively.

Facebook – the most used social media

Meta, the company that was previously called Facebook, owns four of the most popular social media platforms worldwide: WhatsApp, Facebook Messenger, Facebook, and Instagram. As of the third quarter of 2021, there were around 3.5 billion cumulative monthly users of the company’s products worldwide. With around 2.9 billion monthly active users, Facebook is the most popular social media platform worldwide. With an audience of this scale, it is no surprise that the vast majority of Facebook’s revenue is generated through advertising.

Facebook usage by device

As of July 2021, it was found that 98.5 percent of active users accessed their Facebook account from mobile devices. In fact, almost 81.8 percent of Facebook audiences worldwide access the platform only via mobile phone. Facebook is available not only through mobile browsers: the company has also published several mobile apps for users to access its products and services. As of the third quarter of 2021, the four core Meta products were leading the ranking of most downloaded mobile apps worldwide, with WhatsApp amassing approximately six billion downloads.

How to choose the right product for your client?
kaggle.com
Updated Mar 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julia Beyers (2020). How to choose the right product for your client? [Dataset]. https://www.kaggle.com/juliabeyers/how-to-choose-the-right-product-for-your-client/discussion
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 23, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Julia Beyers
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4686357%2F186cf4f6172ca2c696819b7b09931bd3%2Fimage3.jpg?generation=1584955857130173&alt=media

The presence of business in the digital space is a must now. Indeed, there’s hardly any company, be it a small startup or an international corporation, that wouldn’t be available online. For this, the company may use one of two options — to develop an app or a website, or both.

In the case of a limited budget, business owners often have to make a choice. Considering that mobile traffic surpassed desktop traffic in 2016 and continues to grow, it becomes obvious that the business should become accessible and convenient for smartphone users. But which is better: a responsive website or a mobile application?

Entrepreneurs often turn to development companies to ask this question. Lacking sufficient knowledge, they hope to get answers to their questions from people with experience in this field. So, we decided to compile a guide that will give you clear and understandable information.

Mobile app

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4686357%2F0541557795519f24d812f78dfb51867e%2Fimage4.png?generation=1584955894277647&alt=media

Let's look at the stats. It will help you understand why a mobile app may be the obvious choice for your client.

In 2019, smartphone users installed about 204 billion(!) applications on their devices. On average, this is more than 26 applications per inhabitant of the planet Earth. And if this is not enough evidence, here’s one more point. The expected revenue of mobile applications will be $189 billion in 2020.

It sounds impressive, but this does not mean that a mobile application is something indispensable for every business. Not at all. Let's go through the pros and cons of a mobile application and try to understand when it is needed.

Pros

A new level of interaction. Mobile applications are a more convenient method of interaction. They load and process content faster. Another useful feature is notifications. Applications are perhaps the best way to inform users about new updates, promotions, and other news (who will read long letters in the mail?).

Personalized targeting. Mobile applications are ideal for products or services that need to be used on an ongoing basis. Options such as creating accounts and entering profile information make applications more personalized than websites. All this allows businesses to target their audience more accurately without wasting money.

Offline usage. That’s another major advantage. Applications can provide users with access to content without an internet connection.

Cons

Development costs. To reach the maximum audience with a mobile app, it is necessary to cover the two main operating systems, iOS and Android. Developing a separate app for each OS can be too expensive for small business owners, who then face a difficult choice. One way out is cross-platform development: there is no need to guess whether your target users prefer iOS or Android, because you create a single app that runs seamlessly on both platforms.

Maintenance. An application is a technical product that needs constant support. Upgrades should be carried out in a timely manner. Users often need to update applications themselves by downloading a new version, which is annoying. Regular bug-fixing for various devices (smartphones, tablets) and different operating systems can be a real problem. In addition, every update must be approved by the store where the application is published.

A mobile app is suitable for businesses that provide interactive and personalized content (this covers all lifestyle and healthcare solutions), require regular app usage (for instance, to-do lists), or rely on visual interaction. For games such as Angry Birds, creating an app is also a wise choice.

Website


To be convenient for users of mobile devices, a website should be responsive. We want to emphasize this because it is critically important. Most Internet traffic comes from mobile devices, so your website should be adaptable, or in other words, mobile-friendly. If mobile users have to zoom in on elements and text to see anything, they will immediately leave your website.

On the other hand, a responsive website has the following benefits.

Pros

Maintenance. Maintaining a website is less costly. When compared to applications where the user mu...

Hynek, Karel (2024). CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7409923

Data from: CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines

Dataset updated

Feb 29, 2024

Dataset provided by

Šiška, Pavel
Lukačovič, Andrej
Čejka, Tomáš
Hynek, Karel
Luxemburk, Jan

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.

We recommend using the CESNET DataZoo Python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

Data capture

The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:

W-2022-44
Uncompressed Size: 19 GB
Capture Period: 31.10.2022 - 6.11.2022
Number of flows: 32.6M

W-2022-45
Uncompressed Size: 25 GB
Capture Period: 7.11.2022 - 13.11.2022
Number of flows: 42.6M

W-2022-46
Uncompressed Size: 20 GB
Capture Period: 14.11.2022 - 20.11.2022
Number of flows: 33.7M

W-2022-47
Uncompressed Size: 25 GB
Capture Period: 21.11.2022 - 27.11.2022
Number of flows: 44.1M

CESNET-QUIC22
Uncompressed Size: 89 GB
Capture Period: 31.10.2022 - 27.11.2022
Number of flows: 153M
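
As a quick consistency check, the per-week figures can be summed and compared with the CESNET-QUIC22 totals. The sketch below only restates the numbers listed above; the dictionary layout and key names (`size_gb`, `flows_m`) are illustrative and are not part of the dataset.

```python
# Per-week figures copied from the list above (sizes in GB, flows in millions).
# The key names are hypothetical; they exist only for this check.
weeks = {
    "W-2022-44": {"size_gb": 19, "flows_m": 32.6},
    "W-2022-45": {"size_gb": 25, "flows_m": 42.6},
    "W-2022-46": {"size_gb": 20, "flows_m": 33.7},
    "W-2022-47": {"size_gb": 25, "flows_m": 44.1},
}

total_size_gb = sum(week["size_gb"] for week in weeks.values())
# Round to one decimal to absorb floating-point noise in the sum.
total_flows_m = round(sum(week["flows_m"] for week in weeks.values()), 1)

print(total_size_gb, total_flows_m)
```

Both sums agree with the stated totals of 89 GB and 153M flows.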

Data description

The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the version of the QUIC protocol in use, and the user agent string, which is available in a subset of QUIC communications.

Packet Sequences

Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1: +1 means a packet sent from client to server, and -1 a packet sent from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the flow exporter used. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

Flow statistics

Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason: either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field covers the "forced end" and "lack of resources" reasons. The "end of flow detected" reason is not considered because it is not relevant for UDP connections.

Dataset structure

The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:
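
The eight-bin layout described for the packet histograms can be illustrated with a small sketch. It is not the ipfixprobe PHISTS implementation; the names `BIN_UPPER_BOUNDS`, `bin_index`, and `histogram` are hypothetical, and the code assumes integer inputs (bytes or milliseconds) and takes the intervals exactly as listed, with 1024 belonging to the seventh bin.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first seven bins as listed in the description:
# 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024.
# Any value above 1024 falls into the eighth bin.
BIN_UPPER_BOUNDS = [15, 31, 63, 127, 255, 511, 1024]


def bin_index(value):
    """Return the 0-based bin (0..7) for a packet size in bytes or an inter-packet time in ms."""
    return bisect_left(BIN_UPPER_BOUNDS, value)


def histogram(values):
    """Count how many values fall into each of the eight bins."""
    counts = [0] * (len(BIN_UPPER_BOUNDS) + 1)
    for value in values:
        counts[bin_index(value)] += 1
    return counts


# Made-up packet sizes, used only to exercise the binning.
example_sizes = [40, 1200, 1250, 300, 80, 16]
example_hist = histogram(example_sizes)
print(example_hist)
```

Because the bounds are inclusive, `bisect_left` places a value equal to a bound (for example 15 or 1024) in the bin that ends at that bound.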

ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
QUIC_VERSION: QUIC protocol version
QUIC_SNI: Server Name Indication domain
QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
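
The relationship between the PPI field and the derived PPI_LEN, PPI_DURATION, and PPI_ROUNDTRIPS fields can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the PPI cell is a JSON-compatible nested list, that inter-packet times are in milliseconds with the first entry being 0, and it shows two possible readings of the roundtrip definition, because the description mentions both direction changes and request/response pairs. All function names and the sample value are hypothetical; the convention actually used should be checked against the PPI_ROUNDTRIPS column.

```python
import json


def parse_ppi(ppi_text):
    """Split a PPI cell into its three parallel sequences (assumes JSON-compatible text)."""
    inter_packet_times, directions, sizes = json.loads(ppi_text)
    if not (len(inter_packet_times) == len(directions) == len(sizes)):
        raise ValueError("PPI sequences must have equal length")
    return inter_packet_times, directions, sizes


def direction_changes(directions):
    """Literal reading: count every change of direction between consecutive packets."""
    return sum(1 for prev, cur in zip(directions, directions[1:]) if prev != cur)


def request_response_pairs(directions):
    """Alternative reading: count client-to-server (+1) to server-to-client (-1) transitions."""
    return sum(1 for prev, cur in zip(directions, directions[1:]) if prev == 1 and cur == -1)


def ppi_summary(ppi_text):
    """Derive length, duration in seconds, and both roundtrip counts from one PPI cell."""
    inter_packet_times, directions, _sizes = parse_ppi(ppi_text)
    return {
        "length": len(directions),
        # Assumption: inter-packet times are milliseconds, so divide by 1000 for seconds.
        "duration_s": sum(inter_packet_times) / 1000,
        "direction_changes": direction_changes(directions),
        "request_response_pairs": request_response_pairs(directions),
    }


# Made-up PPI value with four packets, used only for illustration.
sample_ppi = "[[0, 12, 3, 40], [1, -1, -1, 1], [1200, 1250, 300, 80]]"
summary = ppi_summary(sample_ppi)
print(summary)
```

For the sample value, the sequence has four packets, two direction changes, and one request/response pair.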

Link to other CESNET datasets

https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo

Please cite the original data article:

@article{CESNETQUIC22,
  author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
  title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
  journal = {Data in Brief},
  pages = {108888},
  year = {2023},
  issn = {2352-3409},
  doi = {https://doi.org/10.1016/j.dib.2023.108888},
  url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
