100+ datasets found
  1. Number of internet users worldwide 2014-2029

    • statista.com
    Updated Apr 11, 2025
    Cite
    Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    World
    Description

    The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After fifteen consecutive years of growth, the number of users is estimated to reach a new peak of 7 billion in 2029. Notably, the number of internet users has increased continuously over recent years. Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and are processed to generate comparable datasets (see supplementary notes under details for more information). Find more key insights for the number of internet users in regions such as the Americas and Asia.

  2. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
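    A rough sketch of those two steps, link extraction by regex and SHA-256 anonymisation. The regex pattern and helper names are illustrative, not the authors' actual code:

```python
import hashlib
import re

# Illustrative pattern: match links to any Wikipedia language subdomain.
WIKI_URL = re.compile(r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/[^\s)\]]+")

def extract_wikipedia_links(text):
    """Return all Wikipedia URLs found in a Reddit post or comment body."""
    return [m.group(0) for m in WIKI_URL.finditer(text)]

def anonymize_id(reddit_id):
    """Hash a Reddit post/comment/user/subreddit ID with SHA-256."""
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

comment = "See https://en.wikipedia.org/wiki/Phishing for background."
print(extract_wikipedia_links(comment))  # ['https://en.wikipedia.org/wiki/Phishing']
print(anonymize_id("t1_abc123")[:16])
```

    Hashing rather than dropping the IDs preserves joinability (the same author or subreddit always maps to the same digest) without exposing the original identifiers.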

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    | Column Name | Type | Description |
    |---|---|---|
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | created_at | TIMESTAMP | The timestamp when the post was created. |
    | updated_at | TIMESTAMP | The timestamp when the post was last updated. |
    | language_code | TEXT | The language code of the post. |
    | score | INTEGER | The score (upvotes minus downvotes) of the post. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes. |
    | gildings | INTEGER | Number of awards (gildings) received by the post. |
    | num_comments | INTEGER | Number of comments on the post. |

    Table: comments

    | Column Name | Type | Description |
    |---|---|---|
    | subreddit_id | TEXT | The unique identifier for the subreddit. |
    | post_id | TEXT | The ID of the Reddit post the comment belongs to. |
    | parent_id | TEXT | The ID of the parent comment (if a reply). |
    | comment_id | TEXT | Unique identifier for the comment. |
    | created_at | TIMESTAMP | The timestamp when the comment was created. |
    | last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
    | score | INTEGER | The score (upvotes minus downvotes) of the comment. |
    | upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
    | gilded | INTEGER | Number of awards (gildings) received by the comment. |

    Table: postlinks

    | Column Name | Type | Description |
    |---|---|---|
    | post_id | TEXT | Unique identifier for the Reddit post. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the Reddit post. |
    | final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
    | final_url | TEXT | The final URL after redirections. |
    | redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
    | in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

    Table: commentlinks

    | Column Name | Type | Description |
    |---|---|---|
    | comment_id | TEXT | Unique identifier for the Reddit comment. |
    | end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
    | end_processed_url | TEXT | The extracted URL from the comment. |
    | final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
    | final_status | INTEGER | HTTP status code of the final URL. |
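    As a minimal usage sketch, the posts and postlinks tables above can be joined to find posts that link to valid Wikipedia articles. This uses an in-memory SQLite stand-in with invented sample rows; in practice you would open the distributed database file instead:

```python
import sqlite3

# In-memory stand-in for the WikiReddit SQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    subreddit_id TEXT, crosspost_parent_id TEXT, post_id TEXT,
    created_at TIMESTAMP, updated_at TIMESTAMP, language_code TEXT,
    score INTEGER, upvote_ratio REAL, gildings INTEGER, num_comments INTEGER
);
CREATE TABLE postlinks (
    post_id TEXT, end_processed_valid INTEGER, end_processed_url TEXT,
    final_valid INTEGER, final_status INTEGER, final_url TEXT,
    redirected INTEGER, in_title INTEGER
);
""")
conn.execute("INSERT INTO posts VALUES ('s1', NULL, 'p1', '2021-06-01 12:00:00', NULL, 'en', 42, 0.97, 0, 7)")
conn.execute("INSERT INTO postlinks VALUES ('p1', 1, 'https://en.wikipedia.org/wiki/Phishing', 1, 200, 'https://en.wikipedia.org/wiki/Phishing', 0, 0)")

# Highest-scoring posts that link to a valid Wikipedia article.
rows = conn.execute("""
    SELECT p.post_id, p.score, l.final_url
    FROM posts p JOIN postlinks l ON p.post_id = l.post_id
    WHERE l.final_valid = 1
    ORDER BY p.score DESC
""").fetchall()
print(rows)
```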

  3. Data from: Penalized and Constrained Optimization: An Application to...

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Gareth M. James; Courtney Paulson; Paat Rusmevichientong (2023). Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising [Dataset]. http://doi.org/10.6084/m9.figshare.8023382.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Gareth M. James; Courtney Paulson; Paat Rusmevichientong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Firms are increasingly transitioning advertising budgets to Internet display campaigns, but this transition poses new challenges. These campaigns use numerous potential metrics for success (e.g., reach or click rate), and because each website represents a separate advertising opportunity, this is also an inherently high-dimensional problem. Further, advertisers often have constraints they wish to place on their campaign, such as targeting specific sub-populations or websites. These challenges require a method flexible enough to accommodate thousands of websites, as well as numerous metrics and campaign constraints. Motivated by this application, we consider the general constrained high-dimensional problem, where the parameters satisfy linear constraints. We develop the Penalized and Constrained optimization method (PaC) to compute the solution path for high-dimensional, linearly constrained criteria. PaC is extremely general; in addition to internet advertising, we show it encompasses many other potential applications, such as portfolio estimation, monotone curve estimation, and the generalized lasso. Computing the PaC coefficient path poses technical challenges, but we develop an efficient algorithm over a grid of tuning parameters. Through extensive simulations, we show PaC performs well. Finally, we apply PaC to a proprietary dataset in an exemplar Internet advertising case study and demonstrate its superiority over existing methods in this practical setting. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

  4. Data from: Youtube social network

    • kaggle.com
    zip
    Updated Sep 1, 2019
    Cite
    Lorenzo De Tomasi (2019). Youtube social network [Dataset]. https://www.kaggle.com/datasets/lodetomasi1995/youtube-social-network
    Explore at:
    Available download formats: zip (10604317 bytes)
    Dataset updated
    Sep 1, 2019
    Authors
    Lorenzo De Tomasi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    YouTube social network and ground-truth communities. Dataset information: YouTube is a video-sharing website that includes a social network. In the YouTube social network, users form friendships with each other, and users can create groups which other users can join. We consider such user-defined groups as ground-truth communities. This data was provided by Alan Mislove et al.

    We regard each connected component in a group as a separate ground-truth community. We remove ground-truth communities which have fewer than 3 nodes. We also provide the top 5,000 communities with the highest quality, as described in our paper. As for the network, we provide the largest connected component.

    more info : https://snap.stanford.edu/data/com-Youtube.html

  5. Data from: The National Stream Internet project

    • agdatacommons.nal.usda.gov
    bin
    Updated Dec 18, 2023
    Cite
    Dan Isaak; Erin Peterson; Jay Ver Hoef; David Nagel (2023). The National Stream Internet project [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/The_National_Stream_Internet_project/24853041
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    Dan Isaak; Erin Peterson; Jay Ver Hoef; David Nagel
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The rate at which new information about stream resources is being created has accelerated with the recent development of spatial stream-network models (SSNMs), the growing availability of stream databases, and ongoing advances in geospatial science and computational efficiency. To further enhance information development, the National Stream Internet (NSI) project was developed as a means of providing a consistent, flexible analytical infrastructure that can be applied with many types of stream data anywhere in the country. A key part of that infrastructure is the NSI network, a digital GIS layer which has a specific topological structure that was designed to work effectively with SSNMs. The NSI network was derived from the National Hydrography Dataset Plus, Version 2 (NHDPlusV2) following technical procedures that ensure compatibility with SSNMs. The SSN models outperform traditional statistical techniques applied to stream data, enable predictions at unsampled locations to create status maps for river networks, and work particularly well with databases aggregated from multiple sources that contain clustered sampling locations. The NSI project is funded by the U.S. Fish & Wildlife Service's Landscape Conservation Cooperative program and has two simple objectives: 1) refine key spatial and statistical stream software and digital databases for compatibility so that a nationally consistent analytical infrastructure exists and is easy to apply; and 2) engage a grassroots user-base in application of this infrastructure so they are empowered to create new and valuable information from stream databases anywhere in the country. This website is a hub designed to connect users with software, data, and tools for creating that information. As better information is developed, it should enable stronger science, management, and conservation as pertains to stream ecosystems.
    Resources in this dataset:
    Resource Title: Website Pointer to the National Stream Internet.
    File Name: Web Page, url: https://www.fs.fed.us/rm/boise/AWAE/projects/NationalStreamInternet.html
    The National Stream Internet (NSI) is a network of people, data, and analytical techniques that interact synergistically to create information about streams. Elements and tools composing the NSI, including STARS, NHDPlusV2, and SSNs, enable integration of existing databases (e.g., water quality parameters, biological surveys, habitat condition) and development of new information using sophisticated spatial-statistical network models (SSNMs). The NSI provides a nationally consistent framework for analysis of stream data that can greatly improve the accuracy of status and trend assessments. The NSI project is described, together with an analytical infrastructure for using the spatial statistical network models with many types of stream datasets.

  6. Job Offers Web Scraping Search

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Job Offers Web Scraping Search [Dataset]. https://www.kaggle.com/datasets/thedevastator/job-offers-web-scraping-search
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Offers Web Scraping Search

    Targeted Results to Find the Optimal Work Solution

    By [source]

    About this dataset

    This dataset collects job offers from web scraping, filtered according to specific keywords, locations, and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match their personal situation, skillset, and preferences in terms of location and schedule. The columns provide detailed information on job titles, employer names, locations, and time frames, as well as other necessary parameters, so you can make a smart choice for your next career opportunity.


    How to use the dataset

    This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:

    • Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.

    • Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!

    • Finally, consider when the position is available – the Time frame column indicates when each posting was made, and whether it is a full-time, part-time, or casual/temporary position, so make sure it meets your requirements before applying!

    • Additionally, if details such as hours per week or further schedule information are important criteria, that info is provided in the Horari and Temps_Oferta columns. Once all three criteria have been ticked off (keywords, location, and time frame), take a look at the Empresa (company name) and Nom_Oferta (offer name) columns to get an idea of who would be employing you should you land the gig!

      All these pieces of data put together should give any motivated individual everything they need to seek out an optimal work solution. Keep hunting, and good luck!

    Research Ideas

    • Machine learning can be used to group job offers, facilitating the identification of similarities and differences between them. This could allow users to target their search for a work solution more specifically.
    • The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better-informed decisions about their career options and goals.
    • It may also provide insight into the local job market, enabling companies and employers to identify potential new opportunities or trends that may previously have gone unnoticed.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: web_scraping_information_offers.csv

    | Column name | Description |
    |---|---|
    | Nom_Oferta | Name of the job offer. (String) |
    | Empresa | Company offering the job. (String) |
    | Ubicació | Location of the job offer. (String) |
    | Temps_Oferta | Time of the job offer. (String) |
    | Horari | Schedule of the job offer. (String) |
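    A minimal sketch of the keyword-and-location filtering described in the usage guide above, using the documented column names. The sample rows and the find_offers helper are invented for illustration:

```python
import csv
import io

# Invented sample rows using the documented column names.
sample_csv = """Nom_Oferta,Empresa,Ubicació,Temps_Oferta,Horari
Data Analyst,Acme,Barcelona,Full-time,9-17
Web Scraper Dev,Widgets SL,Madrid,Part-time,15-19
Data Engineer,Acme,Barcelona,Full-time,9-17
"""

def find_offers(rows, keyword, location):
    """Keep offers whose title contains `keyword` and whose location matches."""
    return [r for r in rows
            if keyword.lower() in r["Nom_Oferta"].lower()
            and r["Ubicació"] == location]

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for offer in find_offers(rows, "data", "Barcelona"):
    print(offer["Nom_Oferta"], "@", offer["Empresa"], offer["Horari"])
```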


  7. Internet and Computer use, London - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Jun 9, 2025
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2025). Internet and Computer use, London - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/internet-and-computer-use-london
    Explore at:
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Area covered
    London
    Description

    Statistics of how many adults access the internet and use different types of technology, covering:
    • home internet access
    • how people connect to the web
    • how often people use the web/computers
    • whether people use mobile devices
    • whether people buy goods over the web
    • whether people carried out specified activities over the internet
    For more information see the ONS website and the UKDS website.

  8. Phishing Websites Detection

    • kaggle.com
    zip
    Updated May 28, 2020
    Cite
    J Akshaya (2020). Phishing Websites Detection [Dataset]. https://www.kaggle.com/akshaya1508/phishing-websites-detection
    Explore at:
    Available download formats: zip (80950 bytes)
    Dataset updated
    May 28, 2020
    Authors
    J Akshaya
    Description

    Context

    Phishing is a form of identity theft that occurs when a malicious website impersonates a legitimate one in order to acquire sensitive information such as passwords, account details, or credit card numbers. People tend to fall prey to this very easily, thanks to the commendable craftsmanship of the attackers, which makes people believe they are on a legitimate website. There is a need to identify potential phishing websites and differentiate them from legitimate ones. This dataset identifies prominent features of phishing websites; 10 such features have been identified.

    Content

    Generally, the open-source datasets available on the internet do not come with the code and logic behind them, which raises certain problems:

    1. Limited Data: ML algorithms can only be tested against the existing phishing URLs; no new phishing URLs can be checked for validity.
    2. Outdated URLs: The datasets available on the internet were uploaded a long time ago, while new kinds of phishing URLs arise every second.
    3. Outdated Features: Those datasets were uploaded a long time ago, while new methodologies keep arising in phishing techniques.
    4. No Access to Backend: There is no stepwise guide describing how each feature was derived.

    On the contrary, we are trying to overcome all the above-mentioned problems.

    1. Real-Time Data: Before applying a machine learning algorithm, we can run the script and fetch real-time URLs from PhishTank (for phishing URLs) and from Moz (for legitimate URLs).
    2. Scalable Data: We can also specify the number of URLs we want to feed the model, and the web scraper will fetch that amount of data from the websites. Presently we are using 1401 URLs in this project, i.e. 901 phishing URLs and 500 legitimate URLs.
    3. New Features: We have tried to implement the prominent new features present in current phishing URLs, and since we own the code, new features can also be added.
    4. Source Code on GitHub: The source code is published on GitHub for public use and further improvement. This way there is transparency in the logic, and more creators can add their meaningful additions to the code.
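    One classic feature in phishing datasets of this kind, whether the URL's host is a raw IP address rather than a domain name, can be checked roughly like this (a hedged sketch, not this project's actual feature code):

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return True if the URL's host is a literal IP address (a common
    phishing indicator), False if it is a regular domain name."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(host_is_ip("http://192.168.12.34/login"))     # True
print(host_is_ip("https://www.example.com/login"))  # False
```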

    Link to the source code

    https://github.com/akshaya1508/detection_of_phishing_websites.git

    Inspiration

    The idea to develop the dataset and the code for it was inspired by various other creators who have worked along similar lines.

  9. Cybersecurity: Suspicious Web Threat Interactions

    • kaggle.com
    Updated Apr 27, 2024
    Cite
    JanCSG (2024). Cybersecurity: Suspicious Web Threat Interactions [Dataset]. https://www.kaggle.com/datasets/jancsg/cybersecurity-suspicious-web-threat-interactions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JanCSG
    License

    GNU GPL 3.0: https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This dataset contains web traffic records collected through AWS CloudWatch, aimed at detecting suspicious activities and potential attack attempts.

    The data were generated by monitoring traffic to a production web server, using various detection rules to identify anomalous patterns.

    Context

    In today's cloud environments, cybersecurity is more crucial than ever. The ability to detect and respond to threats in real time can protect organizations from significant consequences. This dataset provides a view of web traffic that has been labeled as suspicious, offering a valuable resource for developers, data scientists, and security experts to enhance threat detection techniques.

    Dataset Content

    Each entry in the dataset represents a stream of traffic to a web server, including the following columns:

    bytes_in: Bytes received by the server.

    bytes_out: Bytes sent from the server.

    creation_time: Timestamp of when the record was created.

    end_time: Timestamp of when the connection ended.

    src_ip: Source IP address.

    src_ip_country_code: Country code of the source IP.

    protocol: Protocol used in the connection.

    response.code: HTTP response code.

    dst_port: Destination port on the server.

    dst_ip: Destination IP address.

    rule_names: Name of the rule that identified the traffic as suspicious.

    observation_name: Observations associated with the traffic.

    source.meta: Metadata related to the source.

    source.name: Name of the traffic source.

    time: Timestamp of the detected event.

    detection_types: Type of detection applied.

    Potential Uses

    This dataset is ideal for:

    • Anomaly Detection: Developing models to detect unusual behaviors in web traffic.
    • Classification Models: Training models to automatically classify traffic as normal or suspicious.
    • Security Analysis: Conducting security analyses to understand the tactics, techniques, and procedures of attackers.
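    As a starting point for the anomaly-detection use case, a crude statistical baseline might flag records whose bytes_in deviates strongly from the mean. This is a pure-Python sketch with invented sample records; real detectors would be considerably more sophisticated:

```python
from statistics import mean, stdev

def flag_anomalies(records, key="bytes_in", z_threshold=3.0):
    """Flag records whose `key` value lies more than z_threshold standard
    deviations from the mean -- a crude baseline, not a production detector."""
    values = [r[key] for r in records]
    mu, sigma = mean(values), stdev(values)
    return [r for r in records if sigma and abs(r[key] - mu) / sigma > z_threshold]

# Invented sample traffic records mirroring the dataset's columns.
traffic = [{"src_ip": f"10.0.0.{i}", "bytes_in": 500 + i} for i in range(20)]
traffic.append({"src_ip": "203.0.113.9", "bytes_in": 5_000_000})  # outlier

for r in flag_anomalies(traffic):
    print("suspicious:", r["src_ip"], r["bytes_in"])
```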
  10. Complete Domain Whois dataset (all zones)

    • datarade.ai
    .json, .csv
    Updated Dec 16, 2022
    + more versions
    Cite
    Netlas.io (2022). Complete Domain Whois dataset (all zones) [Dataset]. https://datarade.ai/data-products/complete-domain-whois-dataset-all-zones-netlas-io
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    Netlas.io
    Area covered
    Spain, Lebanon, Timor-Leste, Cabo Verde, Mauritius, Fiji, Armenia, Slovenia, Latvia, Guadeloupe
    Description

    Netlas.io is a set of internet intelligence apps that provide accurate technical information on IP addresses, domain names, websites, web applications, IoT devices, and other online assets.

    Netlas.io maintains five general data collections: Responses (internet scan data), DNS Registry data, IP Whois data, Domain Whois data, SSL Certificates.

    This dataset contains Domain WHOIS data. It covers active domains only, including just-registered, published, and parked domains, domains in the redemption grace period (awaiting renewal), and domains pending deletion. This dataset does not include any historical records.

  11. phishing.arff

    • figshare.com
    txt
    Updated Jul 10, 2024
    Cite
    Ambroise Odonnat (2024). phishing.arff [Dataset]. http://doi.org/10.6084/m9.figshare.26232710.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ambroise Odonnat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the data from the Phishing Websites dataset provided in [1]. All the features are categorical and were preprocessed into integer values. The data can be downloaded from https://archive.ics.uci.edu/dataset/327/phishing+websites. There are 11055 samples with 30 features. Websites belong to 2 domains: websites whose URL uses an IP address instead of a domain name, and websites whose URL uses a domain name. For reference, please refer to: [1] R. Mohammad, F. Thabtah, L. McCluskey. An assessment of features related to phishing websites using an automated technique. In International Conference for Internet Technology and Secured Transactions, 2012.
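    ARFF is a plain-text format: a header of @attribute declarations followed by a @data section of comma-separated values. A minimal hand-rolled parse of a snippet in the same shape as this dataset might look like the following (the attribute subset shown is illustrative; in practice scipy.io.arff.loadarff or the liac-arff package would normally be used):

```python
# Minimal ARFF snippet in the same shape as the phishing dataset:
# categorical features preprocessed into integer values.
arff_text = """\
@relation phishing
@attribute having_IP_Address {-1,1}
@attribute URL_Length {-1,0,1}
@attribute Result {-1,1}
@data
-1,1,-1
1,0,1
"""

def parse_arff(text):
    """Tiny ARFF reader: returns (attribute_names, rows of ints)."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            names.append(line.split()[1])
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append([int(v) for v in line.split(",")])
    return names, rows

names, rows = parse_arff(arff_text)
print(names)  # ['having_IP_Address', 'URL_Length', 'Result']
print(rows)   # [[-1, 1, -1], [1, 0, 1]]
```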

  12. Performance Metrics - Innovation & Technology - City Website Availability

    • catalog.data.gov
    • data.cityofchicago.org
    • +2more
    Updated Dec 2, 2023
    Cite
    data.cityofchicago.org (2023). Performance Metrics - Innovation & Technology - City Website Availability [Dataset]. https://catalog.data.gov/dataset/performance-metrics-innovation-technology-city-website-availability
    Explore at:
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    data.cityofchicago.org
    Description

    The City's Internet site allows residents to access City services online, learn more about the City of Chicago, and find other pertinent information. The dataset reports the City website's weekly uptime percentage (the share of time the site was available) alongside the target uptime for each week (shown by mousing over columns in the source visualization). The target availability for this site is 99.5%.
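    For context, a 99.5% weekly availability target translates into a concrete downtime budget, which can be computed directly:

    ```python
    # Downtime budget implied by a 99.5% weekly availability target.
    minutes_per_week = 7 * 24 * 60               # 10,080 minutes in a week
    allowed_downtime = (1 - 0.995) * minutes_per_week
    print(round(allowed_downtime, 1))            # allowed downtime, in minutes
    ```

    So the site may be unavailable for roughly 50 minutes in a week and still meet the target.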

  13. Internet Prices around 200+ countries in 2022.

    • kaggle.com
    Updated Sep 10, 2022
    Cite
    Ram Jas (2022). Internet Prices around 200+ countries in 2022. [Dataset]. https://www.kaggle.com/datasets/ramjasmaurya/1-gb-internet-price/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Jas
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description


    India's cheapest internet is the motivation behind this dataset.

    Internet in India began in 1986 and was available only to the educational and research community. General public access to the internet began on 15 August 1995, and as of 2020 there were 718.74 million active internet users, comprising 54.29% of the population.

    As of May 2014, the internet was delivered to India mainly by nine different undersea fibre-optic cables, including SEA-ME-WE 3, the Bay of Bengal Gateway, and the Europe India Gateway, arriving at five different landing points. India also has one overland internet connection, at the city of Agartala near the border with Bangladesh.

    The Indian Government has embarked on projects such as BharatNet, Digital India, Brand India, and Startup India to further expedite the growth of internet-based ecosystems.

    ...know more at www.wikipedia.com


  14. Forward DNS (A, MX, NS, CNAME and TXT records)

    • datarade.ai
    .json, .csv
    Updated Nov 29, 2021
    + more versions
    Cite
    Netlas.io (2021). Forward DNS (A, MX, NS, CNAME and TXT records) [Dataset]. https://datarade.ai/data-products/whole-dns-registry-a-mx-ns-cname-and-txt-records-netlas
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Nov 29, 2021
    Dataset provided by
    Netlas.io
    Area covered
    Ecuador, Åland Islands, Nauru, New Caledonia, Mauritius, Seychelles, Falkland Islands (Malvinas), India, Saint Helena, Djibouti
    Description

    Netlas.io is a set of internet intelligence apps that provide accurate technical information on IP addresses, domain names, websites, web applications, IoT devices, and other online assets.

    Netlas.io scans every IPv4 address and every known domain name using protocols such as HTTP, FTP, SMTP, POP3, IMAP, SMB/CIFS, SSH, Telnet, SQL, and others. Collected data is enriched with additional information and made available in the Netlas.io Search Engine. Some parts of the Netlas.io database are available as downloadable datasets.

    Netlas.io accumulates domain names to make internet scan coverage as wide as possible. Domain names are collected from ICANN Centralized Zone Data Service, SSL Certificates, 301 & 302 HTTP redirects (while scanning) and other sources.

    This dataset contains domains and subdomains (all gTLDs and ccTLDs) that have at least one associated DNS registry entry (A, MX, NS, CNAME, or TXT record).
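    A sketch of how per-domain records from such a dataset might be grouped, assuming a hypothetical JSON-lines layout (the actual Netlas export schema may differ):

    ```python
    import json
    from collections import defaultdict

    # Hypothetical JSON-lines excerpt; field names and values are invented.
    lines = [
        '{"domain": "example.com", "type": "A", "value": "93.184.216.34"}',
        '{"domain": "example.com", "type": "MX", "value": "mail.example.com"}',
        '{"domain": "sub.example.com", "type": "CNAME", "value": "example.com"}',
    ]

    # Group all DNS records under their domain name.
    records = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        records[entry["domain"]].append((entry["type"], entry["value"]))

    print(dict(records))
    ```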

  15. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Area covered
    New York, Staten Island
    Description

    The data represent a web scrape of hyperlinks from a selection of environmental stewardship organizations identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two datasets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

    For dataset 1: Organizations were selected from the 2017 NYC STEW-MAP (USDA 2017), a publicly available spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected as a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

    For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge array (node1, node2, edge attribute) for network analysis. See the "README" file for further details.

    References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
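    The (node1, node2, edge attribute) edge-array format described for dataset 2 can be consumed directly for network analysis; the sketch below computes weighted out-degrees from made-up placeholder rows (not real data from the scrape):

    ```python
    from collections import defaultdict

    # Edge array in (node1, node2, edge attribute) form. The rows below are
    # invented placeholders; node names and weights are not from the dataset.
    edges = [
        ("org_a", "org_b", 3),   # e.g. 3 hyperlinks from org_a's site to org_b's
        ("org_a", "org_c", 1),
        ("org_b", "org_c", 2),
    ]

    # Weighted out-degree: total outgoing hyperlinks per organization.
    out_degree = defaultdict(int)
    for src, dst, weight in edges:
        out_degree[src] += weight

    print(dict(out_degree))
    ```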

  16. Data from: VLC Data: A Multi-Class Network Traffic Dataset Covering Diverse...

    • zenodo.org
    • producciocientifica.uv.es
    • +1more
    bin
    Updated Apr 24, 2025
    + more versions
    Cite
    Francisco Rau; Francisco Rau; Carlos Herranz Claveras; Carlos Herranz Claveras; Iñaki Val; Iñaki Val; Joaquin Perez; Joaquin Perez (2025). VLC Data: A Multi-Class Network Traffic Dataset Covering Diverse Applications and Platforms [Dataset]. http://doi.org/10.5281/zenodo.15121418
    Explore at:
    bin (available download formats)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco Rau; Francisco Rau; Carlos Herranz Claveras; Carlos Herranz Claveras; Iñaki Val; Iñaki Val; Joaquin Perez; Joaquin Perez
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 1, 2025
    Description

    VLC Data: A Multi-Class Network Traffic Dataset Covering Diverse Applications and Platforms

    Valencia Data (VLC Data) is a network traffic dataset collected from various applications and platforms. It includes both encrypted and, when applicable, unencrypted protocols, capturing realistic usage scenarios and application-specific behavior.

    The dataset comprises 58 pcapng files covering 18.5 hours and 24.26 GB of capture, with traffic from:

    • Video streaming: Netflix and Prime Video (10–50 min) via Firefox.
    • Gaming: Roblox sessions on Windows (20–35 min), recorded outside of virtual machines, since Roblox does not run in a VM.
    • Video conferencing: Microsoft Teams (20 min) via Firefox.
    • Web browsing: Wikipedia, BBC, Google, LinkedIn, Amazon, and OWIN6G (2–5 min) via Firefox or Chrome.
    • Audio streaming: Spotify (30–33 min) on multiple OS.
    • Web streaming: YouTube in 4K and Full HD (20–30 min).

    This dataset is publicly available for traffic analysis across different apps, protocols, and systems.

    Table Description:

    Type | Application | Platform | Time [min] | Comments | Filename | Size (MB)
    Video Streaming | Netflix | Linux | 10 | Running Netflix on Firefox Browser | netflix_linux_10m_01 | 95.1
    Video Streaming | Netflix | Linux | 20 | Running Netflix on Firefox Browser | netflix_linux_20m_01 | 167.7
    Video Streaming | Netflix | Linux | 20 | Running Netflix on Firefox Browser | netflix_linux_20m_02 | 237.9
    Video Streaming | Netflix | Linux | 20 | Running Netflix on Firefox Browser | netflix_linux_20m_03 | 212.6
    Video Streaming | Netflix | Linux | 25 | Running Netflix on Firefox, but 2 min in Menu | netflix_linux_25m_01 | 610.7
    Video Streaming | Netflix | Linux | 35 | Running Netflix on Firefox, but 1 min in Menu | netflix_linux_35m_01 | 534.8
    Video Streaming | Netflix | Linux | 50 | Running Netflix on Firefox Browser | netflix_linux_50m_01 | 660.9
    Video Streaming | Netflix | Windows | 10 | Running Netflix on Firefox Browser | netflix_windows_10m_01 | 132.1
    Video Streaming | Netflix | Windows | 20 | Running Netflix on Firefox Browser | netflix_windows_20m_01 | 506.4
    Video Streaming | Prime Video | Linux | 20 | Running Prime Video on Firefox Browser | prime_linux_20m_01 | 767.3
    Video Streaming | Prime Video | Linux | 20 | Running Prime Video on Firefox Browser | prime_linux_20m_02 | 569.3
    Video Streaming | Prime Video | Windows | 20 | Running Prime Video on Firefox Browser | prime_windows_20m_01 | 512.3
    Video Streaming | Prime Video | Windows | 20 | Running Prime Video on Firefox Browser | prime_windows_20m_02 | 364.2
    Gaming | Roblox | Windows | 20 | Doesn't run in VM | roblox_windows_20m_01 | 127.5
    Gaming | Roblox | Windows | 20 | Doesn't run in VM | roblox_windows_20m_02 | 378.5
    Gaming | Roblox | Windows | 20 | Doesn't run in VM | roblox_windows_20m_03 | 458.9
    Gaming | Roblox | Windows | 30 | Doesn't run in VM | roblox_windows_30m_01 | 519.8
    Gaming | Roblox | Windows | 30 | Doesn't run in VM | roblox_windows_30m_02 | 357.3
    Gaming | Roblox | Windows | 35 | Doesn't run in VM | roblox_windows_35m_01 | 880.4
    Audio Streaming | Spotify | Linux | 30 | Running Spotify app on Ubuntu-Linux | spotify_linux_30m_01 | 98.2
    Audio Streaming | Spotify | Linux | 30 | Running Spotify app on Ubuntu-Linux | spotify_linux_30m_02 | 112.2
    Audio Streaming | Spotify | Linux | 30 | Running Spotify app on Ubuntu-Linux | spotify_linux_30m_03 | 175.5
    Audio Streaming | Spotify | Windows | 30 | Running Spotify app on Windows | spotify_windows_30m_01 | 50.7
    Audio Streaming | Spotify | Windows | 30 | Doesn't run in VM | spotify_windows_30m_02 | 63.2
    Audio Streaming | Spotify | Windows | 33 | Running Spotify app on Windows | spotify_windows_33m_01 | 70.9
    Video Conferencing | Teams | Linux | 20 | Running Teams on Firefox Browser | teams_linux_20m_01 | 134.6
    Video Conferencing | Teams | Linux | 20 | Running Teams on Firefox Browser | teams_linux_20m_02 | 343.3
    Video Conferencing | Teams | Linux | 20 | Running Teams on Firefox Browser | teams_linux_20m_03 | 376.6
    Video Conferencing | Teams | Windows | 20 | Running Teams on Firefox Browser | teams_windows_20m_01 | 634.1
    Video Conferencing | Teams | Windows | 20 | Running Teams on Firefox Browser | teams_windows_20m_02 | 517.8
    Video Conferencing | Teams | Windows | 20 | Running Teams on Firefox Browser | teams_windows_20m_03 | 629.9
    Web Browsing | Web | Linux | 2 | OWIN6G website on Firefox Browser | web_linux_2m_owin6g | 1.2
    Web Browsing | Web | Linux | 2 | Wikipedia website on Firefox Browser | web_linux_2m_wikipedia | 19.7
    Web Browsing | Web | Linux | 3 | OWIN6G website on Firefox Browser | web_linux_3m_owin6g | 4.5
    Web Browsing | Web | Linux | 3 | Wikipedia website on Firefox Browser | web_linux_3m_wikipedia | 23.5
    Web Browsing | Web | Linux | 5 | Amazon website on Chrome Browser | web_linux_5m_amazon | 262.9
    Web Browsing | Web | Linux | 5 | BBC website on Firefox Browser | web_linux_5m_bbc | 55.7
    Web Browsing | Web | Linux | 5 | Google website on Firefox Browser | web_linux_5m_google | 22.6
    Web Browsing | Web | Linux | 5 | Linkedin website on Firefox Browser | web_linux_5m_linkedin | 39.8
    Web Browsing | Web | Windows | 3 | OWIN6G website on Firefox Browser | web_windows_3m_owin6g | 32.6
    Web Browsing | Web | Windows | 3 | Wikipedia website on Firefox Browser | web_windows_3m_wikipedia | 94.9
    Web Browsing | Web | Windows | 5 | Amazon website on Chrome Browser | web_windows_5m_amazon | 104.0
    Web Browsing | Web | Windows | 5 | BBC website on Firefox Browser | web_windows_5m_bbc | 23.1
    Web Browsing | Web | Windows | 5 | Google website on Firefox Browser | web_windows_5m_google | 31.5
    Web Browsing | Web | Windows | 5 | Linkedin website on Firefox Browser | web_windows_5m_linkedin | 104.1
    Web Streaming | Youtube | Linux | 20 | One Video Streaming, 4K | youtube_linux_20m_01 | 1,145.6
    Web Streaming | Youtube | Linux | 20 | One Video Streaming, FullHD | youtube_linux_20m_02 | 389.4
    Web Streaming | Youtube | Linux | 20 | One Video Streaming, FullHD | youtube_linux_20m_03 | 2,007.1
    Web Streaming | Youtube | Linux | 20 | One Video Streaming, 4K | youtube_linux_20m_04 | 390.4
    Web Streaming | Youtube | Linux | 20 | One Video Streaming, FullHD | youtube_linux_20m_05 | 410.1
    Web
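    The filenames in the table appear to follow an app_platform_minutesm_suffix convention (inferred from the listing, not documented by the authors; for web-browsing captures the suffix is a site name rather than a run number). A sketch of parsing it:

    ```python
    import re

    # Pattern inferred from the filenames listed above; not an official spec.
    PATTERN = re.compile(
        r"^(?P<app>[a-z]+)_(?P<platform>[a-z]+)_(?P<minutes>\d+)m_(?P<run>\w+)$"
    )

    def parse_name(filename: str) -> dict:
        """Split a capture filename into app, platform, duration, and suffix."""
        m = PATTERN.match(filename)
        if m is None:
            raise ValueError(f"unexpected filename: {filename}")
        fields = m.groupdict()
        fields["minutes"] = int(fields["minutes"])
        return fields

    print(parse_name("netflix_linux_20m_03"))
    print(parse_name("web_windows_5m_amazon"))
    ```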

  17. High-Throughput Comp. Screening of MOFs

    • kaggle.com
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). High-Throughput Comp. Screening of MOFs [Dataset]. https://www.kaggle.com/datasets/thedevastator/high-throughput-comp-screening-of-mofs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    High-Throughput Comp. Screening of MOFs

    Open Metal Sites, Cavity Diameters and Free Paths

    By [source]

    About this dataset

    This dataset provides atomic coordinates for metal-organic frameworks (MOFs), enabling high-throughput computational screening of MOFs in a broad range of scenarios. The dataset is derived from the Cambridge Structural Database (CSD) and from sources across the internet, and offers an array of useful parameters such as accessible surface area (ASA), non-accessible surface area (NASA), largest cavity diameter (LCD), pore limiting diameter (PLD), and more. The results yielded by this dataset may prove very helpful in assessing the potential of MOFs as prospective materials for chemical separations, transformations, and functional nanoporous materials. This can bring improvements to many industries and help devise better products for consumers worldwide. If errors are found in this data, a feedback form is available to report your findings. We appreciate your interest in our project and hope you will make good use of this data!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This guide will introduce you to the CoRE MOF 2019 dataset and explain how to properly use it for high-throughput computational screenings. It will provide you with the necessary background information and knowledge for successful use of this dataset.

    The CoRE MOF 2019 Dataset contains atomic coordinates for metal-organic frameworks (MOFs) which can be used as inputs for simulation software packages, enabling high-throughput computational screening of these MOFs. This dataset is derived from both the Cambridge Structural Database (CSD) and World Wide Web sources, providing powerful data on which MOF systems are suitable for potential applications in chemical separations, transformations, and functional nanoporous materials.

    In order to make efficient use of this dataset, it is important to familiarize yourself with all available columns. The columns describe a given MOF system: LCD (largest cavity diameter), PLD (pore limiting diameter), LFPD (largest sphere along the free path), ASA (accessible surface area), NASA (non-accessible surface area), and void fraction (AV_VF). There is also useful metadata such as public availability status, CSD overlap references in the CoRE or CCDC databases, and DOI details where available. For a full list of these features, refer to the documentation or codebook on the Kaggle website.

    Once you are familiar with the column specifications, download the database file from the Kaggle servers. The file can be opened in MS Excel/CSV format, where each row represents a single MOF and each column holds the corresponding parameter value (integer, float, or boolean). Because the structural descriptors are already computed, two framework systems can be compared directly on these values without any preprocessing algorithm or manual calculation. For most analyses it is better to work over the entire table at once rather than looping through individual rows, even though row-by-row comparison may appear simpler at first.
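    A minimal sketch of the kind of whole-table filtering described above, using a tiny made-up CSV with the column names from the documentation (LCD, PLD, ASA); the MOF names and values are invented stand-ins, not rows from the real CoRE MOF 2019 file:

    ```python
    import csv
    import io

    # Invented sample rows with the documented column names.
    data = io.StringIO("""\
    name,LCD,PLD,ASA
    MOF-5,15.0,7.8,3500
    HKUST-1,13.2,6.9,2100
    ZIF-8,11.6,3.4,1300
    """)

    reader = csv.DictReader(data)
    # Keep MOFs whose pore-limiting diameter admits a probe of >= 4 angstroms.
    candidates = [row["name"] for row in reader if float(row["PLD"]) >= 4.0]
    print(candidates)
    ```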

    Research Ideas

    • Create an open source library of automated SIM simulations for MOFs, which can be used to generate results quickly and accurately.
    • Update the existing Porous Materials Database (PMD) software with additional data fields that leverage insights from this dataset, allowing users to easily search and filter MOFs by specific structural characteristics.
    • Develop a web-based interface that allows researchers to visualize different MOF structures using realistic 3D images derived from the atomic data provided in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    ...

  18. Data from: A dataset of late 1990s and early 2000s web banner ads on...

    • zenodo.org
    • data.niaid.nih.gov
    json
    Updated Nov 24, 2023
    Cite
    Richard Lewei Huang; Richard Lewei Huang; Yufeng Zhao; Yufeng Zhao (2023). A dataset of late 1990s and early 2000s web banner ads on Chinese- and English-language web pages [Dataset]. http://doi.org/10.5281/zenodo.8408539
    Explore at:
    json (available download formats)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard Lewei Huang; Richard Lewei Huang; Yufeng Zhao; Yufeng Zhao
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information about 22,915 unique banner ad images appearing on Chinese- and English-language web pages in the late 1990s and early 2000s. The dataset is mined from 1,384,355 archived web page snapshots downloaded from the Wayback Machine, representing 77,747 unique HTTP URLs. The URLs are collected from six printed Internet directory books published in mainland China and the United States between 1999 and 2001, as part of a larger research project on Chinese-language web archiving.


    For each banner ad image, the dataset provides standard image metadata such as file format and dimension. The dataset also provides the original URLs of the web pages where the banner ad image was found, timestamps of the archived web page snapshots containing the image, archived URLs of the image file, and, if available, archived URLs of web pages to which the ad image is linked. Additionally, the dataset provides text data obtained from the banner ad images using optical character recognition (OCR). We expect the dataset to be useful for researchers across a variety of disciplines and fields such as visual culture, history, media studies, and business and marketing.
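    Wayback Machine snapshot timestamps conventionally use a 14-digit YYYYMMDDhhmmss form; assuming the dataset stores its snapshot timestamps the same way (an assumption, not stated in the description), they parse directly with the standard library:

    ```python
    from datetime import datetime

    def parse_snapshot(ts: str) -> datetime:
        """Parse a 14-digit Wayback-style snapshot timestamp."""
        return datetime.strptime(ts, "%Y%m%d%H%M%S")

    # Example timestamp (invented) from the era the dataset covers.
    print(parse_snapshot("20010315120000"))
    ```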

  19. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Explore at:
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is encoded in the filename. There is also a file (mapping.csv) that maps the host's IP address to the csv/pcap filename and the activity label.

    Activities:

    • Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.
    • Bulk data transfer: applications that transfer large-volume files over the network, such as SCP/FTP and direct downloads of large files from web servers like Mediafire, Dropbox, or the university repository.
    • Web browsing: all traffic generated while searching and consuming different web pages, such as several blogs, news sites, and the university's Moodle.
    • Video playback: traffic from applications that consume video via streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university's online classroom was also used.
    • Idle behaviour: background traffic generated by the user's computer while the user is idle. This traffic was captured with every application closed and with some pages open (e.g., Google Docs, YouTube, and several other web pages), but always without user interaction.

    The capture is performed by a network probe attached, via a SPAN port, to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
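    A sketch of reading such per-packet csv rows; the header names and sample rows below are invented to match the field order described, not taken from the actual files:

    ```python
    import csv
    import io

    # Invented sample in the documented field order (one line per packet).
    sample = io.StringIO("""\
    timestamp,protocol,payload_size,ip_src,ip_dst,port_src,port_dst
    1582188000.000100,TCP,1448,192.168.1.10,151.101.1.140,51234,443
    1582188000.000900,UDP,512,192.168.1.10,8.8.8.8,40000,53
    """)

    # Sum payload bytes over the trace, a typical first aggregation step.
    total_bytes = 0
    for row in csv.DictReader(sample):
        total_bytes += int(row["payload_size"])
    print(total_bytes)
    ```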

    The amount of data is stated as follows:

    • Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    • Video: 23 traces, 4496 s, 1405 MBytes
    • Web: 23 traces, 4203 s, 148 MBytes
    • Interactive: 42 traces, 8934 s, 30.5 MBytes
    • Idle: 52 traces, 6341 s, 0.69 MBytes
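    The totals above also give the average capture rate per activity, which makes the contrast between Bulk and Idle traffic explicit:

    ```python
    # (seconds of capture, MBytes of pcap) per activity, from the totals above.
    traces = {
        "Bulk":        (3599, 8704.0),
        "Video":       (4496, 1405.0),
        "Web":         (4203, 148.0),
        "Interactive": (8934, 30.5),
        "Idle":        (6341, 0.69),
    }

    # Average rate in MBytes per second of capture.
    rates = {k: mb / sec for k, (sec, mb) in traces.items()}
    print({k: round(v, 4) for k, v in rates.items()})
    ```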

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  20. FacetE

    • kaggle.com
    Updated Mar 4, 2020
    Cite
    Michael Günther (2020). FacetE [Dataset]. http://doi.org/10.34740/kaggle/ds/540160
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Michael Günther
    Description

    Context

    The purpose of this dataset is to provide data for more comprehensive and flexible opportunities for word embedding evaluation. It is divided into 8 categories containing 250 facets with over 600 thousand word pairs. More information can be found in the paper "FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation" (https://dl.acm.org/doi/pdf/10.1145/3395032.3395325).

    Content

    The data is extracted from the Dresden Web Table Corpus (DWTC). It contains relations that frequently occur in web tables.

    License

    This dataset is extracted from the DWTC, which in turn derives from the Common Crawl corpus. It is provided in accordance with the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus.
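    Word-pair evaluation of embeddings ultimately reduces to similarity comparisons; below is a toy stdlib sketch with invented 3-dimensional vectors (not FacetE data or its actual scoring protocol):

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Invented toy embeddings; real word vectors are hundreds of dimensions.
    emb = {
        "paris":  [0.9, 0.1, 0.3],
        "france": [0.8, 0.2, 0.4],
        "banana": [0.1, 0.9, 0.0],
    }

    # A related pair should score higher than an unrelated one.
    print(cosine(emb["paris"], emb["france"]) > cosine(emb["paris"], emb["banana"]))
    ```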
