69 datasets found
  1. Number of internet users worldwide 2014-2029

    • statista.com
    • tokrwards.com
    Updated Apr 11, 2025
    Cite
    Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    World
    Description

    The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach seven billion, and therefore a new peak, in 2029. Notably, the number of internet users has been increasing continuously over the past years. Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of internet users in regions such as the Americas and Asia.

  2. Internet and Computer use, London - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Jun 9, 2025
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2025). Internet and Computer use, London - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/internet-and-computer-use-london
    Explore at:
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Area covered
    London
    Description

    Statistics of how many adults access the internet and use different types of technology, covering:

    • home internet access
    • how people connect to the web
    • how often people use the web/computers
    • whether people use mobile devices
    • whether people buy goods over the web
    • whether people carried out specified activities over the internet

    For more information see the ONS website and the UKDS website.

  3. Daily time spent online by users worldwide Q3 2024

    • statista.com
    • tokrwards.com
    Updated Apr 11, 2025
    Cite
    Ani Petrosyan (2025). Daily time spent online by users worldwide Q3 2024 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Ani Petrosyan
    Description

    As of the third quarter of 2024, internet users spent six hours and 38 minutes online daily. This is a slight increase in comparison to the previous quarter. Overall, between the third quarter of 2015 and the third quarter of 2024, the average daily internet use has increased by 19 minutes.

    Most online countries
    Internet users between 16 and 64 years old in South Africa spent the longest time online daily, nine hours and 27 minutes, followed by Brazil and the Philippines. These figures include the time spent using the internet on any device. In Japan, internet users spent around three hours and 57 minutes online per day. Users in Denmark also spent relatively less time on the internet, reaching about five hours daily.

    Most common online activities
    According to a 2024 survey, more than six in 10 people worldwide used the internet to find information. Furthermore, the usage of communication platforms was also a common reason for going online, followed by online content consumption, such as watching videos, TV shows, or movies.

  4. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    bin (available download formats)
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
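
    The cleaning steps above can be illustrated with a short sketch. This is not the authors' code; the regular expression and the SHA-256 call are illustrative assumptions about how the described extraction and anonymisation might look in Python.

    ```python
    import hashlib
    import re

    # Illustrative pattern: match links to any Wikipedia language subdomain.
    # The dataset's actual extraction logic is more extensive than this.
    WIKI_URL_RE = re.compile(
        r"https?://[a-z\-]+\.(?:m\.)?wikipedia\.org/wiki/[^\s)\]>\"']+",
        re.IGNORECASE,
    )

    def extract_wikipedia_urls(text: str) -> list[str]:
        """Return all Wikipedia URLs found in a Reddit post or comment body."""
        return [m.group(0) for m in WIKI_URL_RE.finditer(text)]

    def anonymize_id(reddit_id: str) -> str:
        """Hash a Reddit post/comment/user/subreddit ID with SHA-256."""
        return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

    comment = "Source: https://en.wikipedia.org/wiki/Measles (see the talk page too)"
    print(extract_wikipedia_urls(comment))
    print(anonymize_id("t1_abc123"))
    ```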

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost.
    post_id | TEXT | Unique identifier for the Reddit post.
    created_at | TIMESTAMP | The timestamp when the post was created.
    updated_at | TIMESTAMP | The timestamp when the post was last updated.
    language_code | TEXT | The language code of the post.
    score | INTEGER | The score (upvotes minus downvotes) of the post.
    upvote_ratio | REAL | The ratio of upvotes to total votes.
    gildings | INTEGER | Number of awards (gildings) received by the post.
    num_comments | INTEGER | Number of comments on the post.

    Table: comments

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    post_id | TEXT | The ID of the Reddit post the comment belongs to.
    parent_id | TEXT | The ID of the parent comment (if a reply).
    comment_id | TEXT | Unique identifier for the comment.
    created_at | TIMESTAMP | The timestamp when the comment was created.
    last_modified_at | TIMESTAMP | The timestamp when the comment was last modified.
    score | INTEGER | The score (upvotes minus downvotes) of the comment.
    upvote_ratio | REAL | The ratio of upvotes to total votes for the comment.
    gilded | INTEGER | Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name | Type | Description
    post_id | TEXT | Unique identifier for the Reddit post.
    end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the Reddit post.
    final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final URL.
    final_url | TEXT | The final URL after redirections.
    redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0).
    in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    Column Name | Type | Description
    comment_id | TEXT | Unique identifier for the Reddit comment.
    end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the comment.
    final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final URL.
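
    Assuming the published SQL database is (or has been imported into) SQLite, a query joining the schema above might look like the sketch below; the file name is hypothetical.

    ```python
    import sqlite3

    # Hypothetical file name; point this at the database downloaded from Zenodo.
    conn = sqlite3.connect("wikireddit.db")

    # Most frequently linked final Wikipedia URLs that appeared in post titles.
    query = """
    SELECT pl.final_url, COUNT(*) AS n_links
    FROM postlinks AS pl
    JOIN posts AS p ON p.post_id = pl.post_id
    WHERE pl.in_title = 1 AND pl.final_valid = 1
    GROUP BY pl.final_url
    ORDER BY n_links DESC
    LIMIT 10;
    """

    for url, n_links in conn.execute(query):
        print(n_links, url)

    conn.close()
    ```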

  5. Data from: The National Stream Internet project

    • agdatacommons.nal.usda.gov
    bin
    Updated Dec 18, 2023
    Cite
    Dan Isaak; Erin Peterson; Jay Ver Hoef; David Nagel (2023). The National Stream Internet project [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/The_National_Stream_Internet_project/24853041
    Explore at:
    bin (available download formats)
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    Dan Isaak; Erin Peterson; Jay Ver Hoef; David Nagel
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The rate at which new information about stream resources is being created has accelerated with the recent development of spatial stream-network models (SSNMs), the growing availability of stream databases, and ongoing advances in geospatial science and computational efficiency. To further enhance information development, the National Stream Internet (NSI) project was developed as a means of providing a consistent, flexible analytical infrastructure that can be applied with many types of stream data anywhere in the country. A key part of that infrastructure is the NSI network, a digital GIS layer which has a specific topological structure that was designed to work effectively with SSNMs. The NSI network was derived from the National Hydrography Dataset Plus, Version 2 (NHDPlusV2) following technical procedures that ensure compatibility with SSNMs. The SSN models outperform traditional statistical techniques applied to stream data, enable predictions at unsampled locations to create status maps for river networks, and work particularly well with databases aggregated from multiple sources that contain clustered sampling locations. The NSI project is funded by the U.S. Fish & Wildlife Service's Landscape Conservation Cooperative program and has two simple objectives: 1) refine key spatial and statistical stream software and digital databases for compatibility so that a nationally consistent analytical infrastructure exists and is easy to apply; and 2) engage a grassroots user-base in application of this infrastructure so they are empowered to create new and valuable information from stream databases anywhere in the country. This website is a hub designed to connect users with software, data, and tools for creating that information. As better information is developed, it should enable stronger science, management, and conservation as pertains to stream ecosystems.

    Resources in this dataset:
    Resource Title: Website Pointer to the National Stream Internet. File Name: Web Page, url: https://www.fs.fed.us/rm/boise/AWAE/projects/NationalStreamInternet.html
    The National Stream Internet (NSI) is a network of people, data, and analytical techniques that interact synergistically to create information about streams. Elements and tools composing the NSI, including STARS, NHDPlusV2, and SSNs, enable integration of existing databases (e.g., water quality parameters, biological surveys, habitat condition) and development of new information using sophisticated spatial-statistical network models (SSNMs). The NSI provides a nationally consistent framework for analysis of stream data that can greatly improve the accuracy of status and trend assessments. The NSI project is described, together with an analytical infrastructure for using the spatial statistical network models with many types of stream datasets.

  6. Job Offers Web Scraping Search

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Job Offers Web Scraping Search [Dataset]. https://www.kaggle.com/datasets/thedevastator/job-offers-web-scraping-search
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Offers Web Scraping Search

    Targeted Results to Find the Optimal Work Solution

    By [source]

    About this dataset

    This dataset collects job offers from web scraping, filtered according to specific keywords, locations and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match their personal situation, skill set and preferences in terms of location and schedule. The columns provide detailed information on job titles, employer names, locations and time frames, as well as other necessary parameters, so you can make a smart choice for your next career opportunity.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:

    • Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.

    • Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!

    • Finally, consider when the position is available – the Time frame column indicates when each posting was made, and whether it is a full-time, part-time or casual/temporary position, so make sure it meets your requirements before applying!

    • Additionally, if details such as hours per week or further schedule information are important criteria, there is also information provided under the Horari and Temps_Oferta columns. Once all three criteria have been ticked off - keywords, location and time frame - take a look at the Empresa (company name) and Nom_Oferta (offer name) columns to get an idea of who will be employing you should you land the gig!

      All these pieces of data put together should give any motivated individual all they need in order to seek out an optimal work solution - keep hunting good luck!

    Research Ideas

    • Machine learning can be used to group job offers in order to facilitate the identification of similarities and differences between them. This could allow users to target their search for a work solution more specifically.
    • The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better-informed decisions in terms of their career options and goals.
    • It may also provide insight into the local job market, enabling companies and employers to identify where there is potential for new opportunities or possible trends that may previously have gone unnoticed.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: web_scraping_information_offers.csv

    Column name | Description
    Nom_Oferta | Name of the job offer. (String)
    Empresa | Company offering the job. (String)
    Ubicació | Location of the job offer. (String)
    Temps_Oferta | Time of the job offer. (String)
    Horari | Schedule of the job offer. (String)
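
    A minimal sketch of loading the file with pandas, using the file and column names listed above (check the actual CSV header, since the accented column name may be encoded differently in the file):

    ```python
    import pandas as pd

    # File and column names are taken from the dataset description above.
    offers = pd.read_csv("web_scraping_information_offers.csv")

    # Example: offers whose title mentions a keyword, restricted to one location.
    mask = (
        offers["Nom_Oferta"].str.contains("python", case=False, na=False)
        & offers["Ubicació"].str.contains("Barcelona", case=False, na=False)
    )
    print(offers.loc[mask, ["Nom_Oferta", "Empresa", "Horari"]].head())
    ```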

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

  7. Data from: Penalized and Constrained Optimization: An Application to...

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Gareth M. James; Courtney Paulson; Paat Rusmevichientong (2023). Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising [Dataset]. http://doi.org/10.6084/m9.figshare.8023382.v3
    Explore at:
    docx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Gareth M. James; Courtney Paulson; Paat Rusmevichientong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Firms are increasingly transitioning advertising budgets to Internet display campaigns, but this transition poses new challenges. These campaigns use numerous potential metrics for success (e.g., reach or click rate), and because each website represents a separate advertising opportunity, this is also an inherently high-dimensional problem. Further, advertisers often have constraints they wish to place on their campaign, such as targeting specific sub-populations or websites. These challenges require a method flexible enough to accommodate thousands of websites, as well as numerous metrics and campaign constraints. Motivated by this application, we consider the general constrained high-dimensional problem, where the parameters satisfy linear constraints. We develop the Penalized and Constrained optimization method (PaC) to compute the solution path for high-dimensional, linearly constrained criteria. PaC is extremely general; in addition to internet advertising, we show it encompasses many other potential applications, such as portfolio estimation, monotone curve estimation, and the generalized lasso. Computing the PaC coefficient path poses technical challenges, but we develop an efficient algorithm over a grid of tuning parameters. Through extensive simulations, we show PaC performs well. Finally, we apply PaC to a proprietary dataset in an exemplar Internet advertising case study and demonstrate its superiority over existing methods in this practical setting. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
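
    The abstract does not spell out the criterion, but the generic lasso-with-linear-constraints problem it refers to is commonly written as follows (the paper's own notation and constraint form may differ):

    ```latex
    \min_{\beta \in \mathbb{R}^p} \; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
    \quad \text{subject to} \quad C\beta \le b
    ```

    Here each coordinate of beta corresponds to a candidate website, the l1 penalty induces sparsity across thousands of websites, and the linear constraints encode campaign requirements such as budget caps or targeting of specific sub-populations.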

  8. E-commerce - Users of a French C2C fashion store

    • kaggle.com
    Updated Feb 24, 2024
    Cite
    Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Jeffrey Mvutu Mabilama
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    Foreword

    This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).

    My Telegram bot will answer your queries and allow you to contact me.

    Context

    There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

    Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other user's articles).

    This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try to understand what you can expect of your users and determine in advance how your growth may look.

    • For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

    If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to better suit your needs.

    This dataset is part of a preview of a much larger dataset. Please contact me for more.

    Content

    The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

    Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Questions you might want to answer using this dataset:

    • Are e-commerce users interested in social network features?
    • Are my users active enough (compared to those of this dataset)?
    • How likely are people from other countries to sign up on a C2C website?
    • How many users are likely to drop off after years of using my service?

    Example works:

    • Report(s) made using SQL queries can be found on the data.world page of the dataset.
    • Notebooks may be found on the Kaggle page of the dataset.

    License

    CC-BY-NC-SA 4.0

    For other licensing options, contact me.

  9. Cybersecurity: Suspicious Web Threat Interactions

    • kaggle.com
    Updated Apr 27, 2024
    Cite
    JanCSG (2024). Cybersecurity: Suspicious Web Threat Interactions [Dataset]. https://www.kaggle.com/datasets/jancsg/cybersecurity-suspicious-web-threat-interactions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JanCSG
    License

    GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This dataset contains web traffic records collected through AWS CloudWatch, aimed at detecting suspicious activities and potential attack attempts.

    The data were generated by monitoring traffic to a production web server, using various detection rules to identify anomalous patterns.

    Context

    In today's cloud environments, cybersecurity is more crucial than ever. The ability to detect and respond to threats in real time can protect organizations from significant consequences. This dataset provides a view of web traffic that has been labeled as suspicious, offering a valuable resource for developers, data scientists, and security experts to enhance threat detection techniques.

    Dataset Content

    Each entry in the dataset represents a stream of traffic to a web server, including the following columns:

    bytes_in: Bytes received by the server.

    bytes_out: Bytes sent from the server.

    creation_time: Timestamp of when the record was created.

    end_time: Timestamp of when the connection ended.

    src_ip: Source IP address.

    src_ip_country_code: Country code of the source IP.

    protocol: Protocol used in the connection.

    response.code: HTTP response code.

    dst_port: Destination port on the server.

    dst_ip: Destination IP address.

    rule_names: Name of the rule that identified the traffic as suspicious.

    observation_name: Observations associated with the traffic.

    source.meta: Metadata related to the source.

    source.name: Name of the traffic source.

    time: Timestamp of the detected event.

    detection_types: Type of detection applied.

    Potential Uses

    This dataset is ideal for:

    • Anomaly Detection: Developing models to detect unusual behaviors in web traffic.
    • Classification Models: Training models to automatically classify traffic as normal or suspicious.
    • Security Analysis: Conducting security analyses to understand the tactics, techniques, and procedures of attackers.
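
    A minimal sketch of the anomaly-detection use case, assuming the records are provided as a CSV with the columns listed above (the file name here is hypothetical):

    ```python
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Hypothetical file name; column names follow the dataset description above.
    df = pd.read_csv("suspicious_web_threat_interactions.csv")

    # Simple numeric features derived from each traffic record.
    df["creation_time"] = pd.to_datetime(df["creation_time"])
    df["end_time"] = pd.to_datetime(df["end_time"])
    df["duration_s"] = (df["end_time"] - df["creation_time"]).dt.total_seconds()

    features = df[["bytes_in", "bytes_out", "duration_s", "dst_port"]].fillna(0)

    # Unsupervised anomaly detection: flag the most unusual flows.
    model = IsolationForest(contamination=0.05, random_state=0)
    df["anomaly"] = model.fit_predict(features)  # -1 = anomalous, 1 = normal

    print(df.loc[df["anomaly"] == -1, ["src_ip", "dst_port", "rule_names"]].head())
    ```
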
  10. Complete Domain Whois dataset (all zones)

    • datarade.ai
    .json, .csv
    Updated Dec 16, 2022
    + more versions
    Cite
    Netlas.io (2022). Complete Domain Whois dataset (all zones) [Dataset]. https://datarade.ai/data-products/complete-domain-whois-dataset-all-zones-netlas-io
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    Netlas.io
    Area covered
    Spain, Fiji, Lebanon, Mauritius, Cabo Verde, Timor-Leste, Armenia, Latvia, Slovenia, Guadeloupe
    Description

    Netlas.io is a set of internet intelligence apps that provide accurate technical information on IP addresses, domain names, websites, web applications, IoT devices, and other online assets.

    Netlas.io maintains five general data collections: Responses (internet scan data), DNS Registry data, IP Whois data, Domain Whois data, SSL Certificates.

    This dataset contains Domain WHOIS data. It covers active domains only, including just-registered, published and parked domains, domains in the redemption grace period (waiting for renewal), and domains pending delete. This dataset doesn't include any historical records.

  11. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    New York, Staten Island
    Description

    The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

    For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

    For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

    References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

    This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
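
    Dataset 2 is described as a standard edge array (node1, node2, edge attribute), so it can be loaded directly into a network object. A minimal sketch, with a hypothetical file name and the assumption that the file has no header row:

    ```python
    import pandas as pd
    import networkx as nx

    # Hypothetical file name; see the dataset's README for the actual edge-array layout.
    edges = pd.read_csv("stewmap_hyperlink_edges.csv",
                        names=["node1", "node2", "weight"])

    G = nx.from_pandas_edgelist(edges, source="node1", target="node2",
                                edge_attr="weight", create_using=nx.DiGraph)

    print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "hyperlink edges")

    # Example analysis: which websites receive the most inbound hyperlinks?
    top_in = sorted(G.in_degree(), key=lambda kv: kv[1], reverse=True)[:5]
    print(top_in)
    ```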

  12. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • search.dataone.org
    Updated Sep 24, 2024
    + more versions
    Cite
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles [Dataset]. http://doi.org/10.7910/DVN/QTJ9HC
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Thakur, Nirmalya; Su, Vanessa; Shao, Mingchen; Patel, Kesha A.; Jeong, Hongseok; Knieling, Victoria; Bian, Andrew
    Time period covered
    Jan 1, 2024 - May 31, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset: N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian, "A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles," arXiv [cs.CY], 2024. Available: http://arxiv.org/abs/2406.07693

    Abstract
    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes, i.e. positive, negative, or neutral, (ii) one of the subjectivity classes, i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes, i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field, as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
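
    The sentiment and subjectivity attributes described above are based on VADER and TextBlob; below is a minimal sketch of scoring a single video title with those libraries (the classification thresholds are the libraries' common conventions, not necessarily the exact ones used by the authors):

    ```python
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob

    analyzer = SentimentIntensityAnalyzer()

    def vader_sentiment(text: str) -> str:
        """Classify text as positive/negative/neutral from VADER's compound score."""
        compound = analyzer.polarity_scores(text)["compound"]
        if compound >= 0.05:
            return "positive"
        if compound <= -0.05:
            return "negative"
        return "neutral"

    def textblob_subjectivity(text: str) -> float:
        """Return TextBlob's subjectivity score in [0, 1]."""
        return TextBlob(text).sentiment.subjectivity

    title = "Health officials warn of rising measles cases in 2024"
    print(vader_sentiment(title), textblob_subjectivity(title))
    ```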

  13. Newcastle Libraries Internet filtering - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Sep 20, 2023
    Cite
    ckan.publishing.service.gov.uk (2023). Newcastle Libraries Internet filtering - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/newcastle-libraries-internet-filtering1
    Explore at:
    Dataset updated
    Sep 20, 2023
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    Newcastle Libraries provide citizens with access to the Internet. On public computers the Internet is filtered, which means access to some websites is blocked, either because of legal requirements, concerns over the security of the Council's IT systems, Council policy, or to safeguard vulnerable individuals. This data set lists the categories of websites blocked as of a certain date. Additional information: the data set lists the title of the category, the title of the sub-category and a description for each sub-category.

  14. Data from: SQL Injection Attack Netflow

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +2more
    Updated Sep 28, 2022
    + more versions
    Cite
    Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
    Explore at:
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Ignacio Crespo
    Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    These datasets contain SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.

    NetFlow traffic has been generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

    Datasets

    The first dataset was collected to train the detection models (D1), and the other was collected using different attacks than those used in training, in order to test the models and ensure their generalization (D2).

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

    Dataset | Aim | Samples | Benign-malicious traffic ratio
    D1 | Training | 400,003 | 50%
    D2 | Test | 57,239 | 50%

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

    DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds, or after it has been active for 1800 seconds (30 minutes).

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends the traffic to a NetFlow data generation node (this process is carried out similarly for packets received from the Internet).

    The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed on 16 nodes, launching SQLMAP with the parameters shown in the following table.

    Parameters | Description
    '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns
    --level=5 | Increase the probability of a false positive identification
    --risk=3 | Increase the probability of extracting data
    --random-agent | Select the User-Agent randomly
    --batch | Never ask for user input, use the default behavior
    --answers="follow=Y" | Predefined answers to yes

    Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).

    The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

    However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.

    To run the MySQL server we ran MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.

  15. Phishing Websites Detection

    • kaggle.com
    zip
    Updated May 28, 2020
    Cite
    J Akshaya (2020). Phishing Websites Detection [Dataset]. https://www.kaggle.com/akshaya1508/phishing-websites-detection
    Explore at:
    zip (80950 bytes; available download formats)
    Dataset updated
    May 28, 2020
    Authors
    J Akshaya
    Description

    Context

    Phishing is a form of identity theft that occurs when a malicious website impersonates a legitimate one in order to acquire sensitive information such as passwords, account details, or credit card numbers. People generally tend to fall prey to this very easily; credit the commendable craftsmanship of the attackers, which makes people believe they are on a legitimate website. There is a need to identify potential phishing websites and differentiate them from legitimate ones. This dataset identifies the prominent features of phishing websites; 10 such features have been identified.
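
    For illustration, phishing detectors often rely on simple lexical features of the URL itself. The features below are generic examples, not necessarily the 10 features used in this dataset:

    ```python
    from urllib.parse import urlparse

    def url_features(url: str) -> dict:
        """Compute a few lexical features commonly used in phishing detection.

        These are illustrative examples, not the dataset's actual feature set.
        """
        parsed = urlparse(url)
        host = parsed.netloc
        return {
            "url_length": len(url),
            "uses_https": parsed.scheme == "https",
            "has_at_symbol": "@" in url,
            "num_dots_in_host": host.count("."),
            "host_is_ip": host.replace(".", "").isdigit(),
            "has_hyphen_in_host": "-" in host,
        }

    print(url_features("http://paypal.com.secure-login.example.com/update"))
    ```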

    Content

    Generally, the open-source datasets available on the internet do not come with the code and the logic behind them, which raises certain problems, i.e.:

    1. Limited Data: The ML algorithms can only be tested with the existing phishing URLs, and no new phishing URLs can be checked for validity.
    2. Outdated URLs: The datasets available on the internet were uploaded a long time ago, while new kinds of phishing URLs arise every second.
    3. Outdated Features: The datasets available on the internet were uploaded a long time ago, while new methodologies keep arising in phishing techniques.
    4. No Access to Backend: There is no stepwise guide describing how the features have been derived.

    On the contrary, this dataset tries to overcome all the above-mentioned problems:

    1. Real-Time Data: Before applying a machine learning algorithm, we can run the script and fetch real-time URLs from PhishTank (for phishing URLs) and from Moz (for legitimate URLs).
    2. Scalable Data: We can also specify the number of URLs we want to feed the model, and the web scraper will fetch that amount of data from the websites. Presently this project uses 1401 URLs, i.e. 901 phishing URLs and 500 legitimate URLs.
    3. New Features: We have tried to implement the prominent new features present in current phishing URLs, and since we own the code, new features can also be added.
    4. Source Code on GitHub: The source code is published on GitHub for public use and can be used for further improvements. This way there is transparency in the logic, and more creators can add their meaningful additions to the code.

    Link to the source code

    https://github.com/akshaya1508/detection_of_phishing_websites.git

    Inspiration

    The idea to develop the dataset and the code for this dataset has been inspired by various other creators who have worked along similar lines.

  16. Long-Term Agricultural Research (LTAR) network - Meteorological Collection

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Long-Term Agricultural Research (LTAR) network - Meteorological Collection [Dataset]. https://catalog.data.gov/dataset/long-term-agricultural-research-ltar-network-meteorological-collection-7d719
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The LTAR network maintains stations for standard meteorological measurements including, generally, air temperature and humidity, shortwave (solar) irradiance, longwave (thermal) radiation, wind speed and direction, barometric pressure, and precipitation. Many sites also have extensive comparable legacy datasets. The LTAR scientific community decided that these needed to be made available to the public using a single web source in a consistent manner. To that purpose, each site sent data on a regular schedule, as frequently as hourly, to the National Agricultural Library, which has developed a web service to provide the data to the public in tabular or graphical form. This archive of the LTAR legacy database exports contains meteorological data through April 30, 2021. For current meteorological data, visit the GeoEvent Meteorology Resources page, which provides tools and dashboards to view and access data from the 18 LTAR sites across the United States.

    Resources in this dataset:
    Resource Title: Meteorological data. File Name: ltar_archive_DB.zip
    Resource Description: This is an export of the meteorological data collected by LTAR sites and ingested by the NAL LTAR application. This export consists of an SQL schema definition file for creating database tables and the data itself. The data is provided in two formats: SQL insert statements (.sql) and CSV files (.csv). Please use the format most convenient for you. Note that the SQL insert statements take much longer to run since each row is an individual insert.

    Description of zip files:
    The ltar_archive*.zip files contain database exports. The schema is a .sql file; the data is exported as both SQL inserts and CSV for convenience. There is a README in markdown and PDF in the zips.

    Database export of the schema and data for the site, site_station, and met tables as SQL insert statements:
    • ltar_archive_db_sql_export_20201231.zip --> has data until 2020-12-31
    • ltar_archive_db_sql_export_20210430.zip --> has data until 2021-04-30

    Database export of the schema and data for the site, site_station, and met tables as CSV:
    • ltar_archive_db_csv_export_20201231.zip --> has data until 2020-12-31
    • ltar_archive_db_csv_export_20210430.zip --> has data until 2021-04-30

    Raw CSV files that were sent to NAL from the LTAR sites/stations:
    • ltar_rawcsv_archive.zip --> has data until 2021-04-30

  17. Data from: A simple method for serving Web hypermaps with dynamic database...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). A simple method for serving Web hypermaps with dynamic database drill-down [Dataset]. https://catalog.data.gov/dataset/a-simple-method-for-serving-web-hypermaps-with-dynamic-database-drill-down
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background
    HealthCyberMap aims at mapping parts of health information cyberspace in novel ways to deliver a semantically superior user experience. This is achieved through "intelligent" categorisation and interactive hypermedia visualisation of health resources using metadata, clinical codes and GIS. HealthCyberMap is an ArcView 3.1 project. WebView, the Internet extension to ArcView, publishes HealthCyberMap ArcView Views as Web client-side imagemaps. The basic WebView set-up does not support any GIS database connection, and published Web maps become disconnected from the original project. A dedicated Internet map server would be the best way to serve HealthCyberMap database-driven interactive Web maps, but is an expensive and complex solution to acquire, run and maintain. This paper describes HealthCyberMap's simple, low-cost method for "patching" WebView to serve hypermaps with dynamic database drill-down functionality on the Web.

    Results
    The proposed solution is currently used for publishing HealthCyberMap GIS-generated navigational information maps on the Web while maintaining their links with the underlying resource metadata base.

    Conclusion
    The authors believe their map serving approach as adopted in HealthCyberMap has been very successful, especially in cases when only map attribute data change without a corresponding effect on map appearance. It should also be possible to use the same solution to publish other interactive GIS-driven maps on the Web, e.g., maps of real-world health problems.

  18. open-web-math

    • huggingface.co
    + more versions
    Cite
    open-web-math, open-web-math [Dataset]. https://huggingface.co/datasets/open-web-math/open-web-math
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    open-web-math
    Description

    Keiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba. GitHub | ArXiv | PDF. OpenWebMath is a dataset containing the majority of the high-quality mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and finetuning large language models. You can download the dataset using Hugging Face: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-web-math/open-web-math.
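
    The loading example in the description is cut off; a minimal sketch using the Hugging Face datasets library (the repository ID is taken from the URL above, and streaming avoids downloading the full corpus at once):

    ```python
    from itertools import islice

    from datasets import load_dataset

    # Repository ID taken from the dataset URL above.
    ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

    for example in islice(ds, 3):
        # Each record holds the extracted document text plus metadata fields.
        print(example["text"][:200])
    ```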

  19. Number of global social network users 2017-2028

    • statista.com
    • grusthub.com
    • +4more
    Cite
    Stacy Jo Dixon, Number of global social network users 2017-2028 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    How many people use social media?

    Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.

    Who uses social media?
    Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as lesser developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media's global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.

    How much time do people spend on social media?
    Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.

    What are the most popular social media platforms?
    Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
    
  20. Network traffic for machine learning classification

    • data.mendeley.com
    Updated Feb 12, 2020
    + more versions
    Cite
    Víctor Labayen Guembe (2020). Network traffic for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.1
    Explore at:
    Dataset updated
    Feb 12, 2020
    Authors
    Víctor Labayen Guembe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

    Activities:

    • Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
    • Bulk data transfer: applications that perform a transfer of large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository, among others.
    • Web browsing: contains all the generated traffic while searching and consuming different web pages. Examples of those pages are several blogs and news sites, and the Moodle of the university.
    • Video playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
    • Idle behaviour: composed of the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like Google Docs, YouTube and several web pages, but always without user interaction.

    The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.
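
    A minimal sketch of reading one of the csv traces with pandas; the column names below are placeholders matching the field list above (the actual header row present in each file should be preferred), and the file name is hypothetical:

    ```python
    import pandas as pd

    # Placeholder column names matching the field list above; each csv already
    # includes a header row, so adjust names/header handling to the real files.
    cols = ["timestamp", "protocol", "payload_size",
            "ip_src", "ip_dst", "port_src", "port_dst"]

    trace = pd.read_csv("video_trace_01.csv", header=0, names=cols)  # hypothetical file

    # Per-protocol payload volume for this trace.
    print(trace.groupby("protocol")["payload_size"].sum())
    ```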

    The amount of data is stated as follows:

    Activity | Traces | Total duration | pcap size
    Bulk | 19 | 3599 s | 8704 MBytes
    Video | 23 | 4496 s | 1405 MBytes
    Web | 23 | 4203 s | 148 MBytes
    Interactive | 42 | 8934 s | 30.5 MBytes
    Idle | 52 | 6341 s | 0.69 MBytes
