64 datasets found

Amount of data created, consumed, and stored 2010-2023, with forecasts to...
statista.com
Updated Jun 30, 2025
Cite
Statista (2025). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
May 2024
Area covered
Worldwide
Description
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

Storage capacity also growing

Only a small percentage of this newly created data is kept though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
Data from: Internet users
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Apr 6, 2021
Cite
Office for National Statistics (2021). Internet users [Dataset]. https://www.ons.gov.uk/businessindustryandtrade/itandinternetindustry/datasets/internetusers
Available download formats: xlsx
Dataset updated
Apr 6, 2021
Dataset provided by
Office for National Statistics (http://www.ons.gov.uk/)
License
Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Description
Internet use in the UK annual estimates by age, sex, disability, ethnic group, economic activity and geographical location, including confidence intervals.
Africa - Population and Internet users statistics
kaggle.com
Updated Dec 17, 2020
Cite
Ishmeet singh (2020). Africa - Population and Internet users statistics [Dataset]. https://www.kaggle.com/datasets/ishmeet/africa-population-and-internet-users-statistics
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Dec 17, 2020
Dataset provided by
Kaggle
Authors
Ishmeet singh
License
http://opendatacommons.org/licenses/dbcl/1.0/
Area covered
Africa
Description
Context

Africa - Population and Internet users statistics

Acknowledgements

Source: https://data.humdata.org/dataset/africa-population-and-internet-users-statistics Last updated at https://data.humdata.org/organization/openafrica : 2019-09-11

Data from: WikiReddit: Tracing Information and Attention Flows Between...

zenodo.org

bin

Updated May 4, 2025

Cite

Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

May 4, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
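
The two preprocessing steps described above (regex extraction of Wikipedia URLs and SHA-256 hashing of IDs) can be illustrated with a minimal sketch. The pattern `WIKI_URL`, the function names, and the sample ID are hypothetical, and the hashing is assumed to be plain unsalted SHA-256; the authors' actual expressions and hashing details are not given in the description.

```python
import hashlib
import re

# Hypothetical pattern: http(s) links to any Wikipedia language subdomain,
# optionally the mobile ("m.") variant. It stops at whitespace, ")", "]" or ">",
# so titles that contain parentheses would be cut short by this simple version.
WIKI_URL = re.compile(
    r"https?://[a-z\-]+\.(?:m\.)?wikipedia\.org/wiki/[^\s)\]>]+",
    re.IGNORECASE,
)


def extract_wikipedia_urls(text: str) -> list[str]:
    """Return every Wikipedia article URL found in a post or comment body."""
    return [match.group(0) for match in WIKI_URL.finditer(text)]


def anonymize_id(raw_id: str) -> str:
    """Hash an identifier with SHA-256 (assumed unsalted) and return hex."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


sample_text = (
    "See https://en.wikipedia.org/wiki/Reddit and "
    "(https://de.m.wikipedia.org/wiki/Berlin) for details"
)
urls = extract_wikipedia_urls(sample_text)
hashed = anonymize_id("t3_abc123")  # made-up ID, for illustration only
```

Because the hash is deterministic, the same raw ID always maps to the same anonymized value, which is what allows joins across tables after anonymization.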

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).
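
As an illustration of how the `posts` and `postlinks` tables relate, the sketch below builds a tiny in-memory SQLite database with a subset of the documented columns and counts valid links per post language. The rows are made up, and the use of SQLite is an assumption; the description does not state which SQL engine the released database targets.

```python
import sqlite3

# In-memory toy database using a subset of the documented columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (post_id TEXT, subreddit_id TEXT,
                        language_code TEXT, score INTEGER);
    CREATE TABLE postlinks (post_id TEXT, final_url TEXT,
                            final_valid INTEGER, in_title INTEGER);
    """
)
# Fabricated sample rows, for illustration only.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [("p1", "s1", "en", 10), ("p2", "s1", "en", 3), ("p3", "s2", "de", 7)],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?)",
    [
        ("p1", "https://en.wikipedia.org/wiki/Reddit", 1, 0),
        ("p2", "https://en.wikipedia.org/wiki/Missing", 0, 1),
        ("p3", "https://de.wikipedia.org/wiki/Berlin", 1, 0),
    ],
)


def valid_links_per_language(connection: sqlite3.Connection) -> dict:
    """Count links whose final URL resolved, grouped by post language."""
    rows = connection.execute(
        """
        SELECT p.language_code, COUNT(*) AS n_links
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        GROUP BY p.language_code
        ORDER BY p.language_code
        """
    ).fetchall()
    return dict(rows)


counts = valid_links_per_language(conn)
```

The same join pattern would apply to `comments` and `commentlinks` via `comment_id`.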

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final

Attitudes towards the internet in Mexico 2025
statista.com
Updated Apr 11, 2025
Cite
Umair Bashir (2025). Attitudes towards the internet in Mexico 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place" as an answer. 56 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Anonymized Internet Traces 2016
catalog.caida.org
Cite
CAIDA, Anonymized Internet Traces 2016 [Dataset]. https://catalog.caida.org/dataset/passive_2016_pcap
Dataset authored and provided by
CAIDA
License
https://www.caida.org/about/legal/aua/
Time period covered
Jan 2016 - Dec 2016
Description
Packet headers (up to transport layer, inclusive) for Anonymized Internet Traces 2016 Dataset. Derived from OC192 traces on Equinix San Jose and Chicago monitors.
Cary Broadband Internet Access
catalog.data.gov
data.townofcary.org
+2 more
Updated Oct 19, 2024
Cite
U.S. Census Bureau (2024). Cary Broadband Internet Access [Dataset]. https://catalog.data.gov/dataset/cary-broadband-internet-access-american-community-survey
Dataset updated
Oct 19, 2024
Dataset provided by
United States Census Bureau (http://census.gov/)
Area covered
Cary
Description
As part of the What Works Cities criteria for Certification, we need to meet the industry standard of at least 75% of our households having subscriptions or access to high-speed broadband services.

Part of the American Community Survey (ACS) asks what level of internet access residents have. We use the 5-Year Estimates for a greater level of precision in our data, according to the table "Distinguishing features of ACS 1-year, 1-year supplemental, 3-year, and 5-year estimates".

We query attributes of the DP02 (Selected Social Characteristics in the United States) group of questions for the years available.

This dataset has been narrowed down to Cary township using the following geography codes supported for the ACS dataset:
state: 37
county: 183
county subdivision: 90536
Internet Traffic Data Set
kaggle.com
Updated May 10, 2023
Cite
Asfand Yar (2023). Internet Traffic Data Set [Dataset]. http://doi.org/10.34740/kaggle/dsv/5658579
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5658579
Dataset updated
May 10, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Asfand Yar
Description
This data set contains internet traffic data captured by an Internet Service Provider (ISP) using Mikrotik SDN Controller and packet sniffer tools. The data set includes traffic from over 2000 customers who use Fibre to the Home (FTTH) and Gpon internet connections. The data was collected over a period of several months and contains all traffic in its original format with headers and packets.

The data set contains information on inbound and outbound traffic, including web browsing, email, file transfers, and more. The data set can be used for research in areas such as network security, traffic analysis, and machine learning.

Data Collection Method: The data was captured using Mikrotik SDN Controller and packet sniffer tools. These tools capture traffic data by monitoring network traffic in real time. The data set contains all traffic data in its original format, including headers and packets.

Data Set Content: The data set is provided in CSV format and includes the following fields:

Timestamp: The date and time the traffic was captured
Source IP Address: The IP address of the device that sent the traffic
Destination IP Address: The IP address of the device that received the traffic
Protocol: The network protocol used for the traffic (e.g. TCP, UDP)
Source Port: The port used by the source device for the traffic
Destination Port: The port used by the destination device for the traffic
Packet Size: The size of the packet in bytes
Payload: The payload data of the packet

The data set contains a large volume of traffic data from over 2000 customers. The data is organized by timestamp and includes all traffic data in its original format, including headers and packets. The data set contains both inbound and outbound traffic, and covers various types of internet traffic, including web browsing, email, file transfers, and more.

The protocol is one of the listed protocols:

ipsec-ah - IPsec AH protocol
ipsec-esp - IPsec ESP protocol
ddp - datagram delivery protocol
egp - exterior gateway protocol
ggp - gateway-gateway protocol
gre - general routing encapsulation
hmp - host monitoring protocol
idpr-cmtp - idpr control message transport
icmp - internet control message protocol
icmpv6 - internet control message protocol v6
igmp - internet group management protocol
ipencap - ip encapsulated in ip
ipip - ip encapsulation
encap - ip encapsulation
iso-tp4 - iso transport protocol class 4
ospf - open shortest path first
pup - parc universal packet protocol
pim - protocol independent multicast
rspf - radio shortest path first
rdp - reliable datagram protocol
st - st datagram mode
tcp - transmission control protocol
udp - user datagram protocol
vmtp - versatile message transport
vrrp - virtual router redundancy protocol
xns-idp - xerox xns idp
xtp - xpress transfer protocol

MAC Protocol Examples

802.2 - 802.2 Frames (0x0004)
arp - Address Resolution Protocol (0x0806)
homeplug-av - HomePlug AV MME (0x88E1)
ip - Internet Protocol version 4 (0x0800)
ipv6 - Internet Protocol Version 6 (0x86DD)
ipx - Internetwork Packet Exchange (0x8137)
lldp - Link Layer Discovery Protocol (0x88CC)
loop-protect - Loop Protect Protocol (0x9003)
mpls-multicast - MPLS multicast (0x8848)
mpls-unicast - MPLS unicast (0x8847)
packing-compr - Encapsulated packets with compressed IP packing (0x9001)
packing-simple - Encapsulated packets with simple IP packing (0x9000)
pppoe - PPPoE Session Stage (0x8864)
pppoe-discovery - PPPoE Discovery Stage (0x8863)
rarp - Reverse Address Resolution Protocol (0x8035)
service-vlan - Provider Bridging (IEEE 802.1ad) & Shortest Path Bridging IEEE 802.1aq (0x88A8)
vlan - VLAN-tagged frame (IEEE 802.1Q) and Shortest Path Bridging IEEE 802.1aq with NNI compatibility (0x8100)
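
A minimal sketch of reading a CSV export with the fields listed in the description above and totalling packet bytes per protocol. It assumes the header row uses those field names verbatim, which the description does not confirm, and the sample rows (with private-range addresses) are fabricated for illustration.

```python
import csv
import io
from collections import Counter

# Fabricated sample; the real file's column names may differ.
SAMPLE = """Timestamp,Source IP Address,Destination IP Address,Protocol,Source Port,Destination Port,Packet Size
2023-05-01 10:00:00,10.0.0.1,10.0.0.2,tcp,51000,443,1500
2023-05-01 10:00:01,10.0.0.2,10.0.0.1,tcp,443,51000,60
2023-05-01 10:00:02,10.0.0.3,10.0.0.4,udp,5353,5353,120
"""


def bytes_per_protocol(csv_file) -> Counter:
    """Sum the Packet Size column for each value of the Protocol column."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["Protocol"].lower()] += int(row["Packet Size"])
    return totals


totals = bytes_per_protocol(io.StringIO(SAMPLE))
```

For a file on disk, the same function could be called with `open(path, newline="")` in place of the `io.StringIO` wrapper.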

Data Usage: The data set can be used for research in areas such as network security, traffic analysis, and machine learning. Researchers can use the data to develop new algorithms for detecting and preventing cyber attacks, analyzing internet traffic patterns, and more.

Data Availability: If you are interested in using this data set for research purposes, please contact us at asfandyar250@gmail.com for more information and references. The data set is available for download on Kaggle and can be accessed by researchers who have obtained permission from the ISP.

We hope this data set will be useful for researchers in the field of network security and traffic analysis. If you have any questions or need further information, please do not hesitate to contact us.

You can use Wireshark or other software to view the files.
Internet Verification File (IVF)
catalog.data.gov
data.amerigeoss.org
Updated Aug 11, 2025
+ more versions
Cite
Social Security Administration (2025). Internet Verification File (IVF) [Dataset]. https://catalog.data.gov/dataset/internet-verification-file-ivf
Dataset updated
Aug 11, 2025
Dataset provided by
Social Security Administration (http://ssa.gov/)
Description
Internal listing of current employees and authorized users who can access SSA applications.
Data from: #PraCegoVer dataset
data.niaid.nih.gov
Updated Jan 19, 2023
+ more versions
Cite
Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Dataset updated
Jan 19, 2023
Dataset provided by
Esther Luna Colombini
Gabriel Oliveira dos Santos
Sandra Avila
Description
Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
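
A minimal sketch of loading entries shaped like those in dataset.json and resolving each one to its image path. The sample records, the `images` directory name as a default argument, and the helper names are assumptions made for illustration; only the five attribute names come from the description.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"user", "filename", "raw_caption", "caption", "date"}

# Fabricated records that follow the documented attribute names.
SAMPLE_JSON = """
[
  {"user": "u_001", "filename": "img_0001.jpg",
   "raw_caption": "#PraCegoVer Foto de um gato.",
   "caption": "Foto de um gato.", "date": "2020-01-01"},
  {"user": "u_002", "filename": "img_0002.jpg",
   "raw_caption": "#PraCegoVer Duas pessoas sorrindo na praia.",
   "caption": "Duas pessoas sorrindo na praia.", "date": "2020-01-02"}
]
"""


def load_instances(json_text: str, images_dir: str = "images") -> list[dict]:
    """Parse the JSON list, check attributes, and attach an image path."""
    instances = json.loads(json_text)
    for item in instances:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"instance is missing attributes: {sorted(missing)}")
        item["image_path"] = (Path(images_dir) / item["filename"]).as_posix()
    return instances


def mean_caption_length(instances: list[dict]) -> float:
    """Average number of whitespace-separated words in the clean captions."""
    return sum(len(i["caption"].split()) for i in instances) / len(instances)


instances = load_instances(SAMPLE_JSON)
avg_words = mean_caption_length(instances)
```

With the real file, `json_text` would be the contents of dataset.json read from disk.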

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
Attitudes towards the internet in Australia 2025
statista.com
Updated Apr 11, 2025
Cite
Umair Bashir (2025). Attitudes towards the internet in Australia 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Dataset updated
Apr 11, 2025
Dataset provided by
Statista (http://statista.com/)
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Australian respondents pick "It is important to me to have mobile internet access in any place" as an answer. 55 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
National Broadband Data
open.canada.ca
gimi9.com
+1 more
csv, gpkg, kmz, shp +2
Updated Jun 24, 2025
+ more versions
Cite
Innovation, Science and Economic Development Canada (2025). National Broadband Data [Dataset]. https://open.canada.ca/data/en/dataset/00a331db-121b-445d-b119-35dbbe3eedd9
Available download formats: txt, kmz, tab, csv, gpkg, shp
Dataset updated
Jun 24, 2025
Dataset provided by
Innovation, Science and Economic Development Canada (http://www.ic.gc.ca/)
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Description
The National Broadband Data represents coverage information across Canada for existing broadband service providers with their associated technology types. The coverage information is aggregated and deployed over a grid of hexagons, which cover areas of roughly 25 square km each. Broadband Internet service availability is provided for download/upload speed markers (5/1, 10/2, 25/5 and 50/10 Mbps) where more than 75% of total dwellings covered within the hexagon have access to broadband service offerings meeting these markers. In order to improve the granularity of the broadband data, ISED and the CRTC are providing aggregated and anonymous broadband services data based on the pseudo-household statistical model, hence achieving higher precision in depicting the broadband Internet service availability. This information is available below under the "NBD PHH Speeds" resource. For more information on the pseudo-household statistical model, refer to the Pseudo-Household Demographic Distribution dataset. A representation of broadband services per 250m road segments is now available for download under the “NBD Roads” resource. To generate this dataset, the NBD PHH Speeds information was projected over the nearest road arc from Statistics Canada’s Road Network File, and those roads were spliced in approximately 250m segments. NEW: The data has been augmented to include new presentation layers as published on the National Broadband Map.
Broadband Data by Town - 2023
broadbandmaps.ct.gov
data.ct.gov
+4 more
Updated Nov 25, 2023
+ more versions
Cite
State of Connecticut (2023). Broadband Data by Town - 2023 [Dataset]. https://broadbandmaps.ct.gov/datasets/broadband-data-by-town-2023
Dataset updated
Nov 25, 2023
Dataset authored and provided by
State of Connecticut
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This feature layer includes all OPM collected data at the town level.

The Connecticut Broadband Availability and Adoption Maps were created to help citizens and policymakers understand the strengths and weaknesses of broadband infrastructure in the state. Data is aggregated to the block, tract, and town (county subdivision) levels and includes counts of locations classified as unserved, underserved, and served as well as whether they meet the state goal of 1000Mbps/100Mbps. This application splits its visualizations into block, tract, and town layers for both unserved locations and progress to the state goal.

This map uses OPM collected availability and adoption data.

As of 2023, OPM collected availability data was submitted by internet service providers pursuant to PA 21-159 and processed by the GIS Office in the Office of Policy and Management, cleaned, and matched to the CostQuest location fabric.

Metadata:

All feature layers, maps, and datasets, including OPM's internal broadband availability data, follow the same basic schema, with additional fields added in some cases for convenience.

Fields named no service, unserved, underserved, served, and GigC are counts of locations where a particular level of broadband service is provided. No service locations are those where there is no reported service at all. Unserved locations are locations where there is a provider offering wireline service, but not at or above 25 Mbps download and 3 Mbps upload. Underserved locations are locations where at least one provider offers wireline service of 25 Mbps download and 3 Mbps upload, but there is no provider offering wireline service of 100 Mbps download and 20 Mbps upload. Served locations are locations where there is wireline service of at least 100 Mbps download and 20 Mbps upload. GigC denotes the count of locations that have service at 1000 Mbps download and 100 Mbps upload. Accordingly, total locations is equal to the sum of no service, unserved, underserved, served, and "GigC" locations. Availability also includes fields for average download and upload speeds. These are calculated at the relevant level of census geography based on the maximum for all locations.
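
The availability tiers described above can be expressed as a small classification function. This is a sketch under one stated assumption: the categories are treated as mutually exclusive (so "served" stops below the GigC threshold), which is how the statement that total locations equals the sum of all five fields is read here. The function name, the use of `None` for "no reported service", and the sample speeds are hypothetical.

```python
from collections import Counter
from typing import Optional


def classify_location(down_mbps: Optional[float], up_mbps: Optional[float]) -> str:
    """Map the best available wireline speeds at a location to a tier."""
    if down_mbps is None or up_mbps is None:
        return "no_service"   # no reported service at all
    if down_mbps >= 1000 and up_mbps >= 100:
        return "GigC"         # meets the 1000/100 Mbps state goal
    if down_mbps >= 100 and up_mbps >= 20:
        return "served"       # at least 100/20 Mbps
    if down_mbps >= 25 and up_mbps >= 3:
        return "underserved"  # at least 25/3 but below 100/20 Mbps
    return "unserved"         # service offered, but below 25/3 Mbps


# Fabricated (download, upload) pairs for a handful of locations.
locations = [(None, None), (10, 1), (25, 3), (100, 20), (1000, 100), (1000, 50)]
tier_counts = Counter(classify_location(d, u) for d, u in locations)
total_locations = sum(tier_counts.values())
```

Because every location falls into exactly one tier, the tier counts always add up to the number of locations, mirroring the total described in the metadata.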

The final field included in all availability data is the provider list.

OPM collected adoption data:

OPM collected adoption data uses many of the same naming conventions as the availability data, but there are some notable differences.

Fields named unserved_Sub, underserved_Sub, served_Sub, and GigC_Sub are counts of subscriptions where a particular level of broadband service is currently subscribed to. Unserved subscriptions are subscriptions that do not meet the standard of 25 Mbps download and 3 Mbps upload. Underserved subscriptions are subscriptions with speeds of 25 Mbps download and 3 Mbps upload, but not meeting 100 Mbps download and 20 Mbps upload. Served subscriptions are subscriptions where speeds are between 100 Mbps download and 20 Mbps upload and 1000 Mbps download and 100 Mbps upload. GigC denotes the count of locations that have a subscription at 1000 Mbps download and 100 Mbps upload or higher. For subscription data these locations are NOT included in the "served" field as this does not directly apply to FCC use of the terms.
Data from: Dataset for Cyber-Physical Anomaly Detection in Smart Homes
research-data.cardiff.ac.uk
Updated Sep 19, 2024
Cite
Yasar Majib; Mohammed Alosaimi; Andre Asaturyan; Charith Perera (2024). Dataset for Cyber-Physical Anomaly Detection in Smart Homes [Dataset]. http://doi.org/10.17035/d.2023.0259651425
Unique identifier
https://doi.org/10.17035/d.2023.0259651425
Dataset updated
Sep 19, 2024
Dataset provided by
Cardiff University
Authors
Yasar Majib; Mohammed Alosaimi; Andre Asaturyan; Charith Perera
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Smart homes contain programmable electronic devices (mostly IoT) that enable home automation. People who live in smart homes benefit from interconnected devices by controlling them either remotely or manually/autonomously. However, high interconnectivity comes with an increased attack surface, making the smart home an attractive target for adversaries. NCC Group and the Global Cyber Alliance recorded over 12,000 attacks to log into smart home devices maliciously. Recent statistics show that over 200 million smart homes can be subjected to these attacks. Conventional security systems are either focused on network traffic (e.g., firewalls) or physical environment (e.g., CCTV or basic motion sensors), but not both. A key challenge in developing cyber-physical security systems is the lack of datasets and test beds. For cyber-physical datasets to be meaningful, they need to be collected in real smart home environments. Due to the inherited difficulties and challenges (e.g. effort, costs, test-bed availability), such cyber-physical smart home datasets are quite rare. This paper aims to fill this gap by contributing a dataset we collected in a real smart home with annotated labels. This paper explains the process we followed to collect the data and how we organised them to facilitate wider use within research communities.

A related article can be found at https://doi.org/10.3389/friot.2023.1275080
e
Geography of digital inequality - Dataset - B2FIND
b2find.eudat.eu
Updated Jun 27, 2023
Cite
(2023). Geography of digital inequality - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/aac47d03-1fdf-5f48-8021-627e02f643e9
Explore at:
Dataset updated
Jun 27, 2023
Description
These data consist of measures of Internet use estimated using small area estimation (SAE). The estimation is based on census Output Areas (OAs), using the 2013 Oxford Internet Survey (OxIS) and the 2011 British census. There is an estimate for each OA in Great Britain. By combining the 2013 OxIS survey data with the comprehensive small area coverage of the 2011 British census, we can use the strengths of one to offset the gaps in the other. Specifically, we follow a two-step process. First, we use the information that is reliably available in OxIS to create a model that estimates the proportion of Internet users in OAs. Second, we use the parameters from this model, combined with census data, to estimate the proportion of Internet users in each OA in Britain. Once these estimates are available, we aggregate them up to higher levels of geography. In this way we can estimate Internet use in Glasgow, Manchester and Cardiff as well as other small areas in Britain. This procedure is referred to as indirect, model-based or synthetic estimation. In recent years such SAE techniques have been widely used throughout Europe and North America. See the project website for more details.
The objective of the Geography of Digital Inequality project was to explore the geographical contours of Internet use and penetration in Britain. Specifically, the project assembled from existing datasets a new dataset which contains Internet information at fine-grained geographic levels, census output areas (OAs). From OAs we were able to aggregate to higher geographic levels such as counties, Welsh and Scottish Councils, metropolitan areas, or others. Through this unique dataset we explored digital divides and the geography of the Internet, a capability possessed by no other dataset. Specifically, we explored the extent of use versus non-use of the Internet. Two datasets were used to assemble this dataset.
First, the 2013 Oxford Internet Survey (OxIS) is a random sample of 2,657 people aged 14+ from the British population (England, Scotland & Wales). Interviews were conducted face-to-face by an independent survey research company. The response rate for 2013 was 51%. The data collection was a two-stage sample. A random sample of census output areas (OAs) was selected and respondents were randomly sampled within each selected OA. For details, see "Data collection technical report.pdf" which has been uploaded. We use six variables from OxIS: Internet use, region, age, lifestage, gender and education. The questionnaire for OxIS contains about 300 variables and it is available from the OxIS website; see the URL in the "related resources" section. Second, the 2011 British Census. For information on how the census was conducted, see the census website. The URL for the 2011 census is given below in "related resources".
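The two-step procedure described above can be illustrated with a minimal sketch. It is not the project's actual model: the logistic form, the coefficient values, the use of age band as the only predictor, and the census counts are all hypothetical, chosen only to show how survey-derived parameters might be combined with census counts to give a synthetic estimate for an Output Area and then aggregated upwards.

```python
import math

# Step 1 (assumed already done): a model fitted on survey data.
# These coefficients are hypothetical placeholders, not OxIS results.
INTERCEPT = 2.0
AGE_BAND_EFFECT = {"14-34": 1.0, "35-64": 0.0, "65+": -2.0}


def predicted_use(age_band: str) -> float:
    """Model-based probability that a person in this age band uses the Internet."""
    logit = INTERCEPT + AGE_BAND_EFFECT[age_band]
    return 1.0 / (1.0 + math.exp(-logit))


def estimate_area(census_counts: dict) -> float:
    """Step 2: weight the predicted probabilities by one area's census counts."""
    total = sum(census_counts.values())
    expected_users = sum(n * predicted_use(band) for band, n in census_counts.items())
    return expected_users / total


def aggregate(areas: list) -> float:
    """Population-weighted estimate for a higher geography made up of several areas."""
    total = sum(sum(a.values()) for a in areas)
    expected_users = sum(estimate_area(a) * sum(a.values()) for a in areas)
    return expected_users / total


# Two hypothetical Output Areas with census counts by age band.
oa_young = {"14-34": 120, "35-64": 60, "65+": 20}
oa_old = {"14-34": 30, "35-64": 70, "65+": 100}
print(round(estimate_area(oa_young), 3), round(estimate_area(oa_old), 3))
print(round(aggregate([oa_young, oa_old]), 3))
```

Because the estimate for each area depends only on the model parameters and that area's census counts, an area with no survey respondents at all still receives an estimate, which is the point of the synthetic approach.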
Available Wireless Sensor Network and Internet of Things testbed facilities:...
data.europa.eu
unknown
Updated Oct 7, 2022
Cite
Zenodo (2022). Available Wireless Sensor Network and Internet of Things testbed facilities: dataset [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7157221?locale=cs
Explore at:
Available download formats: unknown (2365963)
Dataset updated
Oct 7, 2022
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this data set, we present data collected for a systematic review of the available Wireless Sensor Network and Internet of Things testbed facilities. The data were collected in multiple stages, with pre-defined criteria applied at each stage. We provide a dataset describing the hardware and software aspects of Wireless Sensor Network and Internet of Things testbed facilities available in the market and scientific community. The data were gathered through an extensive systematic review of scientific articles published between 2011 and 2021. The review aims to provide good-quality data for people who are actively researching Internet of Things facilities or anyone who is interested in that field.
Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...
data.mendeley.com
Updated Jul 25, 2022
+ more versions
Cite
Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
Explore at:
Unique identifier
https://doi.org/10.17632/xmcg82mx9k.3
Dataset updated
Jul 25, 2022
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

Abstract

The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that have been posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

Data Description

The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
• Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
• Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
• Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
• Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
• Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
• Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
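As a sanity check before hydration, the files can be read and their sizes compared with the counts stated in the description. The sketch below is only illustrative: it assumes each .txt file holds one Tweet ID per line, which the description does not state, and the helper names `read_tweet_ids` and `check_counts` are hypothetical. Hydration itself (retrieving the full tweets from their IDs) needs access to Twitter's API and is not shown.

```python
from pathlib import Path

# Per-file counts exactly as stated in the dataset description.
EXPECTED_COUNTS = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 47718,
    "TweetIDs_Part6.txt": 138711,
}


def read_tweet_ids(path: Path) -> list:
    """Return the non-empty lines of a file (assumes one Tweet ID per line)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_counts(folder: Path) -> dict:
    """Map each file found in `folder` to whether its ID count matches the stated one."""
    results = {}
    for name, expected in EXPECTED_COUNTS.items():
        path = folder / name
        if path.exists():  # files that have not been downloaded are skipped
            results[name] = len(read_tweet_ids(path)) == expected
    return results


# The stated per-file counts add up to the stated total of 255,363 Tweet IDs.
print(sum(EXPECTED_COUNTS.values()))
print(check_counts(Path(".")))
```

IDs are kept as strings rather than integers so that they pass unchanged to whatever hydration tool is used afterwards.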
ICA243 - Percentage of Internet users who purchased Travel/Culture related...
datasalsa.com
csv, json-stat, px +1
Updated Jan 4, 2025
+ more versions
Cite
Central Statistics Office (2025). ICA243 - Percentage of Internet users who purchased Travel/Culture related services online in the previous 3 months [Dataset]. https://datasalsa.com/dataset/?catalogue=data.gov.ie&name=ica243-rnet-users-who-purchased-travelculture-related-services-online-in-the-previous-3-months-8219
Explore at:
Available download formats: csv, px, json-stat, xlsx
Dataset updated
Jan 4, 2025
Dataset authored and provided by
Central Statistics Office
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Aug 8, 2025
Description
ICA243 - Percentage of Internet users who purchased Travel/Culture related services online in the previous 3 months. Published by Central Statistics Office. Available under the license Creative Commons Attribution 4.0 (CC-BY-4.0). Percentage of Internet users who purchased Travel/Culture related services online in the previous 3 months...
Internet Access Technology Options
data.ccrpc.org
csv
Updated Jun 3, 2022
Cite
Champaign County Regional Planning Commission (2022). Internet Access Technology Options [Dataset]. https://data.ccrpc.org/dataset/internet-access-options
Explore at:
Available download formats: csv
Dataset updated
Jun 3, 2022
Dataset authored and provided by
Champaign County Regional Planning Commission
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
The Internet access indicator measures the prevalence of different Internet technology options available in Champaign County, Illinois, and the U.S., at two different speeds: 4/1 Mbps and 25/3 Mbps.

Seven types of connection options are evaluated: ADSL, cable, fiber, fixed wireless, satellite, "other" technology, and "any" technology, which includes the previous six options.

Satellite internet, at both speeds, is the most widely available in all three areas. One hundred percent of Champaign County residents have access to satellite internet at both speeds. Cable internet is also widely available across all three areas, and over 90 percent of Champaign County residents have access to cable internet. Fiber internet is the least widely available type of technology, aside from "other" technology. However, fiber internet is now available to almost 38 percent of Champaign County residents as of December 2020, an increase from approximately 25 percent in June 2020.

The ability of Champaign County residents to access the Internet has become key in many facets of life, especially during the COVID-19 pandemic. Internet access provides economic, educational, and social opportunities; having or not having Internet access has become not only a technological issue, but an equity issue.

This data was retrieved from the Federal Communications Commission’s Fixed Broadband Deployment Area Comparison, and dates from December 2020.

Source: Federal Communications Commission. (2020). Fixed Broadband Deployment. Area Comparison. https://broadbandmap.fcc.gov/#/. (Accessed 3 June 2022).
Job Accommodation Network Datasets
catalog.data.gov
datasets.ai
Updated Aug 12, 2023
Cite
Office of Disability Employment Policy (2023). Job Accommodation Network Datasets [Dataset]. https://catalog.data.gov/dataset/job-accommodation-network-datasets-72ca8
Explore at:
Dataset updated
Aug 12, 2023
Dataset provided by
Office of Disability Employment Policy
Description
Data collected from interviews with employers, professionals, self-employed individuals, and individual workers who have been assisted by the Job Accommodation Network (JAN).


