100+ datasets found

Number of internet users worldwide 2014-2029
statista.com
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statistahttp://statista.com/
Authors
Statista Research Department
Area covered
World
Description
The global number of internet users in was forecast to continuously increase between 2024 and 2029 by in total 1.3 billion users (+23.66 percent). After the fifteenth consecutive increasing year, the number of users is estimated to reach 7 billion users and therefore a new peak in 2029. Notably, the number of internet users of was continuously increasing over the past years.Depicted is the estimated number of individuals in the country or region at hand, that use the internet. As the datasource clarifies, connection quality and usage frequency are distinct aspects, not taken into account here.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of internet users in countries like the Americas and Asia.
Attitudes towards the internet in Japan 2025
statista.com
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umair Bashir (2025). Attitudes towards the internet in Japan 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statistahttp://statista.com/
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Japanese respondents pick "I'm concerned that my data is being misused on the internet" as an answer. 35 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Attitudes towards the internet in Mexico 2025
statista.com
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umair Bashir (2025). Attitudes towards the internet in Mexico 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statistahttp://statista.com/
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place" as an answer. 56 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
My Digital Footprint
kaggle.com
zip
Updated Jun 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Girish (2023). My Digital Footprint [Dataset]. https://www.kaggle.com/datasets/girish17019/my-digital-footprint
Explore at:
zip(874430159 bytes)Available download formats
Dataset updated
Jun 29, 2023
Authors
Girish
Description
Dataset Info:

MyDigitalFootprint (MDF) is a novel large-scale dataset composed of smartphone embedded sensors data, physical proximity information, and Online Social Networks interactions aimed at supporting multimodal context-recognition and social relationships modelling in mobile environments. The dataset includes two months of measurements and information collected from the personal mobile devices of 31 volunteer users by following the in-the-wild data collection approach: the data has been collected in the users' natural environment, without limiting their usual behaviour. Existing public datasets generally consist of a limited set of context data, aimed at optimising specific application domains (human activity recognition is the most common example). On the contrary, the dataset contains a comprehensive set of information describing the user context in the mobile environment.

The complete analysis of the data contained in MDF has been presented in the following publication:

https://www.sciencedirect.com/science/article/abs/pii/S1574119220301383?via%3Dihub

The full anonymised dataset is contained in the folder MDF. Moreover, in order to demonstrate the efficacy of MDF, there are three proof of concept context-aware applications based on different machine learning tasks:

A social link prediction algorithm based on physical proximity data,

The recognition of daily-life activities based on smartphone-embedded sensors data,

A pervasive context-aware recommender system.

For the sake of reproducibility, the data used to evaluate the proof-of-concept applications are contained in the folders link-prediction, context-recognition, and cars, respectively.
Data from: Evaluation of Internet Safety Materials Used by Internet Crimes...
catalog.data.gov
icpsr.umich.edu
Updated Mar 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Justice (2025). Evaluation of Internet Safety Materials Used by Internet Crimes Against Children (ICAC) Task Forces in School and Community Settings, 2011-2012 [United States] [Dataset]. https://catalog.data.gov/dataset/evaluation-of-internet-safety-materials-used-by-internet-crimes-against-children-icac-2011
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justicehttp://nij.ojp.gov/
Area covered
United States
Description
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. The purpose of this study was to conduct content and process evaluations of current internet safety education (ISE) program materials and their use by law enforcement presenters and schools. The study was divided into four sub-projects. First, a systematic review or "meta-synthesis" was conducted to identify effective elements of prevention identified by the research across different youth problem areas such as drug abuse, sex education, smoking prevention, suicide, youth violence, and school failure. The process resulted in the development of a KEEP (Known Elements of Effective Prevention) Checklist. Second, a content analysis was conducted on four of the most well-developed and long-standing youth internet safety curricula: i-SAFE, iKeepSafe, Netsmartz, and Web Wise Kids. Third, a process evaluation was conducted to better understand how internet safety education programs are being implemented. The process evaluation was conducted via national surveys with three different groups of respondents: Internet Crimes Against Children (ICAC) Task Force commanders (N=43), ICAC Task Force presenters (N=91), and a sample of school professionals (N=139). Finally, researchers developed an internet safety education outcome survey focused on online harassment and digital citizenship. The intention for creating and piloting this survey was to provide the field with a research-based tool that can be used in future evaluation and program monitoring efforts.

IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

zenodo.org
data.niaid.nih.gov

Updated Aug 30, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa (2024). IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT [Dataset]. http://doi.org/10.5281/zenodo.8116338

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.8116338

Dataset updated

Aug 30, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

José Areia; José Areia; Ivo Afonso Bispo; Ivo Afonso Bispo; Leonel Santos; Leonel Santos; Rogério Luís Costa; Rogério Luís Costa

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Article Information

The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.

Please do cite the aforementioned article when using this dataset.

Abstract

The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and provided it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.

ZIP Folder Content

The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.

To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.

This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.

Datasets' Content

Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.

Identified Key Features Within Bluetooth Dataset

Feature	Meaning
btle.advertising_header	BLE Advertising Packet Header
btle.advertising_header.ch_sel	BLE Advertising Channel Selection Algorithm
btle.advertising_header.length	BLE Advertising Length
btle.advertising_header.pdu_type	BLE Advertising PDU Type
btle.advertising_header.randomized_rx	BLE Advertising Rx Address
btle.advertising_header.randomized_tx	BLE Advertising Tx Address
btle.advertising_header.rfu.1	Reserved For Future 1
btle.advertising_header.rfu.2	Reserved For Future 2
btle.advertising_header.rfu.3	Reserved For Future 3
btle.advertising_header.rfu.4	Reserved For Future 4
btle.control.instant	Instant Value Within a BLE Control Packet
btle.crc.incorrect	Incorrect CRC
btle.extended_advertising	Advertiser Data Information
btle.extended_advertising.did	Advertiser Data Identifier
btle.extended_advertising.sid	Advertiser Set Identifier
btle.length	BLE Length
frame.cap_len	Frame Length Stored Into the Capture File
frame.interface_id	Interface ID
frame.len	Frame Length Wire
nordic_ble.board_id	Board ID
nordic_ble.channel	Channel Index
nordic_ble.crcok	Indicates if CRC is Correct
nordic_ble.flags	Flags
nordic_ble.packet_counter	Packet Counter
nordic_ble.packet_time	Packet time (start to end)
nordic_ble.phy	PHY
nordic_ble.protover	Protocol Version

Identified Key Features Within IP-Based Packets Dataset

Feature	Meaning
http.content_length	Length of content in an HTTP response
http.request	HTTP request being made
http.response.code	Sequential number of an HTTP response
http.response_number	Sequential number of an HTTP response
http.time	Time taken for an HTTP transaction
tcp.analysis.initial_rtt	Initial round-trip time for TCP connection
tcp.connection.fin	TCP connection termination with a FIN flag
tcp.connection.syn	TCP connection initiation with SYN flag
tcp.connection.synack	TCP connection establishment with SYN-ACK flags
tcp.flags.cwr	Congestion Window Reduced flag in TCP
tcp.flags.ecn	Explicit Congestion Notification flag in TCP
tcp.flags.fin	FIN flag in TCP
tcp.flags.ns	Nonce Sum flag in TCP
tcp.flags.res	Reserved flags in TCP
tcp.flags.syn	SYN flag in TCP
tcp.flags.urg	Urgent flag in TCP
tcp.urgent_pointer	Pointer to urgent data in TCP
ip.frag_offset	Fragment offset in IP packets
eth.dst.ig	Ethernet destination is in the internal network group
eth.src.ig	Ethernet source is in the internal network group
eth.src.lg	Ethernet source is in the local network group
eth.src_not_group	Ethernet source is not in any network group
arp.isannouncement	Indicates if an ARP message is an announcement

Identified Key Features Within IP-Based Flows Dataset

Feature	Meaning
proto	Transport layer protocol of the connection
service	Identification of an application protocol
orig_bytes	Originator payload bytes
resp_bytes	Responder payload bytes
history	Connection state history
orig_pkts	Originator sent packets
resp_pkts	Responder sent packets
flow_duration	Length of the flow in seconds
fwd_pkts_tot	Forward packets total
bwd_pkts_tot	Backward packets total
fwd_data_pkts_tot	Forward data packets total
bwd_data_pkts_tot	Backward data packets total
fwd_pkts_per_sec	Forward packets per second
bwd_pkts_per_sec	Backward packets per second
flow_pkts_per_sec	Flow packets per second
fwd_header_size	Forward header bytes
bwd_header_size	Backward header bytes
fwd_pkts_payload	Forward payload bytes
bwd_pkts_payload	Backward payload bytes
flow_pkts_payload	Flow payload bytes
fwd_iat	Forward inter-arrival time
bwd_iat	Backward inter-arrival time
flow_iat	Flow inter-arrival time
active	Flow active duration

G
Main benefits of Information and Communication Technology use by industry...
open.canada.ca
www150.statcan.gc.ca
+1more
csv, html, xml
Updated Jan 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statistics Canada (2023). Main benefits of Information and Communication Technology use by industry and size of enterprise [Dataset]. https://open.canada.ca/data/en/dataset/fa806ccd-6735-42e0-ad81-1f19cbfde560
Explore at:
xml, csv, htmlAvailable download formats
Dataset updated
Jan 17, 2023
Dataset provided by
Statistics Canada
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
Digital technology and Internet use, main benefits of Information and Communication Technology (ICT) use, by North American Industry Classification System (NAICS) and size of enterprise for Canada in 2012.
10000 Most Common Passwords
kaggle.com
Updated Dec 8, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shivam Bansal (2021). 10000 Most Common Passwords [Dataset]. https://www.kaggle.com/shivamb/10000-most-common-passwords/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 8, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shivam Bansal
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
10000 Most Common Passwords

If your password is on this list of 10,000 most common passwords, you need a new password. A hacker can use or generate files like this, which may readily be compiled from breaches of sites such as Ashley Madison. Usually, passwords are not tried one-by-one against a system's secure server online; instead, a hacker might manage to gain access to a shadowed password file protected by a one-way encryption algorithm, then test each entry in a file like this to see whether it encrypted form matches what the server has on record. The passwords may then be tried against any account online that can be linked to the first, to test for passwords reused on other sites.

Acknowledgements

The dataset was procured by SecLists. SecLists is the security tester's companion. It's a collection of multiple types of lists used during security assessments, collected in one place. List types include usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more. The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.
Most downloaded Zenodo datasets
kaggle.com
Updated Feb 6, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Gorgolewski (2020). Most downloaded Zenodo datasets [Dataset]. https://www.kaggle.com/chrisfilo/most-downloaded-zenodo-datasets/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 6, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Chris Gorgolewski
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Zenodo.org is a popular data repository hosted by CERN. There are tens of thousands of datasets in the repository, but not all of them are used to the same extent.

Content

This dataset includes names and links to the top 500 most downloaded datasets on Zenodo.

Inspiration

This dataset can be used to find datasets deposited on zenodo that would benefit from additional exposure to the DS/ML community by uploading them to Kaggle.
d
Custom dataset from any website on the Internet
datarade.ai
Updated Sep 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ScrapeLabs (2022). Custom dataset from any website on the Internet [Dataset]. https://datarade.ai/data-products/custom-dataset-from-any-website-on-the-internet-scrapelabs
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Sep 21, 2022
Dataset authored and provided by
ScrapeLabs
Area covered
Kazakhstan, Bulgaria, India, Tunisia, Lebanon, Aruba, Guinea-Bissau, Jordan, Turks and Caicos Islands, Argentina
Description
We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.

Some common use cases our customers use the data for: • Data Analysis • Market Research • Price Monitoring • Sales Leads • Competitor Analysis • Recruitment

We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.

Receive data in any format you need: Excel, CSV, JSON, or any other.
P
Data from: Dataset to "Easing the Conscience with OPC UA: An Internet-Wide...
paperswithcode.com
opendatalab.com
Updated Oct 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Dataset to "Easing the Conscience with OPC UA: An Internet-Wide Study on Insecure Deployments" Dataset [Dataset]. https://paperswithcode.com/dataset/dataset-to-easing-the-conscience-with-opc-ua
Explore at:
Dataset updated
Oct 30, 2020
Description
This is the dataset to "Easing the Conscience with OPC UA: An Internet-Wide Study on Insecure Deployments" [In ACM Internet Measurement Conference (IMC ’20)]. It contains our weekly scanning results between 2020-02-09 and 2020-08-31 complied using our zgrab2 extensions, i.e, it contains an Internet-wide view on OPC UA deployments and their security configurations. To compile the dataset, we anonymized the output of zgrab2, i.e., we removed host and network identifiers from that dataset. More precisely, we mapped all IP addresses, fully qualified hostnames, and autonomous system IDs to numbers as well as removed certificates containing any identifiers. See the README file for more information. Using this dataset we showed that 93% of Internet-facing OPC UA deployments have problematic security configurations, e.g., missing access control (on 24% of hosts), disabled security functionality (24%), or use of deprecated cryptographic primitives (25%). Furthermore, we discover several hundred devices in multiple autonomous systems sharing the same security certificate, opening the door for impersonation attacks. Overall, with the analysis of this dataset we underpinned that secure protocols, in general, are no guarantee for secure deployments if they need to be configured correctly following regularly updated guidelines that account for basic primitives losing their security promises.
ACS Internet Access by Education Variables - Boundaries
covid-hub.gio.georgia.gov
mapdirect-fdep.opendata.arcgis.com
+2more
Updated Dec 7, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esri (2018). ACS Internet Access by Education Variables - Boundaries [Dataset]. https://covid-hub.gio.georgia.gov/maps/62faad5b76b04b90adf47c020d7406ba
Explore at:
Dataset updated
Dec 7, 2018
Dataset authored and provided by
Esrihttp://esri.com/
Area covered

Description
This layer shows computer ownership and internet access by education. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percent of the population age 25+ who are high school graduates (includes equivalency) and have some college or associate's degree in households that have no computer. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right. Current Vintage: 2019-2023ACS Table(s): B28006 Data downloaded from: Census Bureau's API for American Community Survey Date of API call: December 12, 2024National Figures: data.census.govThe United States Census Bureau's American Community Survey (ACS):About the SurveyGeography & ACSTechnical DocumentationNews & UpdatesThis ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.Data Note from the Census:Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.Data Processing Notes:This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).The States layer contains 52 records - all US states, Washington D.C., and Puerto RicoCensus tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.The estimate is controlled. A statistical test for sampling variability is not appropriate.The data for this geographic area cannot be displayed because the number of sample cases is too small.
English Word Frequency
kaggle.com
Updated Sep 6, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/datasets/rtatman/english-word-frequency/discussion?sortBy=hot
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 6, 2017
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rachael Tatman
Description
Context:

How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent one and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.

Content:

This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.

Acknowledgements:

Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

The code used to generate this dataset is distributed under the MIT License.

Inspiration:

Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?

What differences are there between the very frequent words in this dataset, and the the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?
Comparative Reviews Dataset's
kaggle.com
zip
Updated Jan 22, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umair Younis (2019). Comparative Reviews Dataset's [Dataset]. https://www.kaggle.com/umairyounis/comparative-reviews-datasets
Explore at:
zip(205233 bytes)Available download formats
Dataset updated
Jan 22, 2019
Authors
Umair Younis
Description
Context

To get improved results on Machine Learning Algorithms, and other techniques used in Data Mining.

Content

Comprises of two columns, the First row consists of comparative reviews, the second row contains polarities.

Acknowledgements

I pay thanks to my supervisor, Dr Muhammad Zubair Asghar, Assitant Professor, ICIT, Gomal University (KPK). Di.Khan. Without his guidance, I can't accomplish this task.

Inspiration

Comparative opinion mining is becoming the most popular research area in the field of Data Mining. These three comparative reviews datasets will help the researchers who are working in the area of opinion mining and sentiment analysis.
Attitudes towards the internet in China 2025
statista.com
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umair Bashir (2025). Attitudes towards the internet in China 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Statistahttp://statista.com/
Authors
Umair Bashir
Description
When asked about "Attitudes towards the internet", most Chinese respondents pick "It is important to me to have mobile internet access in any place" as an answer. 48 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
Data from: NIST Internet Time Service
catalog.data.gov
datasets.ai
+3more
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). NIST Internet Time Service [Dataset]. https://catalog.data.gov/dataset/nist-internet-time-service-ad780
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
Distributes NIST estimate of official U.S. time over the Internet in real time, using Network Time Protocol (NTP) and other time data formats to automatically synchronize clocks in computers and network devices to official U.S. time as realized by NIST several billions of times per day. This official U.S. time is the NIST estimate of Coordinated Universal Time (UTC), and called UTC(NIST). The accuracy of UTC(NIST) as distributed through the Internet Time Service (ITS) is on the order of 0.001 seconds (one millisecond), although accuracy can vary depending on network conditions and other parameters. Note that unlike most traditional datasets, time is intrinsically a transient, ever-changing quantity. As soon as UTC(NIST) is transmitted to a client, that particular value of UTC(NIST) no longer reflects the current time, which is constantly changing. There is thus no static storage of any time data, apart from internal diagnostic information not released to the public which ensures that UTC(NIST) as disseminated through the Internet Time Service (ITS) is commensurate with the official UTC(NIST) realization within the uncertainties of the system. The vast majority of UTC(NIST) information distributed through ITS is provided freely, anonymously and automatically to the public. Any IP address can request UTC(NIST) through the ITS and the information is automatically and anonymously provided at no cost to the user. Full documentation of the ITS including all the source code is available to the public through the web site http://www.nist.gov/pml/div688/.NIST provides an authenticated version of ITS to a limited number of users (approximately 500 users near the end of calendar year 2015) who for various reasons want to ensure they are receiving UTC(NIST) without spoofing or interference with the information. This service uses public key encryption for the set of registered users to provide authenticated UTC(NIST).
Z
Dataset of DNS over HTTPS (DoH) Internet Servers
data.niaid.nih.gov
data.mendeley.com
+1more
Updated May 9, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joaquín Bogado (2022). Dataset of DNS over HTTPS (DoH) Internet Servers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6517360
Explore at:
Dataset updated
May 9, 2022
Dataset provided by
Karel Hynek
Dmitrii Vekshin
Joaquín Bogado
Armin Wasicek
Sebastián García
Tomas Cejka
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description

The DoH Internet Servers dataset comprises a verified list of Internet servers offering DNS over HTTPS (DoH). This is an updated 10.17632/ny4m53g6bw.1 The list was created through the aggregation of a previously existing, but incomplete, list of DoH servers. The servers in this dataset went through a verification phase where it was confirmed they were active and working as advertised. The verification was done between May 1st, 2022, and May 4th, 2022. The dataset contains a total of 254 unique DoH servers, out of which 136 are over IPv4 and 118 over IPv6. The DoH servers belong to 59 unique Autonomous Systems and are associated with a total of 106 unique domain names.

The following public lists of existing DoH servers were used to create this dataset:

https://developers.google.com/speed/public-dns/docs/doh/json

https://blog.nightly.mozilla.org/2018/06/01/improving-dns-privacy-in-firefox/

https://github.com/curl/curl/wiki/DNS-over-HTTPS

https://help.keenetic.com/hc/en-us/articles/360007687159-DNS-over-TLS-and-DNS-over-HTTPS-proxy-servers-for-DNS-requests-encryption

https://dnsprivacy.org/wiki/display/DP/DNS+Privacy+Public+Resolvers

https://kb.adguard.com/en/general/dns-providers

https://applied-privacy.net/services/dns/

https://www.pacnog.org/pacnog24/presentations/DoT-DoH-DNS-Privacy.pdf

https://www.privacytools.io/providers/dns/

The verification of the DoH servers was performed using a custom-made python script. The script is available at: https://github.com/stratosphereips/DoH-Research/tree/main/validation-script
Z
A dataset of media releases (Twitter, News and Comments, Youtube, Facebook)...
data.niaid.nih.gov
Updated Mar 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Jarynowski (2021). A dataset of media releases (Twitter, News and Comments, Youtube, Facebook) form Poland related to COVID-19 for open research [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3985567
Explore at:
Dataset updated
Mar 29, 2021
Dataset authored and provided by
Andrzej Jarynowski
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Poland, YouTube
Description
Social behavior has a fundamental impact on the dynamics of infectious diseases (such as COVID-19), challenging public health mitigation strategies and possibly the political consensus. The widespread use of the traditional and social media on the Internet provides us with an invaluable source of information on societal dynamics during pandemics. With this dataset, we aim to understand mechanisms of COVID-19 epidemic-related social behavior in Poland deploying methods of computational social science and digital epidemiology. We have collected and analyzed COVID-19 perception on the Polish language Internet during 15.01-31.07(06.08) and labeled data quantitatively (Twitter, Youtube, Articles) and qualitatively (Facebook, Articles and Comments of Article) in the Internet by infomediological approach.

manually labelled1,449 articles / Facebook posts from Lower Silesia (facebook_articles_lower_silesia.zip) and 111 texts from outside this region;

-manually labelled 1000 most popular tweets (twits_annotated.xlsx) with cathegories is_fake (categorical and numeric) topic and sentiment;

-extracted 57,306 representative articles (articles_till_06_08.zip) in Polish using Eventregitry.org tool in language Polish and topic "Coronavirus" in article body;

extracted 1,015,199 (tweets_till_31_07_users.zip and tweets_till_31_07_text.zip) and Tweets from #Koronawirus in language Polish using Twitter API.

collected 1,574 videos (youtube_comments_till_31_07.zip and youtube_movie.csv) with keyword: Koronawirus on YouTube and 247,575 comments on them using Google API;

We supplemented the media observations with an analysis of 244 social empirical studies till 25.05 on COVID-19 in Poland (empirical_social_studies.csv).

Reports and analyzes and coding books can be found in Polish at: http://www.infodemia-koronawirusa.pl

Main report (in Polish) https://depot.ceon.pl/handle/123456789/19215

Data from: WikiReddit: Tracing Information and Attention Flows Between...

zenodo.org

bin

Updated May 4, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

May 4, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final

Random sample of Common Crawl domains from 2021
kaggle.com
Updated Aug 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 19, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
HiHarshSinghal
Description
Context

Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

Content

I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/

Acknowledgements

Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

Inspiration

My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).

I am also interested in identifying fraudulent domains and understanding malicious URL patterns.

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/

Number of internet users worldwide 2014-2029

Explore at:

304 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Apr 11, 2025

Dataset provided by

Statistahttp://statista.com/

Authors

Statista Research Department

Area covered

World

Description

The global number of internet users in was forecast to continuously increase between 2024 and 2029 by in total 1.3 billion users (+23.66 percent). After the fifteenth consecutive increasing year, the number of users is estimated to reach 7 billion users and therefore a new peak in 2029. Notably, the number of internet users of was continuously increasing over the past years.Depicted is the estimated number of individuals in the country or region at hand, that use the internet. As the datasource clarifies, connection quality and usage frequency are distinct aspects, not taken into account here.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of internet users in countries like the Americas and Asia.

Clear search

Close search

Google apps

Main menu

Number of internet users worldwide 2014-2029

Attitudes towards the internet in Japan 2025

Attitudes towards the internet in Mexico 2025

My Digital Footprint

Dataset Info:

Data from: Evaluation of Internet Safety Materials Used by Internet Crimes...

IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

Article Information

Abstract

ZIP Folder Content

Datasets' Content

Main benefits of Information and Communication Technology use by industry...

10000 Most Common Passwords

10000 Most Common Passwords

Acknowledgements

Most downloaded Zenodo datasets

Context

Content

Inspiration

Custom dataset from any website on the Internet

Data from: Dataset to "Easing the Conscience with OPC UA: An Internet-Wide...

ACS Internet Access by Education Variables - Boundaries

English Word Frequency

Context:

Content:

Acknowledgements:

Inspiration:

Comparative Reviews Dataset's

Context

Content

Acknowledgements

Inspiration

Attitudes towards the internet in China 2025

Data from: NIST Internet Time Service

Dataset of DNS over HTTPS (DoH) Internet Servers

A dataset of media releases (Twitter, News and Comments, Youtube, Facebook)...

Data from: WikiReddit: Tracing Information and Attention Flows Between...

Preprint

Abstract

Datasheet

Motivation

Composition

Collection Process

Preprocessing/cleaning/labeling

Uses

Distribution

Maintenance

SQL Database Schema

Table: posts

Table: comments

Table: postlinks

Table: commentlinks

Random sample of Common Crawl domains from 2021

Context

Content

Acknowledgements

Inspiration

Number of internet users worldwide 2014-2029

Table: `posts`

Table: `comments`

Table: `postlinks`

Table: `commentlinks`