100+ datasets found
  1. Number of internet users worldwide 2014-2029

    • statista.com
    Updated Apr 11, 2025
    Cite
    Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Statista Research Department
    Area covered
    World
    Description

    The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion in 2029, a new peak. Notably, the number of internet users has increased continuously over the past years.

    Depicted is the estimated number of individuals in the country or region at hand who use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects that are not taken into account here.

    The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).

    Find more key insights on the number of internet users in regions such as the Americas and Asia.

  2. Attitudes towards the internet in Japan 2025

    • statista.com
    Updated Apr 11, 2025
    Cite
    Umair Bashir (2025). Attitudes towards the internet in Japan 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Umair Bashir
    Description

    When asked about "Attitudes towards the internet", most Japanese respondents pick "I'm concerned that my data is being misused on the internet" as an answer. 35 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.

  3. Attitudes towards the internet in Mexico 2025

    • statista.com
    Updated Apr 11, 2025
    Cite
    Umair Bashir (2025). Attitudes towards the internet in Mexico 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Umair Bashir
    Description

    When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place" as an answer. 56 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.

  4. My Digital Footprint

    • kaggle.com
    zip
    Updated Jun 29, 2023
    Cite
    Girish (2023). My Digital Footprint [Dataset]. https://www.kaggle.com/datasets/girish17019/my-digital-footprint
    Explore at:
    Available download formats: zip (874430159 bytes)
    Dataset updated
    Jun 29, 2023
    Authors
    Girish
    Description

    Dataset Info:

    MyDigitalFootprint (MDF) is a novel large-scale dataset composed of smartphone embedded sensors data, physical proximity information, and Online Social Networks interactions aimed at supporting multimodal context-recognition and social relationships modelling in mobile environments. The dataset includes two months of measurements and information collected from the personal mobile devices of 31 volunteer users by following the in-the-wild data collection approach: the data has been collected in the users' natural environment, without limiting their usual behaviour. Existing public datasets generally consist of a limited set of context data, aimed at optimising specific application domains (human activity recognition is the most common example). On the contrary, the dataset contains a comprehensive set of information describing the user context in the mobile environment.

    The complete analysis of the data contained in MDF has been presented in the following publication:

    https://www.sciencedirect.com/science/article/abs/pii/S1574119220301383?via%3Dihub

    The full anonymised dataset is contained in the folder MDF. Moreover, in order to demonstrate the efficacy of MDF, there are three proof of concept context-aware applications based on different machine learning tasks:

    1. A social link prediction algorithm based on physical proximity data,
    2. The recognition of daily-life activities based on smartphone-embedded sensors data,
    3. A pervasive context-aware recommender system.

    For the sake of reproducibility, the data used to evaluate the proof-of-concept applications are contained in the folders link-prediction, context-recognition, and cars, respectively.

  5. Data from: Evaluation of Internet Safety Materials Used by Internet Crimes...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). Evaluation of Internet Safety Materials Used by Internet Crimes Against Children (ICAC) Task Forces in School and Community Settings, 2011-2012 [United States] [Dataset]. https://catalog.data.gov/dataset/evaluation-of-internet-safety-materials-used-by-internet-crimes-against-children-icac-2011
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    United States
    Description

    These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. The purpose of this study was to conduct content and process evaluations of current internet safety education (ISE) program materials and their use by law enforcement presenters and schools. The study was divided into four sub-projects. First, a systematic review or "meta-synthesis" was conducted to identify effective elements of prevention identified by the research across different youth problem areas such as drug abuse, sex education, smoking prevention, suicide, youth violence, and school failure. The process resulted in the development of a KEEP (Known Elements of Effective Prevention) Checklist. Second, a content analysis was conducted on four of the most well-developed and long-standing youth internet safety curricula: i-SAFE, iKeepSafe, Netsmartz, and Web Wise Kids. Third, a process evaluation was conducted to better understand how internet safety education programs are being implemented. The process evaluation was conducted via national surveys with three different groups of respondents: Internet Crimes Against Children (ICAC) Task Force commanders (N=43), ICAC Task Force presenters (N=91), and a sample of school professionals (N=139). Finally, researchers developed an internet safety education outcome survey focused on online harassment and digital citizenship. The intention for creating and piloting this survey was to provide the field with a research-based tool that can be used in future evaluation and program monitoring efforts.

  6. IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT

    • zenodo.org
    • data.niaid.nih.gov
    Updated Aug 30, 2024
    Cite
    José Areia; Ivo Afonso Bispo; Leonel Santos; Rogério Luís Costa (2024). IoMT-TrafficData: A Dataset for Benchmarking Intrusion Detection in IoMT [Dataset]. http://doi.org/10.5281/zenodo.8116338
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José Areia; Ivo Afonso Bispo; Leonel Santos; Rogério Luís Costa
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Article Information

    The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.

    Please do cite the aforementioned article when using this dataset.

    Abstract

    The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.

    ZIP Folder Content

    The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.

    To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.

    This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
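
    The practical difference between the two formats can be sketched as follows. This is a minimal illustration using only the Python standard library; the records are invented, the column names are merely borrowed from the feature tables further down, and nothing here reflects the actual layout of the files in the ZIP folder.

```python
import csv
import io
import pickle

# Hypothetical flow records: values are invented for illustration only.
records = [
    {"proto": "tcp", "orig_bytes": 120, "flow_duration": 0.25},
    {"proto": "udp", "orig_bytes": 64, "flow_duration": 1.5},
]


def roundtrip_csv(rows):
    """Write the rows to an in-memory CSV and read them back."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return [dict(row) for row in csv.DictReader(buffer)]


def roundtrip_pickle(rows):
    """Serialize the rows with pickle and load them back."""
    return pickle.loads(pickle.dumps(rows))


csv_rows = roundtrip_csv(records)
pickle_rows = roundtrip_pickle(records)

# CSV stores text only, so numeric columns come back as strings.
print(type(csv_rows[0]["orig_bytes"]).__name__)     # str
# Pickle restores the original Python objects and their types.
print(type(pickle_rows[0]["orig_bytes"]).__name__)  # int
```

    Pickle files can execute arbitrary code when loaded, so they should only be opened when they come from a trusted source; CSV is the safer choice for data of unknown origin.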

    Datasets' Content

    Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.

    Identified Key Features Within Bluetooth Dataset

    Feature                                Meaning
    btle.advertising_header                BLE Advertising Packet Header
    btle.advertising_header.ch_sel         BLE Advertising Channel Selection Algorithm
    btle.advertising_header.length         BLE Advertising Length
    btle.advertising_header.pdu_type       BLE Advertising PDU Type
    btle.advertising_header.randomized_rx  BLE Advertising Rx Address
    btle.advertising_header.randomized_tx  BLE Advertising Tx Address
    btle.advertising_header.rfu.1          Reserved For Future 1
    btle.advertising_header.rfu.2          Reserved For Future 2
    btle.advertising_header.rfu.3          Reserved For Future 3
    btle.advertising_header.rfu.4          Reserved For Future 4
    btle.control.instant                   Instant Value Within a BLE Control Packet
    btle.crc.incorrect                     Incorrect CRC
    btle.extended_advertising              Advertiser Data Information
    btle.extended_advertising.did          Advertiser Data Identifier
    btle.extended_advertising.sid          Advertiser Set Identifier
    btle.length                            BLE Length
    frame.cap_len                          Frame Length Stored Into the Capture File
    frame.interface_id                     Interface ID
    frame.len                              Frame Length Wire
    nordic_ble.board_id                    Board ID
    nordic_ble.channel                     Channel Index
    nordic_ble.crcok                       Indicates if CRC is Correct
    nordic_ble.flags                       Flags
    nordic_ble.packet_counter              Packet Counter
    nordic_ble.packet_time                 Packet time (start to end)
    nordic_ble.phy                         PHY
    nordic_ble.protover                    Protocol Version

    Identified Key Features Within IP-Based Packets Dataset

    Feature                   Meaning
    http.content_length       Length of content in an HTTP response
    http.request              HTTP request being made
    http.response.code        Status code of an HTTP response
    http.response_number      Sequential number of an HTTP response
    http.time                 Time taken for an HTTP transaction
    tcp.analysis.initial_rtt  Initial round-trip time for TCP connection
    tcp.connection.fin        TCP connection termination with a FIN flag
    tcp.connection.syn        TCP connection initiation with SYN flag
    tcp.connection.synack     TCP connection establishment with SYN-ACK flags
    tcp.flags.cwr             Congestion Window Reduced flag in TCP
    tcp.flags.ecn             Explicit Congestion Notification flag in TCP
    tcp.flags.fin             FIN flag in TCP
    tcp.flags.ns              Nonce Sum flag in TCP
    tcp.flags.res             Reserved flags in TCP
    tcp.flags.syn             SYN flag in TCP
    tcp.flags.urg             Urgent flag in TCP
    tcp.urgent_pointer        Pointer to urgent data in TCP
    ip.frag_offset            Fragment offset in IP packets
    eth.dst.ig                Ethernet destination is in the internal network group
    eth.src.ig                Ethernet source is in the internal network group
    eth.src.lg                Ethernet source is in the local network group
    eth.src_not_group         Ethernet source is not in any network group
    arp.isannouncement        Indicates if an ARP message is an announcement

    Identified Key Features Within IP-Based Flows Dataset

    Feature            Meaning
    proto              Transport layer protocol of the connection
    service            Identification of an application protocol
    orig_bytes         Originator payload bytes
    resp_bytes         Responder payload bytes
    history            Connection state history
    orig_pkts          Originator sent packets
    resp_pkts          Responder sent packets
    flow_duration      Length of the flow in seconds
    fwd_pkts_tot       Forward packets total
    bwd_pkts_tot       Backward packets total
    fwd_data_pkts_tot  Forward data packets total
    bwd_data_pkts_tot  Backward data packets total
    fwd_pkts_per_sec   Forward packets per second
    bwd_pkts_per_sec   Backward packets per second
    flow_pkts_per_sec  Flow packets per second
    fwd_header_size    Forward header bytes
    bwd_header_size    Backward header bytes
    fwd_pkts_payload   Forward payload bytes
    bwd_pkts_payload   Backward payload bytes
    flow_pkts_payload  Flow payload bytes
    fwd_iat            Forward inter-arrival time
    bwd_iat            Backward inter-arrival time
    flow_iat           Flow inter-arrival time
    active             Flow active duration

  7. Main benefits of Information and Communication Technology use by industry...

    • open.canada.ca
    • www150.statcan.gc.ca
    • +1 more
    csv, html, xml
    Updated Jan 17, 2023
    + more versions
    Cite
    Statistics Canada (2023). Main benefits of Information and Communication Technology use by industry and size of enterprise [Dataset]. https://open.canada.ca/data/en/dataset/fa806ccd-6735-42e0-ad81-1f19cbfde560
    Explore at:
    Available download formats: xml, csv, html
    Dataset updated
    Jan 17, 2023
    Dataset provided by
    Statistics Canada
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    Digital technology and Internet use, main benefits of Information and Communication Technology (ICT) use, by North American Industry Classification System (NAICS) and size of enterprise for Canada in 2012.

  8. 10000 Most Common Passwords

    • kaggle.com
    Updated Dec 8, 2021
    Cite
    Shivam Bansal (2021). 10000 Most Common Passwords [Dataset]. https://www.kaggle.com/shivamb/10000-most-common-passwords/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivam Bansal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    10000 Most Common Passwords

    If your password is on this list of the 10,000 most common passwords, you need a new password. A hacker can use or generate files like this, which may readily be compiled from breaches of sites such as Ashley Madison. Usually, passwords are not tried one by one against a system's secure server online; instead, a hacker might manage to gain access to a shadowed password file protected by a one-way hashing algorithm, then test each entry in a file like this to see whether its hashed form matches what the server has on record. The passwords may then be tried against any account online that can be linked to the first, to test for passwords reused on other sites.
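
    The offline comparison described above can be sketched from the defender's side, as an audit of whether a stored credential corresponds to an entry on a common-password list. This is a minimal sketch under stated assumptions: the salted SHA-256 scheme, the function names, and the three-entry list are all hypothetical, and real systems normally use a deliberately slow key-derivation function rather than a single hash.

```python
import hashlib
from typing import Iterable, Optional


def hash_password(password: str, salt: bytes) -> str:
    """Hypothetical storage scheme: salted SHA-256, chosen for brevity."""
    return hashlib.sha256(salt + password.encode("utf-8")).hexdigest()


def find_common_password(
    stored_hash: str, salt: bytes, candidates: Iterable[str]
) -> Optional[str]:
    """Return the candidate whose hash matches the stored hash, if any."""
    for candidate in candidates:
        if hash_password(candidate, salt) == stored_hash:
            return candidate
    return None


# Invented stand-in for the 10,000-entry list.
common_passwords = ["123456", "password", "letmein"]

salt = b"example-salt"
weak_record = hash_password("letmein", salt)
strong_record = hash_password("correct horse battery staple", salt)

weak_match = find_common_password(weak_record, salt, common_passwords)
strong_match = find_common_password(strong_record, salt, common_passwords)
print(weak_match)    # letmein
print(strong_match)  # None
```

    A match means the password offers no protection once the hash file leaks, which is why the description advises replacing any password that appears on the list.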

    Acknowledgements

    The dataset was procured by SecLists. SecLists is the security tester's companion. It's a collection of multiple types of lists used during security assessments, collected in one place. List types include usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more. The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.

  9. Most downloaded Zenodo datasets

    • kaggle.com
    Updated Feb 6, 2020
    Cite
    Chris Gorgolewski (2020). Most downloaded Zenodo datasets [Dataset]. https://www.kaggle.com/chrisfilo/most-downloaded-zenodo-datasets/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chris Gorgolewski
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Zenodo.org is a popular data repository hosted by CERN. There are tens of thousands of datasets in the repository, but not all of them are used to the same extent.

    Content

    This dataset includes names and links to the top 500 most downloaded datasets on Zenodo.

    Inspiration

    This dataset can be used to find datasets deposited on Zenodo that would benefit from additional exposure to the DS/ML community by uploading them to Kaggle.

  10. Custom dataset from any website on the Internet

    • datarade.ai
    Updated Sep 21, 2022
    Cite
    ScrapeLabs (2022). Custom dataset from any website on the Internet [Dataset]. https://datarade.ai/data-products/custom-dataset-from-any-website-on-the-internet-scrapelabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    ScrapeLabs
    Area covered
    Kazakhstan, Bulgaria, India, Tunisia, Lebanon, Aruba, Guinea-Bissau, Jordan, Turks and Caicos Islands, Argentina
    Description

    We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.

    Some common use cases our customers use the data for:

    • Data Analysis
    • Market Research
    • Price Monitoring
    • Sales Leads
    • Competitor Analysis
    • Recruitment

    We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.

    Receive data in any format you need: Excel, CSV, JSON, or any other.

  11. Data from: Dataset to "Easing the Conscience with OPC UA: An Internet-Wide...

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 30, 2020
    Cite
    (2020). Dataset to "Easing the Conscience with OPC UA: An Internet-Wide Study on Insecure Deployments" Dataset [Dataset]. https://paperswithcode.com/dataset/dataset-to-easing-the-conscience-with-opc-ua
    Explore at:
    Dataset updated
    Oct 30, 2020
    Description

    This is the dataset to "Easing the Conscience with OPC UA: An Internet-Wide Study on Insecure Deployments" [In ACM Internet Measurement Conference (IMC ’20)]. It contains our weekly scanning results between 2020-02-09 and 2020-08-31 compiled using our zgrab2 extensions, i.e., it contains an Internet-wide view on OPC UA deployments and their security configurations. To compile the dataset, we anonymized the output of zgrab2, i.e., we removed host and network identifiers from that dataset. More precisely, we mapped all IP addresses, fully qualified hostnames, and autonomous system IDs to numbers and removed certificates containing any identifiers. See the README file for more information. Using this dataset, we showed that 93% of Internet-facing OPC UA deployments have problematic security configurations, e.g., missing access control (on 24% of hosts), disabled security functionality (24%), or use of deprecated cryptographic primitives (25%). Furthermore, we discovered several hundred devices in multiple autonomous systems sharing the same security certificate, opening the door for impersonation attacks. Overall, with the analysis of this dataset we underpinned that secure protocols, in general, are no guarantee for secure deployments if they need to be configured correctly following regularly updated guidelines that account for basic primitives losing their security promises.
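
    The idea of mapping identifiers to numbers can be illustrated with a small sketch. It is not the authors' procedure, which is documented in the dataset's README; the class name, method name, and sample addresses (taken from the reserved documentation ranges) are hypothetical.

```python
from itertools import count


class Pseudonymizer:
    """Map each distinct identifier to a stable integer pseudonym."""

    def __init__(self):
        self._ids = {}
        self._next_id = count(1)

    def pseudonym(self, identifier: str) -> int:
        # The same input always yields the same number, so records that
        # belong to one host stay linkable without revealing the host.
        if identifier not in self._ids:
            self._ids[identifier] = next(self._next_id)
        return self._ids[identifier]


ip_map = Pseudonymizer()
first = ip_map.pseudonym("192.0.2.1")
second = ip_map.pseudonym("198.51.100.7")
repeat = ip_map.pseudonym("192.0.2.1")
print(first, second, repeat)  # 1 2 1
```

    Keeping the mapping consistent is what allows findings such as several hosts sharing one certificate to remain visible after the identifiers are removed.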

  12. ACS Internet Access by Education Variables - Boundaries

    • covid-hub.gio.georgia.gov
    • mapdirect-fdep.opendata.arcgis.com
    • +2 more
    Updated Dec 7, 2018
    Cite
    Esri (2018). ACS Internet Access by Education Variables - Boundaries [Dataset]. https://covid-hub.gio.georgia.gov/maps/62faad5b76b04b90adf47c020d7406ba
    Explore at:
    Dataset updated
    Dec 7, 2018
    Dataset authored and provided by
    Esri (http://esri.com/)
    Area covered
    Description

    This layer shows computer ownership and internet access by education. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percent of the population age 25+ who are high school graduates (includes equivalency) and have some college or associate's degree in households that have no computer. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

    Current Vintage: 2019-2023
    ACS Table(s): B28006
    Data downloaded from: Census Bureau's API for American Community Survey
    Date of API call: December 12, 2024
    National Figures: data.census.gov

    The United States Census Bureau's American Community Survey (ACS):

    • About the Survey
    • Geography & ACS
    • Technical Documentation
    • News & Updates

    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

    Data Note from the Census:

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Data Processing Notes:

    • This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
    • Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
    • The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
    • Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
    • Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
    • Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
    • Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
      • The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
      • Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
      • The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
      • The estimate is controlled. A statistical test for sampling variability is not appropriate.
      • The data for this geographic area cannot be displayed because the number of sample cases is too small.
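
    The Census data note on margins of error can be turned into a short worked example. This is only a sketch: the function name and the sample figures are invented, and the layer's own "_calc_" fields follow the ACS specifications rather than this snippet.

```python
from typing import Tuple


def confidence_bounds(estimate: float, margin_of_error: float) -> Tuple[float, float]:
    """Return the lower and upper bounds implied by a margin of error.

    With the 90 percent margin of error published alongside ACS estimates,
    the interval [estimate - MOE, estimate + MOE] is the 90 percent
    confidence interval described in the data note.
    """
    return (estimate - margin_of_error, estimate + margin_of_error)


# Invented example: 1,200 households estimated, margin of error of 150.
lower, upper = confidence_bounds(1200, 150)
print(lower, upper)  # 1050 1350
```

    Two estimates whose intervals overlap heavily should not be read as meaningfully different, which is why the margins of error are shipped with every estimate in the layer.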

  13. English Word Frequency

    • kaggle.com
    Updated Sep 6, 2017
    Cite
    Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/datasets/rtatman/english-word-frequency/discussion?sortBy=hot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 6, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    Description

    Context:

    How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
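
    The preprocessing step mentioned above, removing very frequent words, can be sketched with a frequency table. The counts below are invented and the function names are hypothetical; with the real dataset the table would be built from its word and count columns instead.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def most_frequent_words(counts: Dict[str, int], top_n: int) -> Set[str]:
    """Return the top_n words by count, to be treated as stopwords."""
    return {word for word, _ in Counter(counts).most_common(top_n)}


def drop_frequent(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


# Invented counts standing in for the real frequency list.
word_counts = {"the": 100, "of": 60, "internet": 5, "dataset": 3}

stopwords = most_frequent_words(word_counts, top_n=2)
filtered = drop_frequent(["The", "internet", "of", "dataset"], stopwords)
print(filtered)  # ['internet', 'dataset']
```

    The cutoff is a judgment call: a small top_n removes only function words, while a larger one starts to discard content words that a downstream task may need.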

    Content:

    This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.

    Acknowledgements:

    Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

    The code used to generate this dataset is distributed under the MIT License.

    Inspiration:

    • Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?
    • What differences are there between the very frequent words in this dataset and the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?

  14. Comparative Reviews Dataset's

    • kaggle.com
    zip
    Updated Jan 22, 2019
    Cite
    Umair Younis (2019). Comparative Reviews Dataset's [Dataset]. https://www.kaggle.com/umairyounis/comparative-reviews-datasets
    Explore at:
    Available download formats: zip (205233 bytes)
    Dataset updated
    Jan 22, 2019
    Authors
    Umair Younis
    Description

    Context

    This dataset is intended to help improve results from machine learning algorithms and other techniques used in data mining.

    Content

    The data comprises two columns: the first contains comparative reviews and the second contains their polarities.

    Acknowledgements

    I thank my supervisor, Dr Muhammad Zubair Asghar, Assistant Professor, ICIT, Gomal University, D.I. Khan (KPK). Without his guidance, I could not have accomplished this task.

    Inspiration

    Comparative opinion mining is becoming the most popular research area in the field of Data Mining. These three comparative reviews datasets will help the researchers who are working in the area of opinion mining and sentiment analysis.

  15. Attitudes towards the internet in China 2025

    • statista.com
    Updated Apr 11, 2025
    Cite
    Umair Bashir (2025). Attitudes towards the internet in China 2025 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Umair Bashir
    Description

    When asked about "Attitudes towards the internet", most Chinese respondents pick "It is important to me to have mobile internet access in any place" as an answer. 48 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.

  16. Data from: NIST Internet Time Service

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). NIST Internet Time Service [Dataset]. https://catalog.data.gov/dataset/nist-internet-time-service-ad780
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Distributes NIST estimate of official U.S. time over the Internet in real time, using Network Time Protocol (NTP) and other time data formats to automatically synchronize clocks in computers and network devices to official U.S. time as realized by NIST several billions of times per day. This official U.S. time is the NIST estimate of Coordinated Universal Time (UTC), and called UTC(NIST). The accuracy of UTC(NIST) as distributed through the Internet Time Service (ITS) is on the order of 0.001 seconds (one millisecond), although accuracy can vary depending on network conditions and other parameters. Note that unlike most traditional datasets, time is intrinsically a transient, ever-changing quantity. As soon as UTC(NIST) is transmitted to a client, that particular value of UTC(NIST) no longer reflects the current time, which is constantly changing. There is thus no static storage of any time data, apart from internal diagnostic information not released to the public which ensures that UTC(NIST) as disseminated through the Internet Time Service (ITS) is commensurate with the official UTC(NIST) realization within the uncertainties of the system. The vast majority of UTC(NIST) information distributed through ITS is provided freely, anonymously and automatically to the public. Any IP address can request UTC(NIST) through the ITS and the information is automatically and anonymously provided at no cost to the user. Full documentation of the ITS including all the source code is available to the public through the web site http://www.nist.gov/pml/div688/. NIST provides an authenticated version of ITS to a limited number of users (approximately 500 users near the end of calendar year 2015) who for various reasons want to ensure they are receiving UTC(NIST) without spoofing or interference with the information. This service uses public key encryption for the set of registered users to provide authenticated UTC(NIST).
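
    Because the service delivers time over NTP rather than as stored files, a client has to build a request and decode the reply itself. The sketch below shows only the packet handling, under the assumption of a standard 48-byte NTP header; the function names are hypothetical, no network call is made, and it is not the ITS source code referred to above.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_OFFSET = 2_208_988_800

def build_client_request(version=4):
    """Build a 48-byte NTP request: leap indicator 0, given version, mode 3 (client)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(47)

def transmit_time_unix(packet):
    """Decode the transmit timestamp (bytes 40-47) into Unix seconds as a float."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    # The fraction field counts units of 1 / 2**32 seconds.
    return seconds - NTP_TO_UNIX_OFFSET + fraction / 2**32
```

    In practice the request would be sent over UDP port 123 to a time server and the reply passed to `transmit_time_unix`. The sketch ignores network delay compensation and the NTP era rollover, both of which a real client needs to handle.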

  17. Dataset of DNS over HTTPS (DoH) Internet Servers

    • data.niaid.nih.gov
    • data.mendeley.com
    • +1 more
    Updated May 9, 2022
    + more versions
    Cite
    Joaquín Bogado (2022). Dataset of DNS over HTTPS (DoH) Internet Servers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6517360
    Explore at:
    Dataset updated
    May 9, 2022
    Dataset provided by
    Karel Hynek
    Dmitrii Vekshin
    Joaquín Bogado
    Armin Wasicek
    Sebastián García
    Tomas Cejka
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    The DoH Internet Servers dataset comprises a verified list of Internet servers offering DNS over HTTPS (DoH). This is an updated version of 10.17632/ny4m53g6bw.1. The list was created through the aggregation of previously existing, but incomplete, lists of DoH servers. The servers in this dataset went through a verification phase where it was confirmed they were active and working as advertised. The verification was done between May 1st, 2022, and May 4th, 2022. The dataset contains a total of 254 unique DoH servers, of which 136 are over IPv4 and 118 over IPv6. The DoH servers belong to 59 unique Autonomous Systems and are associated with a total of 106 unique domain names.

    The following public lists of existing DoH servers were used to create this dataset:

    https://developers.google.com/speed/public-dns/docs/doh/json

    https://blog.nightly.mozilla.org/2018/06/01/improving-dns-privacy-in-firefox/

    https://github.com/curl/curl/wiki/DNS-over-HTTPS

    https://help.keenetic.com/hc/en-us/articles/360007687159-DNS-over-TLS-and-DNS-over-HTTPS-proxy-servers-for-DNS-requests-encryption

    https://dnsprivacy.org/wiki/display/DP/DNS+Privacy+Public+Resolvers

    https://kb.adguard.com/en/general/dns-providers

    https://applied-privacy.net/services/dns/

    https://www.pacnog.org/pacnog24/presentations/DoT-DoH-DNS-Privacy.pdf

    https://www.privacytools.io/providers/dns/

    The verification of the DoH servers was performed using a custom-made Python script. The script is available at: https://github.com/stratosphereips/DoH-Research/tree/main/validation-script
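
    To give a sense of what checking a DoH server involves, the sketch below builds the request a client would send using the GET form defined in RFC 8484: a wire-format DNS query, base64url-encoded without padding, in the `dns` parameter. It is an illustration under those assumptions and not the validation script linked above; the function names and the server URL in the usage line are placeholders.

```python
import base64
import struct

def build_dns_query(name, qtype=1):
    """Build a wire-format DNS query: ID 0, recursion desired, one question, class IN."""
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # qtype 1 is an A record; the trailing 1 is class IN.
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question

def doh_get_url(server_url, name):
    """Return a GET URL carrying the query in the `dns` parameter (base64url, unpadded)."""
    encoded = base64.urlsafe_b64encode(build_dns_query(name)).rstrip(b"=")
    return f"{server_url}?dns={encoded.decode('ascii')}"

# Placeholder endpoint; a real check would iterate over the servers in the dataset.
url = doh_get_url("https://dnsserver.example.net/dns-query", "www.example.com")
```

    A verification pass would then fetch this URL with an `Accept: application/dns-message` header and confirm that the body parses as a DNS response. That network step is left out here.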

  18. A dataset of media releases (Twitter, News and Comments, Youtube, Facebook)...

    • data.niaid.nih.gov
    Updated Mar 29, 2021
    Cite
    Andrzej Jarynowski (2021). A dataset of media releases (Twitter, News and Comments, Youtube, Facebook) form Poland related to COVID-19 for open research [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3985567
    Explore at:
    Dataset updated
    Mar 29, 2021
    Dataset authored and provided by
    Andrzej Jarynowski
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Poland, YouTube
    Description

    Social behavior has a fundamental impact on the dynamics of infectious diseases (such as COVID-19), challenging public health mitigation strategies and possibly the political consensus. The widespread use of traditional and social media on the Internet provides us with an invaluable source of information on societal dynamics during pandemics. With this dataset, we aim to understand the mechanisms of COVID-19 epidemic-related social behavior in Poland by deploying methods of computational social science and digital epidemiology. We have collected and analyzed COVID-19 perception on the Polish-language Internet during 15.01-31.07 (06.08) and labeled the data quantitatively (Twitter, YouTube, Articles) and qualitatively (Facebook, Articles and Comments of Article) using an infomediological approach. We have:

    • manually labelled 1,449 articles / Facebook posts from Lower Silesia (facebook_articles_lower_silesia.zip) and 111 texts from outside this region;

    • manually labelled the 1,000 most popular tweets (twits_annotated.xlsx) with the categories is_fake (categorical and numeric), topic, and sentiment;

    • extracted 57,306 representative articles (articles_till_06_08.zip) in Polish using the Eventregistry.org tool, with language Polish and topic "Coronavirus" in the article body;

    • extracted 1,015,199 tweets (tweets_till_31_07_users.zip and tweets_till_31_07_text.zip) from #Koronawirus in Polish using the Twitter API;

    • collected 1,574 videos (youtube_comments_till_31_07.zip and youtube_movie.csv) with the keyword Koronawirus on YouTube and 247,575 comments on them using the Google API;

    • supplemented the media observations with an analysis of 244 social empirical studies till 25.05 on COVID-19 in Poland (empirical_social_studies.csv).

    Reports, analyses, and coding books can be found in Polish at: http://www.infodemia-koronawirusa.pl

    Main report (in Polish) https://depot.ceon.pl/handle/123456789/19215

  19. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
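
    The two steps described here, pulling Wikipedia URLs out of free text and hashing identifiers with SHA-256, can be sketched as below. The regular expression and the function names are illustrative assumptions; the authors' actual patterns, and whether any salt was used before hashing, are not stated in the description.

```python
import hashlib
import re

# Illustrative pattern only; the expressions used by the authors are not given.
# Captures the language subdomain and the article title, allowing the mobile "m." host.
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]>]+)",
    re.IGNORECASE,
)

def extract_wikipedia_links(text):
    """Return (language_subdomain, article_title) pairs found in the text."""
    return [(m.group(1).lower(), m.group(2)) for m in WIKI_URL.finditer(text)]

def anonymize_id(reddit_id):
    """Hash an identifier with SHA-256 and return the 64-character hex digest."""
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

sample = (
    "See https://en.wikipedia.org/wiki/Network_Time_Protocol "
    "and (https://de.m.wikipedia.org/wiki/Berlin) for details."
)
links = extract_wikipedia_links(sample)
```

    A pattern this simple will keep trailing punctuation such as a full stop directly after a URL, which is one reason the description speaks of extensive processing rather than a single expression.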

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost.
    post_id | TEXT | Unique identifier for the Reddit post.
    created_at | TIMESTAMP | The timestamp when the post was created.
    updated_at | TIMESTAMP | The timestamp when the post was last updated.
    language_code | TEXT | The language code of the post.
    score | INTEGER | The score (upvotes minus downvotes) of the post.
    upvote_ratio | REAL | The ratio of upvotes to total votes.
    gildings | INTEGER | Number of awards (gildings) received by the post.
    num_comments | INTEGER | Number of comments on the post.

    Table: comments

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    post_id | TEXT | The ID of the Reddit post the comment belongs to.
    parent_id | TEXT | The ID of the parent comment (if a reply).
    comment_id | TEXT | Unique identifier for the comment.
    created_at | TIMESTAMP | The timestamp when the comment was created.
    last_modified_at | TIMESTAMP | The timestamp when the comment was last modified.
    score | INTEGER | The score (upvotes minus downvotes) of the comment.
    upvote_ratio | REAL | The ratio of upvotes to total votes for the comment.
    gilded | INTEGER | Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name | Type | Description
    post_id | TEXT | Unique identifier for the Reddit post.
    end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the Reddit post.
    final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final URL.
    final_url | TEXT | The final URL after redirections.
    redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0).
    in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    Column Name | Type | Description
    comment_id | TEXT | Unique identifier for the Reddit comment.
    end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the comment.
    final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final

  20. Random sample of Common Crawl domains from 2021

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
    Explore at:
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HiHarshSinghal
    Description

    Context

    The Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

    Content

    I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/

    Acknowledgements

    Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

    Inspiration

    My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).
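
    As a small illustration of the kind of string distance mentioned here, the sketch below computes Levenshtein edit distance in plain Python and uses it to flag look-alike domains. It stands in for the idea only: the author's approach used a Postgres extension, and the function names and example domains below are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings using the two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def near_matches(target, candidates, max_distance=1):
    """Return candidates that differ from target but lie within max_distance edits."""
    return [
        c for c in candidates
        if c != target and levenshtein(target, c) <= max_distance
    ]

# Made-up domains for illustration; none are taken from the dataset.
suspects = near_matches(
    "example.com",
    ["exampel.com", "examp1e.com", "example.com", "other.org"],
)
```

    Comparing every pair of domains this way is quadratic, so at Common Crawl scale some form of indexing or blocking would be needed before computing distances.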

    I am also interested in identifying fraudulent domains and understanding malicious URL patterns.

Cite
Statista Research Department (2025). Number of internet users worldwide 2014-2029 [Dataset]. https://www.statista.com/topics/1145/internet-usage-worldwide/

Number of internet users worldwide 2014-2029

Explore at:
304 scholarly articles cite this dataset (View in Google Scholar)
