10 datasets found
  1. Big Data Statistics By Market Size, Usage, Adoption and Facts (2025)

    • sci-tech-today.com
    Updated Nov 3, 2025
    Cite
    Sci-Tech Today (2025). Big Data Statistics By Market Size, Usage, Adoption and Facts (2025) [Dataset]. https://www.sci-tech-today.com/stats/big-data-statistics/
    Explore at:
    Dataset updated
    Nov 3, 2025
    Dataset authored and provided by
    Sci-Tech Today
    License

    https://www.sci-tech-today.com/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Introduction

    Big Data Statistics: By 2025, the world is expected to generate a staggering 181 zettabytes of data, an average annual growth rate of 23.13%, with about 2.5 quintillion bytes created every day. That works out to an influx of roughly 29 terabytes every second.
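    A quick arithmetic check of that per-second figure (a minimal Python sketch; the 2.5-quintillion-bytes-per-day value is taken from the paragraph above):

    ```python
    # Convert 2.5 quintillion bytes per day into terabytes per second.
    bytes_per_day = 2.5e18            # 2.5 quintillion bytes, as stated above
    seconds_per_day = 24 * 60 * 60    # 86,400 seconds in a day
    tb_per_second = bytes_per_day / seconds_per_day / 1e12  # decimal terabytes
    print(f"{tb_per_second:.1f} TB per second")  # ~28.9, i.e. roughly 29 TB/s
    ```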

    While over 97% of businesses are now investing in Big Data, only about 40% are truly effective at leveraging analytics. The global Big Data analytics market, valued at approximately $348.21 billion in 2024, is on a trajectory to reach over $961.89 billion by 2032, exhibiting a robust CAGR of 13.5%.

    The dominance of unstructured data, which constitutes about 90% of all global data, underscores the complex statistical challenge. Industries are being fundamentally changed by Big Data, from healthcare, which could save up to $300 billion annually in the US alone through data initiatives, to entertainment, where Netflix saves an estimated $1 billion per year using data-driven recommendation algorithms. So, let's dive deeper into this stats article covering everything about big data, from its current insights to future forecasts. Let's get started.

  2. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2 more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one, which acts as the core of the dataset, is the page file; after it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with those entities and between pages (the page_category, page_url, page_pub and page_link files).
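    As a rough illustration of that relational structure, here is a minimal sketch of joining pages to their categories with pandas. The file names and key columns (page_id, category_id) are assumptions inferred from the description above; the Dataset_summary document has the actual schema.

    ```python
    import pandas as pd

    # Entity tables (file and column names are assumed; see Dataset_summary).
    pages = pd.read_csv("page.tsv", sep="\t")
    categories = pd.read_csv("category.tsv", sep="\t")

    # Intermediate table connecting pages to categories.
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Join the intermediate table to both entity tables.
    pages_with_categories = (
        page_category
        .merge(pages, on="page_id")
        .merge(categories, on="category_id")
    )
    print(pages_with_categories.head())
    ```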

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  3. English Wikipedia Articles 2017-08-20 SQLite

    • kaggle.com
    zip
    Updated Nov 27, 2018
    Cite
    Jason King (2018). English Wikipedia Articles 2017-08-20 SQLite [Dataset]. https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-sqlite
    Explore at:
    Available download formats: zip (7,139,277,542 bytes)
    Dataset updated
    Nov 27, 2018
    Authors
    Jason King
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Context

    This dataset was originally intended for the Data Science Nashville November 2018 meetup: Introduction to Gensim. I wanted to provide a large text corpus in a format often seen in industry, so I pulled the English Wikipedia dump from 2017-08-20, extracted the text using Gensim's excellent segment_wiki script, and finally wrote some custom code to populate a SQLite database.

    The dataset encompasses nearly 5 million articles, with more than 23 million individual sections. Only article text is included, all links have been stripped and no metadata (e.g., behind the scene discussion or version history) is included. Even then, I just barely met the file size limit, coming in at just below 20 GB.

    Content

    I wanted to keep things simple, so everything is in a single table: articles. There is an index on article_id.

    • article_id: Int, identifier for each unique title
    • article_title: Str, article titles
    • section_title: Str, subsection title from each article
    • section_text: Str, text from each subsection
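    A minimal query sketch against the table described above (the database file name is an assumption; use the file you actually downloaded):

    ```python
    import sqlite3

    conn = sqlite3.connect("enwiki-20170820.db")  # assumed file name
    cur = conn.cursor()

    # Fetch all sections of one article via the indexed article_id column.
    cur.execute(
        "SELECT section_title, section_text FROM articles WHERE article_id = ?",
        (42,),
    )
    for section_title, section_text in cur.fetchall():
        print(section_title, "-", len(section_text), "characters")

    conn.close()
    ```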

    I've also pre-trained some simple topic models and word embeddings based on this dataset. At time of upload, the file size limit is 20 GB, so I created another dataset that contains the pre-trained gensim models: English Wikipedia Articles 2017-08-20 Models.

    Acknowledgements

    As per the Wikimedia Foundation's requirements, this dataset is provided under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Permission is granted to copy, distribute, and/or modify Wikipedia's text under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and, unless otherwise noted, the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

    The banner image is provided by Lysander Yuen on Unsplash.

  4. World Internet Usage Data (2023 Updated)

    • kaggle.com
    zip
    Updated Dec 21, 2024
    Cite
    Kanchana1990 (2024). World Internet Usage Data (2023 Updated) [Dataset]. https://www.kaggle.com/datasets/kanchana1990/world-internet-usage-data-2023-updated
    Explore at:
    Available download formats: zip (3,946 bytes)
    Dataset updated
    Dec 21, 2024
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides a comprehensive overview of internet usage across countries as of 2024. It includes data on the percentage of the population using the internet, sourced from multiple organizations such as the World Bank (WB), International Telecommunication Union (ITU), and the CIA. The dataset covers all United Nations member states, excluding North Korea, and provides insights into internet penetration rates, user counts, and trends over recent years. The data is derived from household surveys and internet subscription statistics, offering a reliable snapshot of global digital connectivity.

    Data Science Applications

    This dataset can be used in various data science applications, including:

    • Digital Divide Analysis: Evaluate disparities in internet access between developed and developing nations.
    • Trend Analysis: Study the growth of internet penetration over time across different regions.
    • Policy Recommendations: Assist policymakers in identifying underserved areas and strategizing for improved connectivity.
    • Market Research: Help businesses identify potential markets for digital products or services.
    • Correlation Studies: Analyze relationships between internet penetration and socioeconomic indicators like GDP, education levels, or urbanization.

    Column Descriptors

    The dataset contains the following columns (a minimal loading sketch follows the list):

    1. Location: Country or region name.
    2. Rate (WB): Percentage of the population using the internet (World Bank data).
    3. Year (WB): Year corresponding to the World Bank data.
    4. Rate (ITU): Percentage of the population using the internet (ITU data).
    5. Year (ITU): Year corresponding to the ITU data.
    6. Users (CIA): Estimated number of internet users in absolute terms (CIA data).
    7. Year (CIA): Year corresponding to the CIA data.
    8. Notes: Additional notes or observations about specific entries.
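    A minimal loading sketch, assuming the data is exported as a CSV with the column names listed above (the file name is an assumption):

    ```python
    import pandas as pd

    df = pd.read_csv("world_internet_usage.csv")  # assumed file name

    # Countries with the largest disagreement between World Bank and ITU rates.
    df["rate_gap"] = (df["Rate (WB)"] - df["Rate (ITU)"]).abs()
    print(
        df.sort_values("rate_gap", ascending=False)
          [["Location", "Rate (WB)", "Rate (ITU)", "rate_gap"]]
          .head(10)
    )
    ```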

    Ethically Mined Data

    The data has been sourced from publicly available and reputable organizations such as the World Bank, ITU, and CIA. These sources ensure transparency and ethical collection methods through household surveys and official statistics. The dataset excludes North Korea due to limited reliable information on its internet usage.

    Acknowledgements

    This dataset is based on information compiled from:

    • World Bank
    • International Telecommunication Union
    • CIA World Factbook
    • Wikipedia's "List of countries by number of Internet users" page

    Special thanks to these organizations for providing open access to this valuable information, enabling deeper insights into global digital connectivity trends.

    Citations: [1] https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

  5. MEDICINA-corpus_reducido+MIR+wiki

    • kaggle.com
    Updated May 8, 2023
    Cite
    Manuel González Martínez (2023). MEDICINA-corpus_reducido+MIR+wiki [Dataset]. https://www.kaggle.com/datasets/manuelgonzlezmartnez/medicina-corpus-reducido-mir-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manuel González Martínez
    Description

    This dataset contains the tokenized version of a corpus combining 60% of the Spanish OSCAR corpus, Wikipedia data from multiple countries, and medicine books. Because the full corpus is so large, I had to cut down the OSCAR portion to make it a bit smaller; for the same reason I uploaded the tokenized version, since if you want or need to work with this dataset inside Kaggle, you do not have enough space to tokenize it yourself.

    I have also uploaded the code used to tokenize the dataset.

    If you want me to upload the entire dataset divided into 4 parts, ask for it. :)

  6. Data Science Fields Salary Categorization

    • kaggle.com
    Updated Sep 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aman Chauhan (2022). Data Science Fields Salary Categorization [Dataset]. https://www.kaggle.com/datasets/whenamancodes/data-science-fields-salary-categorization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Chauhan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Data Science Fields Salary Categorization dataset contains 9 columns (a minimal decoding sketch follows the table):

    | Dimension | Description |
    | --- | --- |
    | Working Year | The year the salary was paid (2020, 2021, 2022). |
    | Designation | The role worked in during the year. |
    | Experience | The experience level in the job during the year. [EN - Entry level / Junior, MI - Mid level / Intermediate, SE - Senior level / Expert, EX - Executive level / Director] |
    | Employment Status | The type of employment for the role. [PT - Part time, FT - Full time, CT - Contract, FL - Freelance] |
    | Salary In Rupees | The total gross salary amount paid. |
    | Employee Location | Employee's primary country of residence during the work year, as an ISO 3166 country code (see the ISO 3166 link below). |
    | Company Location | The country of the employer's main office or contracting branch. |
    | Company Size | The median number of people that worked for the company during the year. [S (small) - fewer than 50 employees, M (medium) - 50 to 250 employees, L (large) - more than 250 employees] |
    | Remote Working Ratio | The overall amount of work done remotely. [0 - no remote work (less than 20%), 50 - partially remote, 100 - fully remote (more than 80%)] |
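    A minimal decoding sketch for the coded columns above; the CSV file name and the exact column spellings are assumptions, and the salary column may need cleaning if it is stored as formatted text:

    ```python
    import pandas as pd

    experience_labels = {
        "EN": "Entry level / Junior",
        "MI": "Mid level / Intermediate",
        "SE": "Senior level / Expert",
        "EX": "Executive level / Director",
    }

    df = pd.read_csv("data_science_salaries.csv")  # assumed file name

    # Map the coded experience levels to readable labels.
    df["Experience (label)"] = df["Experience"].map(experience_labels)

    # Salary may be stored as text with thousands separators; coerce to numbers.
    df["Salary In Rupees"] = pd.to_numeric(
        df["Salary In Rupees"].astype(str).str.replace(",", ""), errors="coerce"
    )

    print(df.groupby("Experience (label)")["Salary In Rupees"].median())
    ```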

    I collected the data from ai-jobs.net and modified it for my own convenience. Original data source: https://salaries.ai-jobs.net/download/ ISO 3166 country codes: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes

  7. Kaggle Global Trends

    • kaggle.com
    zip
    Updated Apr 18, 2021
    Cite
    Tensor Girl (2021). Kaggle Global Trends [Dataset]. https://www.kaggle.com/usharengaraju/kaggle-global-trends
    Explore at:
    Available download formats: zip (9,848 bytes)
    Dataset updated
    Apr 18, 2021
    Authors
    Tensor Girl
    Description

    Context

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Source : https://en.wikipedia.org/wiki/Kaggle

    Content

    The dataset contains trends over time and by region for News, Web, and YouTube searches for "Kaggle".

    Acknowledgements

    The dataset is generated from Google Trends.

  8. Kaggle Tweets

    • kaggle.com
    zip
    Updated Apr 18, 2021
    Cite
    Tensor Girl (2021). Kaggle Tweets [Dataset]. https://www.kaggle.com/usharengaraju/kaggle-tweets-2010-2021
    Explore at:
    Available download formats: zip (38,796,821 bytes)
    Dataset updated
    Apr 18, 2021
    Authors
    Tensor Girl
    Description

    Context

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Source : https://en.wikipedia.org/wiki/Kaggle

    Content

    The dataset contains tweets regarding "Kaggle" from verified Twitter accounts.

    Acknowledgements

    "Kaggle" Tweets are scraped using Twint.

    Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.

    https://pypi.org/project/twint/
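    For reference, a Twint scrape similar to the one described might look like the sketch below. This is a minimal sketch only: Twint scrapes Twitter's public web pages rather than the API, so it may no longer work against the current site, and the date range shown is an assumption.

    ```python
    import twint

    c = twint.Config()
    c.Search = "Kaggle"            # search term
    c.Verified = True              # restrict to verified accounts
    c.Since = "2010-01-01"         # assumed date range
    c.Until = "2021-04-18"
    c.Store_csv = True             # write results to CSV
    c.Output = "kaggle_tweets.csv"

    twint.run.Search(c)
    ```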

  9. Communication Graphs

    • kaggle.com
    zip
    Updated Nov 15, 2021
    Cite
    Subhajit Sahu (2021). Communication Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communication/discussion
    Explore at:
    Available download formats: zip (66,715,371 bytes)
    Dataset updated
    Nov 15, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    email-EuAll: EU email communication network

    The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.
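    A minimal sketch of building that directed graph with networkx, assuming the SNAP-style edge-list format (whitespace-separated "FromNodeId ToNodeId" lines, with '#' comment lines) and an assumed file name:

    ```python
    import networkx as nx

    # Load the edge list as a directed graph; node IDs are integers.
    G = nx.read_edgelist(
        "email-EuAll.txt",        # assumed file name
        comments="#",
        create_using=nx.DiGraph(),
        nodetype=int,
    )

    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

    # Addresses that sent mail to the most distinct recipients.
    top_senders = sorted(G.out_degree(), key=lambda kv: kv[1], reverse=True)[:5]
    print("Top senders by out-degree:", top_senders)
    ```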

    email-Enron: Enron email network

    The Enron email communication network covers all the email communication within a dataset of around half a million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and, if an address i sent at least one email to address j, the graph contains an undirected edge between i and j. Note that non-Enron email addresses act as sinks and sources in the network, as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    wiki-Talk: Wikipedia Talk network

    Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page, that she and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3 2008) we extracted all user talk page changes and created a network.

    The network contains all the users and discussion from the inception of Wikipedia till January 2008. Nodes in the network represent Wikipedia users and a directed edge from node i to node j represents that user i at least once edited a talk page of user j.

    comm-f2f-Resistance: Dynamic Face-to-Face Interaction Networks

    The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.

    The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).

    Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed, undirected, or multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#email

  10. Protein Protein Interactions Networks

    • kaggle.com
    zip
    Updated Apr 14, 2021
    Cite
    Alexander Chervov (2021). Protein Protein Interactions Networks [Dataset]. https://www.kaggle.com/alexandervc/protein-protein-interactions
    Explore at:
    Available download formats: zip (126,896,183 bytes)
    Dataset updated
    Apr 14, 2021
    Authors
    Alexander Chervov
    Description

    Context

    Data: protein–protein interaction networks (see https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction).

    These are typical biological data discussed in many graph data science studies.

    Many studies try to produce biological insights by applying graph data science analysis to these networks.

    Content

    Some files are downloaded from the BioGRID database: https://downloads.thebiogrid.org/BioGRID
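    A minimal sketch of turning a BioGRID tab-delimited export into a networkx graph; the file name and the interactor column names are assumptions, so check the header of the file you actually download:

    ```python
    import pandas as pd
    import networkx as nx

    df = pd.read_csv("BIOGRID-ALL.tab3.txt", sep="\t", low_memory=False)  # assumed file name

    # Build an undirected protein-protein interaction graph from interactor pairs.
    G = nx.from_pandas_edgelist(
        df,
        source="Official Symbol Interactor A",   # assumed column name
        target="Official Symbol Interactor B",   # assumed column name
    )
    print(G.number_of_nodes(), "proteins,", G.number_of_edges(), "interactions")
    ```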

