10 datasets found
  1. Big Data Statistics By Market Size, Usage, Adoption and Facts (2025)

    • sci-tech-today.com
    Updated Nov 3, 2025
    Cite
    Sci-Tech Today (2025). Big Data Statistics By Market Size, Usage, Adoption and Facts (2025) [Dataset]. https://www.sci-tech-today.com/stats/big-data-statistics/
    Explore at:
    Dataset updated
    Nov 3, 2025
    Dataset authored and provided by
    Sci-Tech Today
    License

    https://www.sci-tech-today.com/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Introduction

    Big Data Statistics: By 2025, the world is expected to generate a staggering 181 zettabytes of data, an average annual growth rate of 23.13%, with about 2.5 quintillion bytes created every day. That works out to an influx of roughly 29 terabytes every second.
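    A quick arithmetic check of that per-second figure (a minimal Python sketch; the 2.5-quintillion-bytes-per-day value is taken from the paragraph above):

    ```python
    # Convert 2.5 quintillion bytes per day into terabytes per second.
    bytes_per_day = 2.5e18            # 2.5 quintillion bytes, as stated above
    seconds_per_day = 24 * 60 * 60    # 86,400 seconds in a day
    tb_per_second = bytes_per_day / seconds_per_day / 1e12  # decimal terabytes
    print(f"{tb_per_second:.1f} TB per second")  # ~28.9, i.e. roughly 29 TB/s
    ```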

    While over 97% of businesses are now investing in Big Data, only about 40% are truly effective at leveraging analytics. The global Big Data analytics market, valued at approximately $348.21 billion in 2024, is on a trajectory to reach over $961.89 billion by 2032, exhibiting a robust CAGR of 13.5%.

    The dominance of unstructured data, which constitutes about 90% of all global data, underscores the complex statistical challenge. Industries are being fundamentally changed by Big Data, from healthcare, which could save up to $300 billion annually in the US alone through data initiatives, to entertainment, where Netflix saves an estimated $1 billion per year using data-driven recommendation algorithms. So, let's dive deeper into this stats article covering everything about big data, from its current insights to future forecasts. Let's get started.

  2. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2 more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one, which acts as the core of the dataset, is the page file; after it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with those entities and between pages (the page_category, page_url, page_pub and page_link files).
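    As a rough illustration of that relational structure, here is a minimal sketch of joining pages to their categories with pandas. The file names and key columns (page_id, category_id) are assumptions inferred from the description above; the Dataset_summary document has the actual schema.

    ```python
    import pandas as pd

    # Entity tables (file and column names are assumed; see Dataset_summary).
    pages = pd.read_csv("page.tsv", sep="\t")
    categories = pd.read_csv("category.tsv", sep="\t")

    # Intermediate table connecting pages to categories.
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Join the intermediate table to both entity tables.
    pages_with_categories = (
        page_category
        .merge(pages, on="page_id")
        .merge(categories, on="category_id")
    )
    print(pages_with_categories.head())
    ```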

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  3. English Wikipedia Articles 2017-08-20 SQLite

    • kaggle.com
    zip
    Updated Nov 27, 2018
    Cite
    Jason King (2018). English Wikipedia Articles 2017-08-20 SQLite [Dataset]. https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-sqlite
    Explore at:
    Available download formats: zip (7,139,277,542 bytes)
    Dataset updated
    Nov 27, 2018
    Authors
    Jason King
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Context

    This dataset was originally intended for the Data Science Nashville November 2018 meetup: Introduction to Gensim. I wanted to provide a large text corpus in a format often seen in industry, so I pulled the English Wikipedia dump from 2017-08-20, extracted the text using Gensim's excellent segment_wiki script, and finally wrote some custom code to populate a SQLite database.

    The dataset encompasses nearly 5 million articles, with more than 23 million individual sections. Only article text is included, all links have been stripped and no metadata (e.g., behind the scene discussion or version history) is included. Even then, I just barely met the file size limit, coming in at just below 20 GB.

    Content

    I wanted to keep things simple, so everything is in a single table: articles. There is an index on article_id.

    • article_id: Int, identifier for each unique title
    • article_title: Str, article titles
    • section_title: Str, subsection title from each article
    • section_text: Str, text from each subsection
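    A minimal query sketch against the table described above (the database file name is an assumption; use the file you actually downloaded):

    ```python
    import sqlite3

    conn = sqlite3.connect("enwiki-20170820.db")  # assumed file name
    cur = conn.cursor()

    # Fetch all sections of one article via the indexed article_id column.
    cur.execute(
        "SELECT section_title, section_text FROM articles WHERE article_id = ?",
        (42,),
    )
    for section_title, section_text in cur.fetchall():
        print(section_title, "-", len(section_text), "characters")

    conn.close()
    ```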

    I've also pre-trained some simple topic models and word embeddings based on this dataset. At time of upload, the file size limit is 20 GB, so I created another dataset that contains the pre-trained gensim models: English Wikipedia Articles 2017-08-20 Models.

    Acknowledgements

    As per the Wikimedia Foundation's requirements, this dataset is provided under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Permission is granted to copy, distribute, and/or modify Wikipedia's text under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and, unless otherwise noted, the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

    The banner image is provided by Lysander Yuen on Unsplash.

  4. World Internet Usage Data (2023 Updated)

    • kaggle.com
    zip
    Updated Dec 21, 2024
    Cite
    Kanchana1990 (2024). World Internet Usage Data (2023 Updated) [Dataset]. https://www.kaggle.com/datasets/kanchana1990/world-internet-usage-data-2023-updated
    Explore at:
    Available download formats: zip (3,946 bytes)
    Dataset updated
    Dec 21, 2024
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides a comprehensive overview of internet usage across countries as of 2024. It includes data on the percentage of the population using the internet, sourced from multiple organizations such as the World Bank (WB), International Telecommunication Union (ITU), and the CIA. The dataset covers all United Nations member states, excluding North Korea, and provides insights into internet penetration rates, user counts, and trends over recent years. The data is derived from household surveys and internet subscription statistics, offering a reliable snapshot of global digital connectivity.

    Data Science Applications

    This dataset can be used in various data science applications, including:

    • Digital Divide Analysis: Evaluate disparities in internet access between developed and developing nations.
    • Trend Analysis: Study the growth of internet penetration over time across different regions.
    • Policy Recommendations: Assist policymakers in identifying underserved areas and strategizing for improved connectivity.
    • Market Research: Help businesses identify potential markets for digital products or services.
    • Correlation Studies: Analyze relationships between internet penetration and socioeconomic indicators like GDP, education levels, or urbanization.

    Column Descriptors

    The dataset contains the following columns (a minimal loading sketch follows the list):

    1. Location: Country or region name.
    2. Rate (WB): Percentage of the population using the internet (World Bank data).
    3. Year (WB): Year corresponding to the World Bank data.
    4. Rate (ITU): Percentage of the population using the internet (ITU data).
    5. Year (ITU): Year corresponding to the ITU data.
    6. Users (CIA): Estimated number of internet users in absolute terms (CIA data).
    7. Year (CIA): Year corresponding to the CIA data.
    8. Notes: Additional notes or observations about specific entries.
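    A minimal loading sketch, assuming the data is exported as a CSV with the column names listed above (the file name is an assumption):

    ```python
    import pandas as pd

    df = pd.read_csv("world_internet_usage.csv")  # assumed file name

    # Countries with the largest disagreement between World Bank and ITU rates.
    df["rate_gap"] = (df["Rate (WB)"] - df["Rate (ITU)"]).abs()
    print(
        df.sort_values("rate_gap", ascending=False)
          [["Location", "Rate (WB)", "Rate (ITU)", "rate_gap"]]
          .head(10)
    )
    ```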

    Ethically Mined Data

    The data has been sourced from publicly available and reputable organizations such as the World Bank, ITU, and CIA. These sources ensure transparency and ethical collection methods through household surveys and official statistics. The dataset excludes North Korea due to limited reliable information on its internet usage.

    Acknowledgements

    This dataset is based on information compiled from:

    • World Bank
    • International Telecommunication Union
    • CIA World Factbook
    • Wikipedia's "List of countries by number of Internet users" page

    Special thanks to these organizations for providing open access to this valuable information, enabling deeper insights into global digital connectivity trends.

    Citations: [1] https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

  5. MEDICINA-corpus_reducido+MIR+wiki

    • kaggle.com
    Updated May 8, 2023
    Cite
    Manuel González Martínez (2023). MEDICINA-corpus_reducido+MIR+wiki [Dataset]. https://www.kaggle.com/datasets/manuelgonzlezmartnez/medicina-corpus-reducido-mir-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manuel González Martínez
    Description

    This dataset contains the tokenized version of a corpus combining 60% of the Spanish OSCAR corpus, Wikipedia data from multiple countries, and medicine books. Because the full corpus is so large, I had to cut down the OSCAR portion to make it a bit smaller; for the same reason I uploaded the tokenized version, since if you want or need to work with this dataset inside Kaggle, you do not have enough space to tokenize it yourself.

    I have also uploaded the code used to tokenize the dataset.

    If you want me to upload the entire dataset divided into 4 parts, ask for it. :)

  6. Data Science Fields Salary Categorization

    • kaggle.com
    Updated Sep 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aman Chauhan (2022). Data Science Fields Salary Categorization [Dataset]. https://www.kaggle.com/datasets/whenamancodes/data-science-fields-salary-categorization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Chauhan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Data Science Fields Salary Categorization dataset contains 9 columns (a minimal decoding sketch follows the table):

    | Dimension | Description |
    | --- | --- |
    | Working Year | The year the salary was paid (2020, 2021, 2022). |
    | Designation | The role worked in during the year. |
    | Experience | The experience level in the job during the year. [EN - Entry level / Junior, MI - Mid level / Intermediate, SE - Senior level / Expert, EX - Executive level / Director] |
    | Employment Status | The type of employment for the role. [PT - Part time, FT - Full time, CT - Contract, FL - Freelance] |
    | Salary In Rupees | The total gross salary amount paid. |
    | Employee Location | Employee's primary country of residence during the work year, as an ISO 3166 country code (see the ISO 3166 link below). |
    | Company Location | The country of the employer's main office or contracting branch. |
    | Company Size | The median number of people that worked for the company during the year. [S (small) - fewer than 50 employees, M (medium) - 50 to 250 employees, L (large) - more than 250 employees] |
    | Remote Working Ratio | The overall amount of work done remotely. [0 - no remote work (less than 20%), 50 - partially remote, 100 - fully remote (more than 80%)] |
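    A minimal decoding sketch for the coded columns above; the CSV file name and the exact column spellings are assumptions, and the salary column may need cleaning if it is stored as formatted text:

    ```python
    import pandas as pd

    experience_labels = {
        "EN": "Entry level / Junior",
        "MI": "Mid level / Intermediate",
        "SE": "Senior level / Expert",
        "EX": "Executive level / Director",
    }

    df = pd.read_csv("data_science_salaries.csv")  # assumed file name

    # Map the coded experience levels to readable labels.
    df["Experience (label)"] = df["Experience"].map(experience_labels)

    # Salary may be stored as text with thousands separators; coerce to numbers.
    df["Salary In Rupees"] = pd.to_numeric(
        df["Salary In Rupees"].astype(str).str.replace(",", ""), errors="coerce"
    )

    print(df.groupby("Experience (label)")["Salary In Rupees"].median())
    ```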

    I collected the data from ai-jobs.net and modified it for my own convenience. Original data source: https://salaries.ai-jobs.net/download/ ISO 3166 country codes: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes

  7. Kaggle Global Trends

    • kaggle.com
    zip
    Updated Apr 18, 2021
    Cite
    Tensor Girl (2021). Kaggle Global Trends [Dataset]. https://www.kaggle.com/usharengaraju/kaggle-global-trends
    Explore at:
    Available download formats: zip (9,848 bytes)
    Dataset updated
    Apr 18, 2021
    Authors
    Tensor Girl
    Description

    Context

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Source : https://en.wikipedia.org/wiki/Kaggle

    Content

    The dataset contains trends over time and by region for News, Web, and YouTube searches for "Kaggle".

    Acknowledgements

    The dataset is generated from Google Trends.

  8. Kaggle Tweets

    • kaggle.com
    zip
    Updated Apr 18, 2021
    Cite
    Tensor Girl (2021). Kaggle Tweets [Dataset]. https://www.kaggle.com/usharengaraju/kaggle-tweets-2010-2021
    Explore at:
    Available download formats: zip (38,796,821 bytes)
    Dataset updated
    Apr 18, 2021
    Authors
    Tensor Girl
    Description

    Context

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Source : https://en.wikipedia.org/wiki/Kaggle

    Content

    The dataset contains tweets regarding "Kaggle" from verified Twitter accounts.

    Acknowledgements

    "Kaggle" Tweets are scraped using Twint.

    Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.

    https://pypi.org/project/twint/
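    For reference, a Twint scrape similar to the one described might look like the sketch below. This is a minimal sketch only: Twint scrapes Twitter's public web pages rather than the API, so it may no longer work against the current site, and the date range shown is an assumption.

    ```python
    import twint

    c = twint.Config()
    c.Search = "Kaggle"            # search term
    c.Verified = True              # restrict to verified accounts
    c.Since = "2010-01-01"         # assumed date range
    c.Until = "2021-04-18"
    c.Store_csv = True             # write results to CSV
    c.Output = "kaggle_tweets.csv"

    twint.run.Search(c)
    ```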

  9. Communication Graphs

    • kaggle.com
    zip
    Updated Nov 15, 2021
    Cite
    Subhajit Sahu (2021). Communication Graphs [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-communication/discussion
    Explore at:
    Available download formats: zip (66,715,371 bytes)
    Dataset updated
    Nov 15, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    email-EuAll: EU email communication network

    The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.

    Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.
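    A minimal sketch of building that directed graph with networkx, assuming the SNAP-style edge-list format (whitespace-separated "FromNodeId ToNodeId" lines, with '#' comment lines) and an assumed file name:

    ```python
    import networkx as nx

    # Load the edge list as a directed graph; node IDs are integers.
    G = nx.read_edgelist(
        "email-EuAll.txt",        # assumed file name
        comments="#",
        create_using=nx.DiGraph(),
        nodetype=int,
    )

    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

    # Addresses that sent mail to the most distinct recipients.
    top_senders = sorted(G.out_degree(), key=lambda kv: kv[1], reverse=True)[:5]
    print("Top senders by out-degree:", top_senders)
    ```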

    email-Enron: Enron email network

    The Enron email communication network covers all the email communication within a dataset of around half a million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and, if an address i sent at least one email to address j, the graph contains an undirected edge between i and j. Note that non-Enron email addresses act as sinks and sources in the network, as we only observe their communication with the Enron email addresses.

    The Enron email data was originally released by William Cohen at CMU.

    wiki-Talk: Wikipedia Talk network

    Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page, that she and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of Wikipedia page edit history (from January 3 2008) we extracted all user talk page changes and created a network.

    The network contains all the users and discussion from the inception of Wikipedia till January 2008. Nodes in the network represent Wikipedia users and a directed edge from node i to node j represents that user i at least once edited a talk page of user j.

    comm-f2f-Resistance: Dynamic Face-to-Face Interaction Networks

    The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.

    The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).

    Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed, undirected, or multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#email

  10. Protein Protein Interactions Networks

    • kaggle.com
    zip
    Updated Apr 14, 2021
    Cite
    Alexander Chervov (2021). Protein Protein Interactions Networks [Dataset]. https://www.kaggle.com/alexandervc/protein-protein-interactions
    Explore at:
    Available download formats: zip (126,896,183 bytes)
    Dataset updated
    Apr 14, 2021
    Authors
    Alexander Chervov
    Description

    Context

    Data: protein–protein interaction networks (see https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction).

    These are typical biological data discussed in many graph data science studies.

    Many studies try to produce biological insights by applying graph data science analysis to these networks.

    Content

    Some files are downloaded from the BioGRID database: https://downloads.thebiogrid.org/BioGRID
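    A minimal sketch of turning a BioGRID tab-delimited export into a networkx graph; the file name and the interactor column names are assumptions, so check the header of the file you actually download:

    ```python
    import pandas as pd
    import networkx as nx

    df = pd.read_csv("BIOGRID-ALL.tab3.txt", sep="\t", low_memory=False)  # assumed file name

    # Build an undirected protein-protein interaction graph from interactor pairs.
    G = nx.from_pandas_edgelist(
        df,
        source="Official Symbol Interactor A",   # assumed column name
        target="Official Symbol Interactor B",   # assumed column name
    )
    print(G.number_of_nodes(), "proteins,", G.number_of_edges(), "interactions")
    ```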

