Big Data Statistics: By 2025, the world is expected to generate a staggering 181 zettabytes of data, an average annual growth rate of 23.13%, with 2.5 quintillion bytes created every day. That works out to an influx of roughly 29 terabytes every second.
While over 97% of businesses are now investing in Big Data, only about 40% are truly effective at leveraging analytics. The global Big Data analytics market, valued at approximately $348.21 billion in 2024, is on a trajectory to reach over $961.89 billion by 2032, exhibiting a robust CAGR of 13.5%.
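As a quick arithmetic check, the per-second figure and the 2032 projection follow directly from the numbers quoted above; the short Python sketch below reproduces both (the inputs are the rounded values cited in this article, not independent data).

```python
# Sanity-check the cited figures (inputs are the rounded values quoted above).
daily_bytes = 2.5e18                              # 2.5 quintillion bytes per day
per_second_tb = daily_bytes / 86_400 / 1e12
print(f"{per_second_tb:.1f} TB per second")       # ~28.9, i.e. roughly 29 TB/s

market_2024 = 348.21                              # market size in billions of USD
cagr = 0.135
market_2032 = market_2024 * (1 + cagr) ** 8       # 8 years of compounding, 2024 -> 2032
print(f"${market_2032:.2f} billion by 2032")      # ~$959 billion, consistent with the ~$961.89B figure
```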
Unstructured data, which constitutes about 90% of all global data, underscores how complex the statistical challenge has become. Entire industries are being fundamentally changed by Big Data, from healthcare, which could save up to $300 billion annually in the US alone through data initiatives, to entertainment, where Netflix saves an estimated $1 billion per year through its data-driven recommendation algorithms. So, let's dive deeper into this stats article, which covers everything about Big Data from current insights to future forecasts. Let's get started.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists, or data scientists.
There are 9 files in total, all in TSV format, built under a relational structure. The page file acts as the core of the dataset; around it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as "intermediate tables", making it possible to connect the pages both with those entities and with each other (the page_category, page_url, page_pub and page_link files). A loading sketch is shown below.
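To illustrate the relational structure, the following sketch joins the page file with one of the intermediate tables using pandas. The file names follow the description above, but the join column names (page_id, category_id) are assumptions for illustration and should be checked against the Dataset_summary document.

```python
import pandas as pd

# Core entity files (file names as described above; column names are assumed).
pages = pd.read_csv("page.tsv", sep="\t")
categories = pd.read_csv("category.tsv", sep="\t")

# Intermediate table linking pages to categories.
page_category = pd.read_csv("page_category.tsv", sep="\t")

# Relational join: attach category information to each page.
pages_with_categories = (
    page_category
    .merge(pages, on="page_id", how="left")
    .merge(categories, on="category_id", how="left")
)
print(pages_with_categories.head())
```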
The document Dataset_summary includes a detailed description of the dataset.
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
License information was derived automatically
This dataset was originally intended for the Data Science Nashville November 2018 meetup: Introduction to Gensim. I wanted to provide a large text corpus in a format often seen in industry, so I pulled the English Wikipedia dump from 2017-08-20, extracted the text using Gensim's excellent segment_wiki script, and finally wrote some custom code to populate a SQLite database.
The dataset encompasses nearly 5 million articles, with more than 23 million individual sections. Only article text is included; all links have been stripped and no metadata (e.g., behind-the-scenes discussion or version history) is included. Even then, I just barely met the file size limit, coming in at just below 20 GB.
I wanted to keep things simple, so everything is in a single table: articles. There is an index on article_id.
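For reference, here is a minimal way to read from the database with Python's built-in sqlite3 module. The table name (articles) and the article_id index are as described above; the database file name is an assumption for illustration.

```python
import sqlite3

# Open the SQLite database (file name assumed for illustration).
conn = sqlite3.connect("enwiki_20170820.sqlite")

# Everything lives in the single "articles" table, indexed on article_id,
# so lookups by article_id are cheap.
cursor = conn.execute("SELECT * FROM articles WHERE article_id = ?", (42,))
for row in cursor:
    print(row)

conn.close()
```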
I've also pre-trained some simple topic models and word embeddings based on this dataset. At time of upload, the file size limit is 20 GB, so I created another dataset that contains the pre-trained gensim models: English Wikipedia Articles 2017-08-20 Models.
As per The Wikimedia Foundation's requirements, this dataset is provided under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Permission is granted to copy, distribute, and/or modify Wikipedia's text under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and, unless otherwise noted, the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).
The banner image is provided by Lysander Yuen on Unsplash.
License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This dataset provides a comprehensive overview of internet usage across countries as of 2024. It includes data on the percentage of the population using the internet, sourced from multiple organizations such as the World Bank (WB), International Telecommunication Union (ITU), and the CIA. The dataset covers all United Nations member states, excluding North Korea, and provides insights into internet penetration rates, user counts, and trends over recent years. The data is derived from household surveys and internet subscription statistics, offering a reliable snapshot of global digital connectivity.
This dataset can be used in various data science applications, including:
- Digital Divide Analysis: Evaluate disparities in internet access between developed and developing nations.
- Trend Analysis: Study the growth of internet penetration over time across different regions.
- Policy Recommendations: Assist policymakers in identifying underserved areas and strategizing for improved connectivity.
- Market Research: Help businesses identify potential markets for digital products or services.
- Correlation Studies: Analyze relationships between internet penetration and socioeconomic indicators like GDP, education levels, or urbanization.
The dataset contains the following columns:
1. Location: Country or region name.
2. Rate (WB): Percentage of the population using the internet (World Bank data).
3. Year (WB): Year corresponding to the World Bank data.
4. Rate (ITU): Percentage of the population using the internet (ITU data).
5. Year (ITU): Year corresponding to the ITU data.
6. Users (CIA): Estimated number of internet users in absolute terms (CIA data).
7. Year (CIA): Year corresponding to the CIA data.
8. Notes: Additional notes or observations about specific entries.
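As a small usage sketch, the snippet below loads the dataset with pandas and compares the World Bank and ITU penetration figures; the CSV file name is an assumption, while the column names follow the list above.

```python
import pandas as pd

# File name is assumed; columns follow the schema listed above.
df = pd.read_csv("internet_users_by_country.csv")

# Countries with the lowest World Bank internet penetration rates.
lowest = df.nsmallest(10, "Rate (WB)")[["Location", "Rate (WB)", "Year (WB)"]]
print(lowest)

# How closely do the World Bank and ITU estimates agree?
print(df[["Rate (WB)", "Rate (ITU)"]].corr())
```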
The data has been sourced from publicly available and reputable organizations such as the World Bank, ITU, and CIA. These sources ensure transparency and ethical collection methods through household surveys and official statistics. The dataset excludes North Korea due to limited reliable information on its internet usage.
This dataset is based on information compiled from:
- World Bank
- International Telecommunication Union
- CIA World Factbook
- Wikipedia's "List of countries by number of Internet users" page
Special thanks to these organizations for providing open access to this valuable information, enabling deeper insights into global digital connectivity trends.
Citations: [1] https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
This dataset contains the tokenized version of a corpus made up of 60% of the OSCAR Spanish corpus, Wikipedia data from multiple countries, and medicine books. Because the full corpus is so large, I needed to cut the OSCAR portion to make it a bit smaller; for the same reason I uploaded the tokenized version, since if you want or need to work with this dataset inside Kaggle you would not have enough space to tokenize it yourself.
I have also uploaded the code used to tokenize the dataset.
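The uploaded code is the authoritative reference; purely as a rough sketch, a corpus like this could be tokenized with a Hugging Face tokenizer along the following lines (the specific tokenizer, file layout, and output format here are assumptions, not necessarily what was used for this dataset).

```python
from transformers import AutoTokenizer

# A Spanish BERT tokenizer is assumed here purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

def tokenize_file(path, out_path, max_length=512):
    """Tokenize a plain-text corpus file line by line and write the token ids."""
    with open(path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            ids = tokenizer(line.strip(), truncation=True, max_length=max_length)["input_ids"]
            dst.write(" ".join(map(str, ids)) + "\n")

tokenize_file("oscar_es_subset.txt", "oscar_es_subset.tokens")
```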
If you want me to upload the entire dataset divided into 4 parts, just ask. :)
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Data Science Fields Salary Categorization Dataset contains 9 columns:

| Dimension | Description |
| --- | --- |
| Working Year | The year the salary was paid (2020, 2021, 2022) |
| Designation | The role worked in during the year |
| Experience | The experience level in the job during the year. [EN - Entry level / Junior, MI - Mid level / Intermediate, SE - Senior level / Expert, EX - Executive level / Director] |
| Employment Status | The type of employment for the role. [PT - Part time, FT - Full time, CT - Contract, FL - Freelance] |
| Salary In Rupees | The total gross salary amount paid. |
| Employee Location | Employee's primary country of residence during the work year, as an ISO 3166 country code (see the ISO 3166 link below). |
| Company Location | The country of the employer's main office or contracting branch. |
| Company Size | The median number of people that worked for the company during the year. [S (small) - fewer than 50 employees, M (medium) - 50 to 250 employees, L (large) - more than 250 employees] |
| Remote Working Ratio | The overall amount of work done remotely. [0 - No remote work (less than 20%), 50 - Partially remote, 100 - Fully remote (more than 80%)] |
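For example, one could summarize salaries by experience level with pandas as sketched below; the CSV file name is an assumption, the column names follow the table above, and the comma-stripping step assumes the rupee amounts are stored as formatted strings.

```python
import pandas as pd

# File name is assumed; columns follow the table above.
df = pd.read_csv("ds_salaries_in_rupees.csv")

# Salary In Rupees may be stored as a formatted string; strip commas before converting.
df["Salary In Rupees"] = (
    df["Salary In Rupees"].astype(str).str.replace(",", "", regex=False).astype(float)
)

# Median gross salary per working year and experience level.
summary = (
    df.groupby(["Working Year", "Experience"])["Salary In Rupees"]
      .median()
      .unstack("Experience")
)
print(summary)
```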
I have collected the data from ai-jobs.net and modified it for my own convenience.
Original data source: https://salaries.ai-jobs.net/download/
ISO 3166 country codes: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Source: https://en.wikipedia.org/wiki/Kaggle
The dataset contains trends over time and by region for News, Web, and YouTube searches for "Kaggle".
The dataset is generated from Google Trends.
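The exact export settings are not specified here, but comparable interest-over-time and regional data can be pulled programmatically with the pytrends library, as in the rough sketch below (pytrends is an unofficial Google Trends client; the keyword, timeframe, and geo parameters shown are illustrative assumptions).

```python
from pytrends.request import TrendReq

# Unofficial Google Trends client; timeframe and geo below are illustrative.
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["Kaggle"], timeframe="today 5-y", geo="")

interest = pytrends.interest_over_time()    # interest scores over time, scaled 0-100
by_region = pytrends.interest_by_region()   # interest broken down by region

print(interest.tail())
print(by_region.sort_values("Kaggle", ascending=False).head())
```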
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Source: https://en.wikipedia.org/wiki/Kaggle
The dataset contains tweets regarding "Kaggle" from verified Twitter accounts.
"Kaggle" Tweets are scraped using Twint.
Twint is an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API.
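As a minimal sketch of how such a scrape can be configured with Twint (note that Twint is no longer actively maintained, so this may not run against the current Twitter/X site; the output file name and limit are assumptions):

```python
import twint

# Configure a search for tweets mentioning "Kaggle" from verified accounts.
c = twint.Config()
c.Search = "Kaggle"
c.Verified = True              # restrict to verified accounts
c.Store_csv = True             # write results to a CSV file
c.Output = "kaggle_tweets.csv"
c.Limit = 1000                 # illustrative cap on the number of tweets

twint.run.Search(c)
```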
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The network was generated using email data from a large European research institution. For a period from October 2003 to May 2005 (18 months) we have anonymized information about all incoming and outgoing email of the research institution. For each sent or received email message we know the time, the sender and the recipient of the email. Overall we have 3,038,531 emails between 287,755 different email addresses. Note that we have a complete email graph for only 1,258 email addresses that come from the research institution. Furthermore, there are 34,203 email addresses that both sent and received email within the span of our dataset. All other email addresses are either non-existing, mistyped or spam.
Given a set of email messages, each node corresponds to an email address. We create a directed edge between nodes i and j, if i sent at least one message to j.
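The construction rule above maps directly onto a directed-graph library. Here is a small sketch using networkx (not necessarily the tooling used to build the dataset), assuming the edge list is stored as whitespace-separated sender and recipient node ids:

```python
import networkx as nx

# Build the directed email graph: one edge i -> j if i sent at least one message to j.
G = nx.DiGraph()
with open("email-EuAll.txt") as f:         # file name is an assumption
    for line in f:
        if line.startswith("#"):           # skip comment/header lines, if any
            continue
        sender, recipient = line.split()[:2]
        G.add_edge(sender, recipient)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```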
The Enron email communication network covers all the email communication within a dataset of around half a million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge between i and j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.
The Enron email data was originally released by William Cohen at CMU.
Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page that they and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Using the latest complete dump of the Wikipedia page edit history (from January 3, 2008) we extracted all user talk page changes and created a network.
The network contains all the users and discussion from the inception of Wikipedia till January 2008. Nodes in the network represent Wikipedia users and a directed edge from node i to node j represents that user i at least once edited a talk page of user j.
The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.
The networks are weighted, directed and temporal. Each node represents a participant. Every 1/3 of a second, a directed edge from node u to v is weighted by the probability that participant u is looking at participant v or at the laptop. Additionally, we also provide a binary version, where an edge from u to v indicates that participant u looks at participant v (or the laptop).
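The release format is not described here, but a natural tabular representation of such a network is one row per (time step, source, target) triple with the gaze probability as the edge weight. The sketch below aggregates that hypothetical layout into average pairwise attention over a game; the file name and column names are assumptions.

```python
import pandas as pd

# Hypothetical layout: one row per 1/3-second time step and directed pair,
# with "weight" holding the probability that `source` looks at `target`.
edges = pd.read_csv("game_01_network.csv")   # file name and columns are assumptions

# Average attention each participant pays to each other participant over the game.
avg_attention = (
    edges.groupby(["source", "target"])["weight"]
         .mean()
         .unstack("target")
)
print(avg_attention.round(2))
```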
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consists of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
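For instance, with Snap.py (the Python interface to SNAP) a large edge list such as the Wikipedia talk network can be loaded and summarized along these lines; the file name is an assumption and the calls follow the classic function-style Snap.py API.

```python
import snap

# Load a directed graph from a whitespace-separated edge list
# (source and destination node ids in columns 0 and 1).
G = snap.LoadEdgeList(snap.PNGraph, "wiki-Talk.txt", 0, 1)

print("Nodes:", G.GetNodes())
print("Edges:", G.GetEdges())

# A couple of basic structural properties.
print("Average clustering coefficient:", snap.GetClustCf(G, -1))
MxWcc = snap.GetMxWcc(G)                   # largest weakly connected component
print("Largest WCC size:", MxWcc.GetNodes())
```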
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in Nov, 2009. SNAP uses a general purpose STL (Standard Template Library)-like library GLib developed at Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
Data: protein-protein interaction networks (see https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction).
These are typical biological data discussed by many graph data science studies, many of which try to produce biological insights by analyzing these networks.
Some files are downloaded from the BioGRID database: https://downloads.thebiogrid.org/BioGRID
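As a small sketch of how such a BioGRID export can be turned into a graph, assuming one of the tab-delimited download formats (the file name is an assumption, and column names like "Official Symbol Interactor A/B" should be checked against the downloaded version):

```python
import pandas as pd
import networkx as nx

# BioGRID tab-delimited download (file name assumed); keep only the two interactor columns.
interactions = pd.read_csv(
    "BIOGRID-ALL.tab3.txt",
    sep="\t",
    usecols=["Official Symbol Interactor A", "Official Symbol Interactor B"],
)

# Build an undirected protein-protein interaction graph.
G = nx.from_pandas_edgelist(
    interactions,
    source="Official Symbol Interactor A",
    target="Official Symbol Interactor B",
)
print(G.number_of_nodes(), "proteins,", G.number_of_edges(), "interactions")
```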