The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the five years to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high; growth exceeded earlier expectations because the COVID-19 pandemic increased demand, as more people worked and learned from home and made greater use of home entertainment options.

Storage capacity also growing
Only a small share of this newly created data is kept: just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
https://www.gesis.org/en/institute/data-usage-terms
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
https://www.gnu.org/licenses/gpl-3.0.html
This dataset investigates the relationship between Wordle answers and Google search spikes, particularly for uncommon words. It spans from June 21, 2021 to June 24, 2025.
It includes daily data for each Wordle answer, its search trend on that day, and frequency-based commonality indicators.
Each Wordle answer tends to produce a spike in search volume on the day it appears, and the spike is larger when the word is rare.
This dataset supports exploration of that relationship between word rarity and search interest.
| Column | Description |
|---|---|
| date | Date of the Wordle puzzle |
| word | Correct 5-letter Wordle answer |
| game | Wordle game number |
| wordfreq_commonality | Normalized frequency score using Python's wordfreq library |
| subtlex_commonality | Normalized frequency score using the SUBTLEX-US dataset |
| trend_day_global | Google search interest on the day (global, all categories) |
| trend_avg_200_global | 200-day average search interest (global, all categories) |
| trend_day_language | Search interest on Wordle day (Language Resources category) |
| trend_avg_200_language | 200-day average search interest (Language Resources category) |
Notes:
- All trend values are relative (0–100 scale, per Google Trends).
- Commonality scores are derived from the wordfreq Python library and the SUBTLEX-US dataset.
- Search trend data were collected with the pytrends library.
- An analysis using this data is available in the accompanying blog post.
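For readers who want to reproduce scores like these, the sketch below shows one plausible pipeline using the wordfreq and pytrends libraries noted above. The example word, dates, and the divide-by-8 normalization are illustrative assumptions, not the dataset author's exact method.

```python
# Sketch: reproducing commonality and trend scores like the columns above.
# Requires: pip install wordfreq pytrends pandas
# The normalization and dates below are assumptions for illustration only.
from wordfreq import zipf_frequency
from pytrends.request import TrendReq

word = "crane"  # example Wordle answer

# zipf_frequency returns a log-scale score, roughly 0-8; dividing by 8
# gives a crude 0-1 normalization in the spirit of wordfreq_commonality.
commonality = zipf_frequency(word, "en") / 8.0

# pytrends (an unofficial Google Trends client) returns relative search
# interest on Google's 0-100 scale; timeframes under ~9 months yield daily data.
pytrends = TrendReq(hl="en-US")
pytrends.build_payload([word], timeframe="2022-01-01 2022-07-31")
interest = pytrends.interest_over_time()

trend_on_day = interest.loc["2022-03-10", word]   # hypothetical puzzle date
trend_avg = interest[word].tail(200).mean()       # cf. trend_avg_200_global
print(commonality, trend_on_day, trend_avg)
```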
This statistic shows the daily digital data engagement interactions per person worldwide from 2010 to 2025. The average number of data interactions per connected person per day is expected to increase dramatically, from *** interactions per day in 2010 to almost ************* interactions per day by 2025.
Updated daily between 3:00 pm and 5:00 pm. Data are updated daily in the early afternoon and reflect laboratory results reported to the Washington State Department of Health as of midnight the day before. Data for previous dates will be updated as new results are entered, interviews are conducted, and data errors are corrected. Many people test positive but do not require hospitalization, so counts of positive cases do not necessarily indicate levels of demand at local hospitals. Reporting of test results to the Washington State Department of Health may be delayed by several days; counts will be updated when data are available. Only positive or negative test results are reflected in the counts; tests with pending or inconclusive results, or that were not performed, are excluded.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset collects job offers gathered by web scraping and filtered according to specific keywords, locations, and times. It gives users rich and precise search capabilities to uncover the best working arrangement for them. With the information collected, users can explore options that match their personal situation, skill set, and preferences for location and schedule. The columns provide detailed information on job titles, employer names, locations, and time frames, as well as other useful parameters, so you can make an informed choice about your next career opportunity.
This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:
Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.
Next, consider where the job is located: the Ubicació (Location) column tells you where each posting is from, so make sure it is somewhere that suits your needs.
Finally, consider when the position is available: the Temps_Oferta (Time frame) column indicates when each posting was made and whether it is a full-time, part-time, or casual/temporary role, so check that it meets your requirements before applying.
Additionally, if details such as hours per week or further schedule information are important criteria, that information is provided in the Horari (Schedule) column. Once all three criteria have been ticked off (keywords, location, and time frame), look at the Empresa (Company Name) and Nom_Oferta (Offer Name) columns to get an idea of who would be employing you should you land the gig.
All these pieces of data together should give any motivated individual what they need to find an optimal working arrangement. Keep hunting, and good luck!
- Machine learning can be used to group job offers, making it easier to identify similarities and differences between them. This could allow users to target their search for a working arrangement more precisely.
- The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better-informed decisions about their career options and goals.
- It may also provide insight into the local job market, enabling companies and employers to identify potential new opportunities or trends that may previously have gone unnoticed.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (No Copyright). You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: web_scraping_information_offers.csv

| Column name | Description |
|:---|:---|
| Nom_Oferta | Name of the job offer. (String) |
| Empresa | Company offering the job. (String) |
| Ubicació | Location of the job offer. (String) |
| Temps_Oferta | Time of the job offer. (String) |
| Horari | Schedule of the job offer. (String) |
We can offer the news data in two formats:
1) News flow: all news flow for our company coverage, including articles and tweets.
2) ESG Incidents: highlights any pressing issues that companies are facing in the news.
Our system executes around 100,000 searches per day across the internet. We search specific websites deemed to be high-quality and informationally additive for news about our whole company coverage.
These include:
- Mainstream publications like Reuters, CNN, CNBC, NBC News, etc.
- NGO websites such as Ethical Consumer and Anti-Slavery International
- Investigative journalist websites like MLex
- National papers like the Japan Times
- Trade publications like Insurance Journal
- Sustainability publications like Edie.net
Each article that we download goes through rigorous processing. This includes cleaning the body of the article and adding its metadata e.g., the date that it was published.
We then calculate our proprietary “relevance” scores, which determine how relevant the article is to the company, its CEO, its biggest Insider, and its biggest Outsider.
Natural Language Processing (NLP) techniques are used to calculate the similarity and sentiment scores for each article for each news topic.
We use Twitter’s API to download the latest tweets from Thought Leader Accounts. We track over 100 Thought Leaders such as Ceres and Science Based Targets.
These tweets are then searched to see if any of our company coverage is mentioned.
Afterwards, the same processing and calculation steps are followed as for the news articles.
ESG Incidents is the second news feed that we display for users. It is designed to show any pressing issues that a company is facing in the news in real-time.
To get ESG Incidents outputs we follow these steps:
1. Choose a time period of news to look at, e.g., 3 months.
2. For each news topic (we have around 50), pick out the article(s) that have the highest relevance to a company and the highest similarity score over that time period. We multiply these two scores together to calculate an “Incidence Score”.
3. Calculate how many times that news topic has come up in the news over the chosen time period, as a proportion of the total articles for that company.
We are then able to see emerging trends and incidents for a particular company over a time period and also have the ability to see the most relevant articles for each news topic. This allows investors to see any emerging incidents or scandals for a company in real-time.
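As a rough illustration of those steps, here is a hedged pandas sketch; the column names (date, topic, relevance, similarity) are hypothetical stand-ins for the vendor's proprietary schema and scoring.

```python
# Hypothetical sketch of the ESG Incidents aggregation described above.
import pandas as pd

def esg_incidents(articles: pd.DataFrame, months: int = 3) -> pd.DataFrame:
    """articles: one row per article, with columns
    ['date', 'topic', 'relevance', 'similarity'] (illustrative names)."""
    cutoff = articles["date"].max() - pd.DateOffset(months=months)
    window = articles[articles["date"] >= cutoff].copy()

    # Step 2: Incidence Score = relevance x similarity; keep the top-scoring
    # article per news topic within the window.
    window["incidence_score"] = window["relevance"] * window["similarity"]
    top = window.loc[window.groupby("topic")["incidence_score"].idxmax()]

    # Step 3: each topic's share of all articles for the company
    # over the chosen time period.
    share = window["topic"].value_counts(normalize=True).rename("topic_share")
    return top.merge(share, left_on="topic", right_index=True)
```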
This dataset is historical only and ends at 5/7/2021. For more information, please see http://dev.cityofchicago.org/open%20data/data%20portal/2021/05/04/covid-19-testing-by-person.html. The recommended alternative dataset for similar data beyond that date is https://data.cityofchicago.org/Health-Human-Services/COVID-19-Daily-Testing-By-Test/gkdw-2tgv.
This is the source data for some of the metrics available at https://www.chicago.gov/city/en/sites/covid-19/home/latest-data.html.
For all datasets related to COVID-19, see https://data.cityofchicago.org/browse?limitTo=datasets&sortBy=alpha&tags=covid-19.
This dataset contains counts of people tested for COVID-19 and their results. This dataset differs from https://data.cityofchicago.org/d/gkdw-2tgv in that each person is in this dataset only once, even if tested multiple times. In the other dataset, each test is counted, even if multiple tests are performed on the same person, although a person should not appear in that dataset more than once on the same day unless he/she had both a positive and not-positive test.
Only Chicago residents are included based on the home address as provided by the medical provider.
Molecular (PCR) and antigen tests are included, and only one test is counted for each individual. Tests are counted on the day the specimen was collected. A small number of tests collected prior to 3/1/2020 are not included in the table.
Not-positive lab results include negative results, invalid results, and tests not performed due to improper collection. Chicago Department of Public Health (CDPH) does not receive all not-positive results.
Demographic data are more complete for those who test positive; care should be taken when calculating percentage positivity among demographic groups.
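To make that caveat concrete, the toy pandas example below (with invented column names) shows how excluding records with missing demographics can inflate within-group positivity.

```python
# Toy illustration of the demographic-completeness caveat; columns invented.
import pandas as pd

df = pd.DataFrame({
    "race_ethnicity": ["Group A", "Group A", None, None, "Group B"],
    "result": ["positive", "not positive", "not positive",
               "not positive", "positive"],
})

# Overall positivity uses every person: 2/5 = 0.40.
overall = df["result"].eq("positive").mean()

# Within-group rates silently drop the rows with missing demographics,
# which here are disproportionately not-positive, so rates are biased upward.
known = df.dropna(subset=["race_ethnicity"])
by_group = known["result"].eq("positive").groupby(known["race_ethnicity"]).mean()
print(overall, by_group, sep="\n")
```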
All data are provisional and subject to change. Information is updated as additional details are received.
Data Source: Illinois National Electronic Disease Surveillance System
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Facebook is fast approaching 3 billion monthly active users. That is about 36% of the world's entire population logging in and using Facebook at least once a month.
How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, daily time spent with social media in the U.S. was just 2 hours and 16 minutes.

Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and keep up with current events.

Global impact of social media
Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased political polarization, and heightened everyday distractions.
The M4 dataset is a collection of 100,000 time series used for the fourth edition of the Makridakis forecasting competition. The M4 dataset consists of time series of yearly, quarterly, monthly and other (weekly, daily and hourly) data, which are divided into training and test sets. The minimum numbers of observations in the training sets are 13 for yearly, 16 for quarterly, 42 for monthly, 80 for weekly, 93 for daily and 700 for hourly series. The participants were asked to produce the following numbers of forecasts beyond the available data they had been given: six for yearly, eight for quarterly, 18 for monthly, 13 for weekly, and 14 and 48 forecasts respectively for the daily and hourly series.
The M4 dataset was created by selecting a random sample of 100,000 time series from the ForeDeCk database. The selected series were then scaled to prevent negative observations and values lower than 10, thus avoiding possible problems when calculating various error measures. The scaling was performed by simply adding a constant to the series so that their minimum value was equal to 10 (29 occurrences across the whole dataset). In addition, any information that could possibly lead to the identification of the original series was removed so as to ensure the objectivity of the results. This included the starting dates of the series, which did not become available to the participants until the M4 had ended.
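A minimal sketch of that scaling step, assuming plain NumPy arrays (the organizers' actual preprocessing code is not reproduced here):

```python
# Minimal sketch: shift a series so its minimum is exactly 10, as described
# above. Series whose minimum is already >= 10 are left unchanged.
import numpy as np

def scale_to_min_10(series: np.ndarray) -> np.ndarray:
    offset = max(0.0, 10.0 - series.min())
    return series + offset

raw = np.array([-4.0, 0.0, 7.5, 12.0])
scaled = scale_to_min_10(raw)     # -> [10. 14. 21.5 26.]
assert scaled.min() == 10.0
```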
Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.
Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.
Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.
Enhanced Data Quality: We have rigorous data validation processes and conduct quality assurance checks to guarantee the integrity and reliability of the training data you use to develop AI & ML models.
Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.
We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.
Web Data Reach: Our reach data represents the total count of records available within various categories and comprises attributes such as Country, Anonymous ID, IP address, and Search Query.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).
Data Attributes: Anonymous_id, IDType, Timestamp, Estid, Ip, userAgent, browserFamily, deviceType, Os, Url_metadata_canonical_url, Url_metadata_raw_query_params, refDomain, mappedEvent, Channel, searchQuery, Ttd_id, Adnxs_id, Keywords, Categories, Entities, Concepts
DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data are published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains case and test data by date of sample submission; the death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj.

The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information from June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6.

The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22.

The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada. To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data are suppressed.

COVID-19 test results are reported by date of specimen collection, including total, positive, negative, and indeterminate results for molecular and antigen tests. Molecular tests reported include polymerase chain reaction (PCR) and nucleic acid amplification (NAAT) tests. Test results may be reported several days after the result. Data are incomplete for the most recent days; data from previous dates are routinely updated. Records with a null date field summarize tests reported that were missing the date of collection. Starting in July 2020, this dataset is updated every weekday.
The Southern Great Plains 1997 (SGP97) Hydrology Experiment originated from an interdisciplinary investigation, "Soil Moisture Mapping at Satellite Temporal and Spatial Scales" (PI: Thomas J. Jackson, USDA Agricultural Research Service, Beltsville, MD) selected under the NASA Research Announcement 95-MTPE-03. The region selected for investigation is the best instrumented site for surface soil moisture, hydrology and meteorology in the world. This includes the USDA/ARS Little Washita Watershed, the USDA/ARS facility at El Reno, Oklahoma, the ARM/CART central facility, as well as the Oklahoma Mesonet. The National Climatic Data Center (NCDC) Summary of the Day Co-operative Dataset is one of several surface datasets provided for the Southern Great Plains (SGP) 1997 project. This NCDC Co-operative Observer (COOP) dataset contains data from sixty-two stations for the SGP 1997 time period (18 June 1997 through 18 July 1997) and in the SGP 1997 domain (approximately 97W to 99W longitude and 34.5N to 37N latitude). The primary thrust of the cooperative observing program is the recording of 24-hour precipitation amounts, but approximately 55% of the stations also record maximum and minimum temperatures. The observations are for the 24-hour period ending at the time of observation. Observer convenience or special program needs mean that observing times vary from station to station. However, the vast majority of observations are taken near either 7:00 AM or 7:00 PM local time. The NCDC Summary of the Day Co-operative Dataset (TD-3200) contains eight metadata parameters and fifteen data parameters and flags. The metadata parameters describe the date/time, network, station and location at which the data were collected. All times are UTC. Data values are valid for the 24 hours preceding the time of observation. Resources in this dataset:Resource Title: GeoData catalog record. File Name: Web Page, url: https://geodata.nal.usda.gov/geonetwork/srv/eng/catalog.search#/metadata/SGP97COOP_jjm_2015-05-04_0918
This dataset package is focused on U.S. construction materials and three construction companies: Cemex, Martin Marietta, and Vulcan.
In this package, SpaceKnow tracks manufacturing and processing facilities for construction material products all over the US. By tracking these facilities, we are able to give you near-real-time data on spending on these materials, which helps to predict residential and commercial real estate construction and spending in the US.
The dataset includes 40 indices focused on asphalt, cement, concrete, and building materials in general. You can look forward to receiving country-level and regional data (activity in the North, East, West, and South of the country) and the aforementioned company data.
SpaceKnow uses satellite synthetic aperture radar (SAR) data to capture activity at building material manufacturing and processing facilities in the US.
Data is updated daily, has an average lag of 4-6 days, and history back to 2017.
The insights provide you with level and change data for refineries, storage, manufacturing, logistics, and employee parking-based locations.
SpaceKnow offers three delivery options: CSV, API, and Insights Dashboard.
Available Indices
Companies:
- Cemex (CX): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
- Martin Marietta (MLM): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
- Vulcan (VMC): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
USA Indices:
- Aggregates USA
- Asphalt USA
- Cement USA
- Cement Refinery USA
- Cement Storage USA
- Concrete USA
- Construction Materials USA
- Construction Mining USA
- Construction Parking Lots USA
- Construction Materials Transfer Hub US
- Cement - Midwest, Northeast, South, West
- Cement Refinery - Midwest, Northeast, South, West
- Cement Storage - Midwest, Northeast, South, West
Why get SpaceKnow's U.S. Construction Materials Package?
Monitor Construction Market Trends: Near-real-time insights into the construction industry allow clients to understand and anticipate market trends better.
Track Company Performance: Monitor operational activities, such as the volume of sales.
Assess Risk: Use satellite activity data to assess the risks associated with investing in the construction industry.
Index Methodology Summary
Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices: the CFI-R index gives the data in levels, showing how many square meters are covered by metallic objects (for example, employee cars at a facility); the CFI-S index gives the change in the data, showing how many square meters have changed within the locations between two consecutive satellite images.
How to interpret the data
SpaceKnow indices can be compared with related economic indicators or KPIs. If the economic indicator is reported monthly, perform a 30-day rolling sum and pick the last day of the month to compare with the indicator; each data point will then reflect approximately the sum for that month. If the indicator is reported quarterly, perform a 90-day rolling sum and pick the last day of the quarter; each data point will then reflect approximately the sum for that quarter.
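A hedged pandas sketch of that recipe, assuming a daily index stored as a Series with a DatetimeIndex (names are illustrative):

```python
# Sketch: convert a daily CFI series into approximate monthly values by
# taking a 30-day rolling sum and sampling the last day of each month.
import pandas as pd

def monthly_from_daily(daily: pd.Series) -> pd.Series:
    """daily: CFI values with a DatetimeIndex, one observation per day."""
    rolling = daily.rolling(window=30).sum()
    return rolling.resample("ME").last()  # "ME" = month end; use "M" on pandas < 2.2

# For a quarterly indicator, use window=90 and resample("QE").last().
```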
Where the data comes from
SpaceKnow brings you a data edge by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company's infrastructure searches for and downloads new imagery every day, and computations on the data are completed in less than 24 hours.
In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the construction industry with just a 4-6 day lag, on average.
The construction materials data help you to estimate the performance of the construction sector and the business activity of the selected companies.
Delivering high-quality data rests on successfully defining each location from which data are observed and extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.
See below how our Construction Materials index performs against the US non-residential construction spending benchmark.
Each individual location is precisely defined to avoid noise in the data that may arise from traffic or seasonal changes in vegetation.
SpaceKnow uses radar imagery and its own unique algorithms, so the indices do not lose their significance in bad weather conditions such as rain or heavy clouds.
→ Reach out to get a free trial.
...
The Google Trends dataset provides critical signals that individual users and businesses alike can leverage to make better data-driven decisions. This dataset simplifies manual interaction with the existing Google Trends UI by automating and exposing anonymized, aggregated, and indexed search data in BigQuery. It includes the Top 25 stories and Top 25 Rising queries from Google Trends, made available as two separate BigQuery tables, with a new set of top terms appended daily. Each set of Top 25 and Top 25 Rising terms expires after 30 days and is accompanied by a rolling five-year window of historical data for 210 distinct locations in the United States. This Google dataset is hosted in Google BigQuery as part of Google Cloud's Datasets solution and is included in BigQuery's 1 TB/mo free tier of processing, meaning each user receives 1 TB of free BigQuery processing every month that can be used to run queries on this public dataset.
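As a hedged example, a query like the one below could be run from Python with the official BigQuery client library. The table path follows the public dataset's naming (bigquery-public-data.google_trends.top_terms), but verify the exact table and schema in the BigQuery console before relying on them.

```python
# Sketch: query the public Google Trends top terms from BigQuery.
# Requires: pip install google-cloud-bigquery db-dtypes, plus GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project; queries count
                            # against the 1 TB/month free processing tier
sql = """
    SELECT term, rank, week, dma_name
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE refresh_date = (SELECT MAX(refresh_date)
                          FROM `bigquery-public-data.google_trends.top_terms`)
    ORDER BY rank
    LIMIT 25
"""
print(client.query(sql).to_dataframe())
```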
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Google Search Trends: Economic Measures: Mortgage Loan data was reported at 10.000 Score in 14 May 2025. This records a decrease from the previous number of 12.000 Score for 13 May 2025. Google Search Trends: Economic Measures: Mortgage Loan data is updated daily, averaging 10.000 Score from Dec 2021 (Median) to 14 May 2025, with 1261 observations. The data reached an all-time high of 47.000 Score in 21 Apr 2023 and a record low of 0.000 Score in 14 Feb 2023. Google Search Trends: Economic Measures: Mortgage Loan data remains active status in CEIC and is reported by Google Trends. The data is categorized under Global Database’s Spain – Table ES.Google.GT: Google Search Trends: by Categories.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Eritrea Google Search Trends: Computer & Electronics: Apple data was reported at 0.000 Score in 15 May 2025. This stayed constant from the previous number of 0.000 Score for 14 May 2025. Eritrea Google Search Trends: Computer & Electronics: Apple data is updated daily, averaging 0.000 Score from Dec 2021 (Median) to 15 May 2025, with 1262 observations. The data reached an all-time high of 100.000 Score in 19 Apr 2025 and a record low of 0.000 Score in 15 May 2025. Eritrea Google Search Trends: Computer & Electronics: Apple data remains active status in CEIC and is reported by Google Trends. The data is categorized under Global Database’s Eritrea – Table ER.Google.GT: Google Search Trends: by Categories.
As of the third quarter of 2024, internet users in South Africa spent more than **** hours and ** minutes online per day, ranking first among the regions worldwide. Brazil followed, with roughly **** hours of daily online usage. In the same period, Japan registered the lowest number of daily hours spent online, with users in the country spending an average of over **** hours per day using the internet. The data include daily time spent online on any device.

Social media usage
In recent years, social media has become integral to internet users' daily lives, with users spending an average of *** minutes daily on social media activities. In April 2024, global social network penetration reached **** percent, highlighting its widespread adoption. Among the various platforms, YouTube stands out, with over *** billion monthly active users, making it one of the most popular social media platforms.

YouTube's global popularity
In 2023, the keyword "YouTube" ranked among the most popular search queries on Google, highlighting the platform's immense popularity. YouTube generated most of its traffic through mobile devices, with about 98 billion visits. This popularity was particularly evident in the United Arab Emirates, where YouTube penetration reached approximately **** percent, the highest in the world.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate these datasets and analyse them are shared in the GitHub repository for this work.
The dataset contains large files totaling approximately 60 GB, so please exercise caution when extracting the data from the compressed files.
The dataset contains files that can take a significant amount of script run time to generate or reproduce.
Dataset Contents
We briefly summarize the included files in our dataset. Please refer to the documentation for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.
epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth: We share this model file, originally provided by Jobin et al., to enable the classification of figures found in our dataset. Please place this into the model/ directory.
model-results.csv: This file contains results from the classification performed on the figures found in the notebooks in our dataset.
Performing this classification may take up to a day.
a11y-scan-dataset.zip: This archive contains two files and results in datasets of approximately 60GB when extracted. Please ensure that you have sufficient disk space to uncompress this zip archive. The archive contains:
a11y/a11y-detailed-result.csv: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes.
The detailed result file can be really large (> 60 GB) and can be time-consuming to construct.
a11y/a11y-aggregate-scan.csv: This file is an aggregate of the detailed results, containing the number of each type of error found in each notebook (see the sketch after this file list).
This file is also shared outside the compressed directory.
errors-different-counts-a11y-analyze-errors-summary.csv: This file contains the counts of errors that occur in notebooks across different themes.
nb_processed_cell_html.csv: This file contains metadata corresponding to each cell extracted from the html exports of our notebooks.
nb_first_interactive_cell.csv: This file contains the necessary metadata to compute the first interactive element, as defined in our paper, in each notebook.
nb_processed.csv: This file contains the data produced by processing the notebooks, including the number of images, imports, languages, and cell-level information.
processed_function_calls.csv: This file contains information about the notebooks and the various imports and function calls used within them.
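A hypothetical sketch of deriving the aggregate scan file from the detailed results, assuming chunked reads of the very large CSV and illustrative column names (notebook_id, error_type):

```python
# Hypothetical sketch: build a11y-aggregate-scan.csv from the >60 GB
# detailed scan file by counting each error type per notebook.
# Column names are illustrative; see the repository docs for the schema.
from functools import reduce
import pandas as pd

chunks = pd.read_csv(
    "a11y/a11y-detailed-result.csv",
    usecols=["notebook_id", "error_type"],  # assumed column names
    chunksize=1_000_000,                    # stream to bound memory use
)
counts = reduce(
    lambda a, b: a.add(b, fill_value=0),
    (chunk.groupby(["notebook_id", "error_type"]).size() for chunk in chunks),
)
aggregate = counts.unstack(fill_value=0)    # one row per notebook
aggregate.to_csv("a11y/a11y-aggregate-scan.csv")
```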