You can check the fields description in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.
Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.
Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.
Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.
This database is available in JSON format only.
You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terrabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Eleasticsearch, and Google Big Query. Let us know if you’d like to get your data to any other storage or database.
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Venessa
Released under Database: Open Database, Contents: Database Contents
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset compares how much U.S. adults trust ChatGPT relative to Google Search, including responses from a 2025 national survey measuring perceptions of AI accuracy and reliability.
This dashboard provide insights by month on how people find State of Iowa agency listings on the web via Google Search and Maps, and what they do once they find it to include providing reviews (ratings), accessing agency websites, requesting directions, and making calls.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental studies in the area of Psychology and Behavioral Economics have suggested that people change their search pattern in response to positive and negative events. Using Internet search data provided by Google, we investigated the relationship between stock-specific events and related Google searches. We studied daily data from 13 stocks from the Dow-Jones and NASDAQ100 indices, over a period of 4 trading years. Focusing on periods in which stocks were extensively searched (Intensive Search Periods), we found a correlation between the magnitude of stock returns at the beginning of the period and the volume, peak, and duration of search generated during the period. This relation between magnitudes of stock returns and subsequent searches was considerably magnified in periods following negative stock returns. Yet, we did not find that intensive search periods following losses were associated with more Google searches than periods following gains. Thus, rather than increasing search, losses improved the fit between people’s search behavior and the extent of real-world events triggering the search. The findings demonstrate the robustness of the attentional effect of losses.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank
This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.
For more information, see the World Bank website.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population
http://data.worldbank.org/data-catalog/ed-stats
https://cloud.google.com/bigquery/public-data/world-bank-education
Citation: The World Bank: Education Statistics
Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @till_indeman from Unplash.
Of total government spending, what percentage is spent on education?
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG" alt="">
See here for more information.
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json
format. This file contains an entry for each paper, containing:
- id
: ArXiv ID (can be used to access the paper, see below)
- submitter
: Who submitted the paper
- authors
: Authors of the paper
- title
: Title of the paper
- comments
: Additional info, such as number of pages and figures
- journal-ref
: Information about the journal the paper was published in
- doi
: https://www.doi.org
- abstract
: The abstract of the paper
- categories
: Categories / tags in the ArXiv system
- versions
: A version history
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}
: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}
: Direct link to download the PDF
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset
or through Google API (json documentation and xml documentation).
You can use for example gsutil to download the data to your local machine. ```
gsutil cp gs://arxiv-dataset/arxiv/
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/ ```
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
This dataset provides insights by month on how people find State of Iowa agency listings on the web via Google Search and Maps, and what they do once they find it to include providing reviews (ratings), accessing agency websites, requesting directions, and making calls.
https://www.gnu.org/licenses/gpl-3.0.htmlhttps://www.gnu.org/licenses/gpl-3.0.html
This dataset investigates the relationship between Wordle answers and Google search spikes, particularly for uncommon words. It spans from June 21, 2021 to June 24, 2025.
It includes daily data for each Wordle answer, its search trend on that day, and frequency-based commonality indicators.
Each Wordle answer causes a spike in search volume on the day it appears — more so if the word is rare.
This dataset supports exploration of:
Column | Description |
---|---|
date | Date of the Wordle puzzle |
word | Correct 5-letter Wordle answer |
game | Wordle game number |
wordfreq_commonality | Normalized frequency score using Python’s wordfreq library |
subtlex_commonality | Normalized frequency score using SUBTLEX-US dataset |
trend_day_global | Google search interest on the day (global, all categories) |
trend_avg_200_global | 200-day average search interest (global, all categories) |
trend_day_language | Search interest on Wordle day (Language Resources category) |
trend_avg_200_language | 200-day average search interest (Language Resources category) |
Notes: - All trend values are relative (0–100 scale, per Google Trends)
wordfreq
Python librarypytrends
Can find analysis done using this data in the blog post
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.
Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg" alt="preview">
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IndianFoodNet-30 is created by Ritu Agarwal, Nikunj Bansal, Tanupriya Choudhury, Tanmay Sarkar & Neelu Jyothi Ahuja with a goal of building an Indian Food detection model. It contains more than 5500 images of 30 popular Indian food items.
We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.
This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).
If you find our dataset useful, please cite us as:
@dataset{dataset,
author = {Agarwal, Ritu and Bansal, Nikunj and Choudhury, Tanupriya and Sarkar, Tanmay and J.Ahuja, Neelu},
year = {2023},
title = {IndianFoodNet-30 Dataset},
publisher = {Roboflow Universe},
url = {https://universe.roboflow.com/indianfoodnet/indianfoodnet},
}
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The CMS National Plan and Provider Enumeration System (NPPES) was developed as part of the Administrative Simplification provisions in the original HIPAA act. The primary purpose of NPPES was to develop a unique identifier for each physician that billed medicare and medicaid. This identifier is now known as the National Provider Identifier Standard (NPI) which is a required 10 digit number that is unique to an individual provider at the national level.
Once an NPI record is assigned to a healthcare provider, parts of the NPI record that have public relevance, including the provider’s name, speciality, and practice address are published in a searchable website as well as downloadable file of zipped data containing all of the FOIA disclosable health care provider data in NPPES and a separate PDF file of code values which documents and lists the descriptions for all of the codes found in the data file.
The dataset contains the latest NPI downloadable file in an easy to query BigQuery table, npi_raw. In addition, there is a second table, npi_optimized which harnesses the power of Big Query’s next-generation columnar storage format to provide an analytical view of the NPI data containing description fields for the codes based on the mappings in Data Dissemination Public File - Code Values documentation as well as external lookups to the healthcare provider taxonomy codes . While this generates hundreds of columns, BigQuery makes it possible to process all this data effectively and have a convenient single lookup table for all provider information.
Fork this kernel to get started.
https://console.cloud.google.com/marketplace/details/hhs/nppes?filter=category:science-research
Dataset Source: Center for Medicare and Medicaid Services. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @rawpixel from Unplash.
What are the top ten most common types of physicians in Mountain View?
What are the names and phone numbers of dentists in California who studied public health?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update — December 7, 2014. – Evidence-based medicine (EBM) is not working for many reasons, for example: 1. Incorrect in their foundations (paradox): hierarchical levels of evidence are supported by opinions (i.e., lowest strength of evidence according to EBM) instead of real data collected from different types of study designs (i.e., evidence). http://dx.doi.org/10.6084/m9.figshare.1122534 2. The effect of criminal practices by pharmaceutical companies is only possible because of the complicity of others: healthcare systems, professional associations, governmental and academic institutions. Pharmaceutical companies also corrupt at the personal level, politicians and political parties are on their payroll, medical professionals seduced by different types of gifts in exchange of prescriptions (i.e., bribery) which very likely results in patients not receiving the proper treatment for their disease, many times there is no such thing: healthy persons not needing pharmacological treatments of any kind are constantly misdiagnosed and treated with unnecessary drugs. Some medical professionals are converted in K.O.L. which is only a puppet appearing on stage to spread lies to their peers, a person supposedly trained to improve the well-being of others, now deceits on behalf of pharmaceutical companies. Probably the saddest thing is that many honest doctors are being misled by these lies created by the rules of pharmaceutical marketing instead of scientific, medical, and ethical principles. Interpretation of EBM in this context was not anticipated by their creators. “The main reason we take so many drugs is that drug companies don’t sell drugs, they sell lies about drugs.” ―Peter C. Gøtzsche “doctors and their organisations should recognise that it is unethical to receive money that has been earned in part through crimes that have harmed those people whose interests doctors are expected to take care of. Many crimes would be impossible to carry out if doctors weren’t willing to participate in them.” —Peter C Gøtzsche, The BMJ, 2012, Big pharma often commits corporate crime, and this must be stopped. Pending (Colombia): Health Promoter Entities (In Spanish: EPS ―Empresas Promotoras de Salud).
As global communities responded to COVID-19, we heard from public health officials that the same type of aggregated, anonymized insights we use in products such as Google Maps would be helpful as they made critical decisions to combat COVID-19. These Community Mobility Reports aimed to provide insights into what changed in response to policies aimed at combating COVID-19. The reports charted movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Allergen30 is created by Mayank Mishra, Nikunj Bansal, Tanmay Sarkar and Tanupriya Choudhury with a goal of building a robust detection model that can assist people in avoiding possible allergic reactions.
It contains more than 6,000 images of 30 commonly used food items which can cause an adverse reaction within a human body. This dataset is one of the first research attempts in training a deep learning based computer vision model to detect the presence of such food items from images. It also serves as a benchmark for evaluating the efficacy of object detection methods in learning the otherwise difficult visual cues related to food items.
There are multiple food items pertaining to specific food intolerances which can trigger an allergic reaction. Such food intolerance primarily include Lactose, Histamine, Gluten, Salicylate, Caffeine and Ovomucoid intolerance.
https://github.com/mmayank74567/mmayank74567.github.io/blob/master/images/FoodIntol.png?raw=true" alt="Food intolerance">
The following table contains the description relating to the 30 class labels in our dataset.
S. No. | Allergen | Food label | Description |
---|---|---|---|
1 | Ovomucoid | egg | Images of egg with yolk (e.g. sunny side up eggs) |
2 | Ovomucoid | whole_egg_boiled | Images of soft and hard boiled eggs |
3 | Lactose/Histamine | milk | Images of milk in a glass |
4 | Lactose | icecream | Images of icecream scoops |
5 | Lactose | cheese | Images of swiss cheese |
6 | Lactose/ Caffeine | milk_based_beverage | Images of tea/ coffee with milk in a cup/glass |
7 | Lactose/Caffeine | chocolate | Images of chocolate bars |
8 | Caffeine | non_milk_based_beverage | Images of soft drinks and tea/coffee without milk in a cup/glass |
9 | Histamine | cooked_meat | Images of cooked meat |
10 | Histamine | raw_meat | Images of raw meat |
11 | Histamine | alcohol | Images of alcohol bottles |
12 | Histamine | alcohol_glass | Images of wine glasses with alcohol |
13 | Histamine | spinach | Images of spinach bundle |
14 | Histamine | avocado | Images of avocado sliced in half |
15 | Histamine | eggplant | Images of eggplant |
16 | Salicylate | blueberry | Images of blueberry |
17 | Salicylate | blackberry | Images of blackberry |
18 | Salicylate | strawberry | Images of strawberry |
19 | Salicylate | pineapple | Images of pineapple |
20 | Salicylate | capsicum | Images of bell pepper |
21 | Salicylate | mushroom | Images of mushrooms |
22 | Salicylate | dates | Images of dates |
23 | Salicylate | almonds | Images of almonds |
24 | Salicylate | pistachios | Images of pistachios |
25 | Salicylate | tomato | Images of tomato and tomato slices |
26 | Gluten | roti | Images of roti |
27 | Gluten | pasta | Images of one serving of penne pasta |
28 | Gluten | bread | Images of bread slices |
29 | Gluten | bread_loaf | Images of bread loaf |
30 | Gluten | pizza | Images of pizza and pizza slices |
We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.
This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).
The global number of smartphone users in was forecast to continuously increase between 2024 and 2029 by in total 1.8 billion users (+42.62 percent). After the ninth consecutive increasing year, the smartphone user base is estimated to reach 6.1 billion users and therefore a new peak in 2029. Notably, the number of smartphone users of was continuously increasing over the past years.Smartphone users here are limited to internet users of any age using a smartphone. The shown figures have been derived from survey data that has been processed to estimate missing demographics.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).Find more key insights for the number of smartphone users in countries like Australia & Oceania and Asia.
The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people by contacting over 3.5 million households across the country. The resulting data provides incredibly detailed demographic information across the US aggregated at various geographic levels which helps determine how more than $675 billion in federal and state funding are distributed each year. Businesses use ACS data to inform strategic decision-making. ACS data can be used as a component of market research, provide information about concentrations of potential employees with a specific education or occupation, and which communities could be good places to build offices or facilities. For example, someone scouting a new location for an assisted-living center might look for an area with a large proportion of seniors and a large proportion of people employed in nursing occupations. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. For more information, see the Census Bureau's ACS Information Guide . This public dataset is hosted in Google BigQuery as part of the Google Cloud Public Datasets Program , with Carto providing cleaning and onboarding support. It is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A dataset providing information about local council services in Leeds. Leeds City Council uses this information to populate the Knowledge Panels on the Google search website. The dataset includes type of service, contact information and opening times. What is a Knowledge Panel? When people search for a business on Google, they may see information about that business in a box that appears to the right of their search results. The information in the box, called the Knowledge Panel, can help customers discover and contact your business. Is the information correct? If you spot any information which you believe to be incorrect please contact us on webmaster@leeds.gov.uk . We can then investigate this and update this dataset and the Google Knowledge Panel. Automated update This dataset is automatically updated on a fortnightly basis
IoTeX is a decentralized crypto system, a new generation of blockchain platform for the development of the Internet of things (IoT). The project team is sure that the users do not have such an application that would motivate to implement the technology of the Internet of things in life. And while this will not be created, people will not have the desire to spend money and time on IoT. The developers of IoTeX decided to implement not the application itself, but the platform for creation. It is through the platform that innovative steps in the space of the Internet of things will be encouraged. Learn more... This dataset is one of many crypto datasets that are available within the Google Cloud Public Datasets . As with other Google Cloud public datasets, you can query this dataset for free, up to 1TB/month of free processing, every month. Watch this short video to learn how to get started with the public datasets. Want to know how the data from these blockchains were brought into BigQuery, and learn how to analyze the data? En savoir plus
You can check the fields description in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.
Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.
Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.
Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.
This database is available in JSON format only.
You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terrabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Eleasticsearch, and Google Big Query. Let us know if you’d like to get your data to any other storage or database.