5 datasets found

SemBenchmarkSearchQueries
huggingface.co
Updated May 28, 2025
Cite
vCache (2025). SemBenchmarkSearchQueries [Dataset]. https://huggingface.co/datasets/vCache/SemBenchmarkSearchQueries
Dataset updated
May 28, 2025
Authors
vCache
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
The SemCacheSearchQueries benchmark is designed to evaluate semantic caching in open-domain search applications. Large-scale search engines, such as Google, increasingly rely on LLMs to generate direct answers to natural language queries. While this improves user experience, it introduces significant latency and cost, particularly at the scale of millions of daily queries. Many queries issued to search engines are paraphrased variations of earlier inputs, making semantic caching a natural fit… See the full description on the dataset page: https://huggingface.co/datasets/vCache/SemBenchmarkSearchQueries.
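The idea of serving paraphrased queries from a cache can be illustrated with a minimal sketch. It is not the benchmark's or vCache's implementation: the bag-of-words `embed` function stands in for a real embedding model, and the `SemanticCache` class and its `threshold` value are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a new query is similar enough to an earlier one."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # hypothetical cut-off, not taken from the benchmark
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query):
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller would fall back to the LLM

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("what is the capital city of france")  # paraphrase of the cached query
miss = cache.get("how tall is mount everest")          # unrelated query
print(hit, miss)
```

In a setup like this the threshold trades hit rate against the risk of returning an answer to a query that only looks similar.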
COVID-19 Search Trends symptoms dataset
console.cloud.google.com
Updated Jan 5, 2023
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&hl=ca&inv=1&invt=Ab5IOA (2023). COVID-19 Search Trends symptoms dataset [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-search-trends?hl=ca
Dataset updated
Jan 5, 2023
Dataset provided by
Google (http://google.com/)
BigQuery (https://cloud.google.com/bigquery)
Description
The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for a broad set of health symptoms, signs, and conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom.

This dataset is intended to help researchers to better understand the impact of COVID-19. It shouldn't be used for medical diagnostic, prognostic, or treatment purposes. It also isn't intended to be used for guidance on personal travel plans. To learn more about the dataset, how we generate it and preserve privacy, read the data documentation. To visualize the data, try exploring these interactive charts and map of symptom search trends.

As of Dec. 15, 2020, the dataset was expanded to include trends for Australia, Ireland, New Zealand, Singapore, and the United Kingdom. This expanded data is available in new tables that provide data at country and two subregional levels. We will not be updating existing state/county tables going forward.

All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery.
Data from: Examining bias perpetuation in academic search engines: an...
zenodo.org
bin, csv, zip
Updated Feb 8, 2024
Cite
Ulloa Roberto (2024). Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar [Dataset]. http://doi.org/10.5281/zenodo.10636247
Available download formats: bin, zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.10636247
Dataset updated
Feb 8, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ulloa Roberto
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Main dataset (main.csv)

The main file contains one entry per search result across all collected pages (N=28530). It comprises the following columns:

id: Unique identifier of the file (corresponds to the last part of the filename)

filename: Name of the file associated with the row (the file is in serp_html.zip)

engine: The search engine used (Google Scholar or Semantic Scholar).

browser: The web browser used for the search (Firefox or Chrome)

region: The geographical region where the search was made.

year: The year when the search was made

month: The month when the search was made

day: The day when the search was made

query: The full search query that was used

query_type: The type of the search query (health or technology)

topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')

trt: Treatment variable associated with the search (benefits or risks).

url: The URL of the (article) search result

title: The title of the (article) search result.

authorship: The author(s) of the (article) search result.

abstract_id: Unique identifier for the abstract of the (article) search result which connects with annotated-abstracts_v0.6.xlsx

abstract_hash: Hash value of the abstract for data integrity

link_n: The total number of results in the search page

rank: The rank of the search result on the search engine results page.

annotation: Any annotations associated with the (article's abstract) search result. One of: '1. Confirms benefits', '2. Confirms risks', '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '5. Abstract not related to {topic}'

valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts only containing benefits
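As a rough illustration of how these columns might be used, the sketch below averages `valence` per search engine and treatment using only the standard library. The sample rows are invented for illustration and are not taken from main.csv; the sketch also assumes that `valence` is stored as a plain number in every row, which should be checked against the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using a subset of the documented columns;
# the values are invented for illustration and are not from main.csv.
SAMPLE = """engine,trt,topic,rank,valence
Google Scholar,benefits,coffee,1,1
Google Scholar,benefits,coffee,2,0
Google Scholar,risks,coffee,1,-1
Semantic Scholar,benefits,coffee,1,1
Semantic Scholar,risks,coffee,1,-1
Semantic Scholar,risks,coffee,2,-1
"""


def mean_valence(rows):
    """Average valence per (engine, trt) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        key = (row["engine"], row["trt"])
        totals[key][0] += float(row["valence"])  # assumes a numeric value in every row
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# To read the real file instead, pass csv.DictReader(open("main.csv", newline="")).
rows = csv.DictReader(io.StringIO(SAMPLE))
result = mean_valence(rows)
for (engine, trt), value in sorted(result.items()):
    print(f"{engine:17} {trt:9} {value:+.2f}")
```

Grouping by `engine` and `trt` mirrors the audit design described by the columns, where each query is framed around either benefits or risks.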

Annotated abstracts (annotated-abstracts_v0.6.xlsx)

Manually annotated abstracts resulting from the searches.

Raw search engine result pages (serp_html.zip)

The zip contains one HTML file per collected search engine result page (N=2853). See the filename column of the main dataset.
Query auto-completions for German politicians of the 18th Bundestag
zenodo.org
csv
Updated Jan 24, 2020
Cite
Anastasiia Samokhina; Malte Bonart; Philipp Schaer; Gernot Heisenberg (2020). Query auto-completions for German politicians of the 18th Bundestag [Dataset]. http://doi.org/10.5281/zenodo.3462046
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.3462046
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anastasiia Samokhina; Malte Bonart; Philipp Schaer; Gernot Heisenberg
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Germany
Description
bundestag.csv - UTF-8 encoded comma separated text file

This dataset contains the members of the 18th German Bundestag as it was constituted in late 2016.

terms.csv - UTF-8 encoded comma separated text file

This dataset contains the unordered and pooled auto-completions for the German politicians from Bing search (http://api.bing.net/osjson.aspx), from Duck-Duck-Go (https://duckduckgo.com/ac/) and from Google search (http://clients1.google.de/complete/search). The data was crawled (mostly) twice per day from 2017/02/03 to 2017/06/19. German language settings were used for Google and Bing; the English language setting was used for Duck-Duck-Go. The API requests were sent from an IP address in Cologne, Germany.

: google, bing or ddg
Bitcoin Blockchain Historical Data
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
Available download formats: zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks, each containing a hash pointer to the previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate flaws in the financial industry.
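The linking of blocks through hash pointers can be sketched in a few lines. This is a deliberately simplified illustration, not Bitcoin's actual block format: real Bitcoin blocks use a binary header hashed twice with SHA-256, a Merkle root over the transactions, and proof of work, none of which appear here. The helper names and the sample transactions are hypothetical.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block's fields."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, timestamp, transactions):
    """Add a block whose hash pointer refers to the current chain tip."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev_hash, "timestamp": timestamp, "transactions": transactions})


def is_valid(chain):
    """Check that every block's pointer matches the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))


chain = []
append_block(chain, 1, ["coinbase -> alice: 50"])
append_block(chain, 2, ["alice -> bob: 10"])
append_block(chain, 3, ["bob -> carol: 4"])
valid_before = is_valid(chain)

chain[1]["transactions"] = ["alice -> bob: 1000"]  # tamper with an earlier block
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Because each block stores the hash of its predecessor, changing any earlier block invalidates the pointer held by the block that follows it.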

Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

Content

In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
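A query against one of these tables might look like the sketch below. It assumes the `transactions` table exposes `block_timestamp` and `output_value` columns, which should be verified against the table schema; `build_daily_volume_sql` and `run_query` are hypothetical helper names. The client call needs the `google-cloud-bigquery` package and valid credentials, so it is left commented out.

```python
def build_daily_volume_sql(table="bigquery-public-data.crypto_bitcoin.transactions", limit=7):
    """Build a Standard SQL query for total output value per day.

    Assumes the table has `block_timestamp` and `output_value` columns;
    check the schema before running.
    """
    return (
        "SELECT DATE(block_timestamp) AS day, SUM(output_value) AS total_output "
        f"FROM `{table}` "
        "GROUP BY day ORDER BY day DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql):
    """Run the query with the BigQuery client library (needs credentials and a project)."""
    from google.cloud import bigquery  # imported lazily so the SQL builder works without it

    client = bigquery.Client()
    return list(client.query(sql).result())


sql = build_daily_volume_sql()
print(sql)
# rows = run_query(sql)  # uncomment inside a Kernel or another authenticated environment
```

A query like this may scan a large amount of data, so adding a date filter on `block_timestamp` is worth considering before running it.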

Method & Acknowledgements

Allen Day (Google Cloud Developer Advocate) and Colin Bookman (Google Cloud Customer Engineer) retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

Photo by Andre Francois on Unsplash.

Inspiration

How many bitcoins are sent each day?

How many addresses receive bitcoin each day?

Compare transaction volume to historical prices by joining with other available data sources
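The questions above come down to a daily aggregation followed by a join on the date. The sketch below shows that shape on local data. All rows and prices are invented for illustration, and it assumes transaction values are expressed in satoshi (1 BTC = 100,000,000 satoshi); the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

SATOSHI_PER_BTC = 100_000_000

# Hypothetical rows: (unix timestamp, output value in satoshi); not real chain data.
transactions = [
    (1546300800, 150_000_000),  # 2019-01-01 00:00 UTC
    (1546344000, 50_000_000),   # 2019-01-01 12:00 UTC
    (1546387200, 300_000_000),  # 2019-01-02 00:00 UTC
]
# Hypothetical daily prices in USD, invented for illustration.
prices = {"2019-01-01": 3800.0, "2019-01-02": 3900.0}


def daily_btc(rows):
    """Sum transaction values per UTC day and convert satoshi to BTC."""
    totals = defaultdict(int)
    for ts, satoshi in rows:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        totals[day] += satoshi
    return {day: total / SATOSHI_PER_BTC for day, total in totals.items()}


def join_with_prices(volume, prices):
    """Inner join on the day key: (BTC sent, USD value at that day's price)."""
    return {day: (btc, btc * prices[day]) for day, btc in volume.items() if day in prices}


volume = daily_btc(transactions)
joined = join_with_prices(volume, prices)
for day, (btc, usd) in sorted(joined.items()):
    print(day, btc, usd)
```

Days without a price entry are dropped by the join, which matches inner-join behaviour; a left join would keep them with a missing price instead.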