5 datasets found
  1. SemBenchmarkSearchQueries

    • huggingface.co
    Updated May 28, 2025
    Cite
    vCache (2025). SemBenchmarkSearchQueries [Dataset]. https://huggingface.co/datasets/vCache/SemBenchmarkSearchQueries
    Dataset updated
    May 28, 2025
    Authors
    vCache
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The SemCacheSearchQueries benchmark is designed to evaluate semantic caching in open-domain search applications. Large-scale search engines, such as Google, increasingly rely on LLMs to generate direct answers to natural language queries. While this improves user experience, it introduces significant latency and cost, particularly at the scale of millions of daily queries. Many queries issued to search engines are paraphrased variations of earlier inputs, making semantic caching a natural fit… See the full description on the dataset page: https://huggingface.co/datasets/vCache/SemBenchmarkSearchQueries.

  2. COVID-19 Search Trends symptoms dataset

    • console.cloud.google.com
    Updated Jan 5, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&hl=ca&inv=1&invt=Ab5IOA (2023). COVID-19 Search Trends symptoms dataset [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-search-trends?hl=ca
    Dataset updated
    Jan 5, 2023
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Description

    The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for a broad set of health symptoms, signs, and conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom. This dataset is intended to help researchers to better understand the impact of COVID-19. It shouldn't be used for medical diagnostic, prognostic, or treatment purposes. It also isn't intended to be used for guidance on personal travel plans.

    To learn more about the dataset, how we generate it and preserve privacy, read the data documentation. To visualize the data, try exploring these interactive charts and map of symptom search trends.

    As of Dec. 15, 2020, the dataset was expanded to include trends for Australia, Ireland, New Zealand, Singapore, and the United Kingdom. This expanded data is available in new tables that provide data at country and two subregional levels. We will not be updating existing state/county tables going forward.

    All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets: What is BigQuery.

  3. Data from: Examining bias perpetuation in academic search engines: an...

    • zenodo.org
    bin, csv, zip
    Updated Feb 8, 2024
    Cite
    Ulloa Roberto (2024). Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar [Dataset]. http://doi.org/10.5281/zenodo.10636247
    Available download formats: bin, zip, csv
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ulloa Roberto
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Main dataset (main.csv)

    The main file contains an entry (N=28530) per search result in all collected pages. It comprises the following columns:

    1. id: Unique identifier of the file (corresponds to the last part of the filename)
    2. filename: Name of the file associated with the row (the file is in serp_html.zip)
    3. engine: The search engine used (Google Scholar or Semantic Scholar).
    4. browser: The web browser used for the search (Firefox or Chrome)
    5. region: The geographical region where the search was made.
    6. year: The year when the search was made
    7. month: The month when the search was made
    8. day: The day when the search was made
    9. query: The full search query that was used
    10. query_type: The type of the search query (health or technology)
    11. topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
    12. trt: Treatment variable associated with the search (benefits or risks).
    13. url: The URL of the (article) search result
    14. title: The title of the (article) search result.
    15. authorship: The author(s) of the (article) search result.
    16. abstract_id: Unique identifier for the abstract of the (article) search result which connects with annotated-abstracts_v0.6.xlsx
    17. abstract_hash: Hash value of the abstract for data integrity
    18. link_n: The total number of results in the search page
    19. rank: The rank of the search result on the search engine results page.
    20. annotation: Any annotations associated with the (article's abstract) search result. One of: '1. Confirms benefits', '2. Confirms risks', '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '5. Abstract not related to {topic}'.
    21. valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts only containing benefits
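
    As a rough illustration of how the columns above might be used, the following sketch averages valence per search engine and treatment. It uses only the documented column names (engine, trt, rank, valence); the sample rows are invented for illustration and are not taken from main.csv, and the helper name mean_valence_by is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows (NOT real data) using a subset of the documented columns.
SAMPLE_CSV = """engine,query_type,trt,rank,valence
Google Scholar,health,benefits,1,1
Google Scholar,health,risks,1,-1
Semantic Scholar,health,benefits,1,0
Semantic Scholar,health,benefits,2,1
Google Scholar,technology,risks,2,-1
Google Scholar,technology,benefits,2,1
"""


def mean_valence_by(rows, *keys):
    """Average the valence column, grouped by the given column names."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of valence, row count]
    for row in rows:
        group = tuple(row[key] for key in keys)
        totals[group][0] += int(row["valence"])
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = mean_valence_by(rows, "engine", "trt")
for group, mean in sorted(result.items()):
    print(group, mean)
```

    To try this on the real file, the in-memory sample could presumably be swapped for open("main.csv", newline="", encoding="utf-8"), assuming the file is a standard comma-separated CSV with a header row.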

    Annotated abstracts (annotated-abstracts_v0.6.xlsx)

    Manually annotated abstracts resulting from the searches.

    Raw search engine result pages (serp_html.zip)

    The zip contains one HTML file per collected search engine results page (N=2853). See the filename column of the main dataset.

  4. Query auto-completions for German politicians of the 18th Bundestag

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Cite
    Anastasiia Samokhina; Malte Bonart; Philipp Schaer; Gernot Heisenberg (2020). Query auto-completions for German politicians of the 18th Bundestag [Dataset]. http://doi.org/10.5281/zenodo.3462046
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anastasiia Samokhina; Malte Bonart; Philipp Schaer; Gernot Heisenberg
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Germany
    Description

    bundestag.csv - UTF-8 encoded comma separated text file

    This dataset contains the members of the 18th German Bundestag as constituted in late 2016.

    terms.csv - UTF-8 encoded comma separated text file

    This dataset contains the unordered and pooled auto-completions for the German politicians from Bing search (http://api.bing.net/osjson.aspx), from Duck-Duck-Go (https://duckduckgo.com/ac/) and from Google search (http://clients1.google.de/complete/search). The data was crawled (mostly) twice per day from 2017/02/03 to 2017/06/19. German language settings were used for Google and Bing; the English language setting was used for Duck-Duck-Go. The API requests were sent from an IP address in Cologne, Germany.

    : google, bing or ddg

  5. Bitcoin Blockchain Historical Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.

    Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

    Content

    In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
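
    A minimal sketch of what such a query might look like follows. It builds a SQL string for a per-day transaction count against the table path pattern given above and, optionally, runs it with the google-cloud-bigquery client. The column name block_timestamp is an assumption not confirmed by this description, and the helper names (table_path, daily_transaction_count_sql, run_query) are hypothetical.

```python
# Dataset path as stated in the description above.
PROJECT_DATASET = "bigquery-public-data.crypto_bitcoin"


def table_path(table_name: str) -> str:
    """Return the fully qualified table path for a table in the dataset."""
    return f"{PROJECT_DATASET}.{table_name}"


def daily_transaction_count_sql(
    table_name: str = "transactions",
    timestamp_column: str = "block_timestamp",  # ASSUMED column name
    limit: int = 10,
) -> str:
    """Build a query counting transactions per day, most recent days first."""
    return (
        f"SELECT DATE({timestamp_column}) AS day, COUNT(*) AS tx_count\n"
        f"FROM `{table_path(table_name)}`\n"
        "GROUP BY day\n"
        "ORDER BY day DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the query; needs the third-party google-cloud-bigquery package
    and valid credentials, so the import is deferred until it is called."""
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


sql = daily_transaction_count_sql()
print(sql)
```

    Calling run_query(sql) would submit the query; checking the printed SQL against the actual table schema first is advisable, since the column name is a guess.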

    Method & Acknowledgements

    Allen Day, Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieve data from the Bitcoin network using a custom client available on GitHub that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

    Inspiration

    • How many bitcoins are sent each day?
    • How many addresses receive bitcoin each day?
    • Compare transaction volume to historical prices by joining with other available data sources