14 datasets found
  1. Wordle Answer Search Trends Dataset (2021–2025)

    • kaggle.com
    Updated Jun 26, 2025
    Cite
    Ankush Kamboj (2025). Wordle Answer Search Trends Dataset (2021–2025) [Dataset]. https://www.kaggle.com/datasets/kambojankush/wordle-answer-search-trends-dataset-20212025/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankush Kamboj
    License

    GPL-3.0 (https://www.gnu.org/licenses/gpl-3.0.html)

    Description

    This dataset investigates the relationship between Wordle answers and Google search spikes, particularly for uncommon words. It spans from June 21, 2021 to June 24, 2025.

    It includes daily data for each Wordle answer, its search trend on that day, and frequency-based commonality indicators.

    🔍 Hypothesis

    Each Wordle answer causes a spike in search volume on the day it appears — more so if the word is rare.

    This dataset supports exploration of:

    • Wordle answers
    • Search trends for Wordle answers
    • Correlation between Wordle answer rarity and search interest

    Columns

    Column                    Description
    date                      Date of the Wordle puzzle
    word                      Correct 5-letter Wordle answer
    game                      Wordle game number
    wordfreq_commonality      Normalized frequency score using Python’s wordfreq library
    subtlex_commonality       Normalized frequency score using the SUBTLEX-US dataset
    trend_day_global          Google search interest on the day (global, all categories)
    trend_avg_200_global      200-day average search interest (global, all categories)
    trend_day_language        Search interest on Wordle day (Language Resources category)
    trend_avg_200_language    200-day average search interest (Language Resources category)

    Notes: All trend values are relative (0–100 scale, per Google Trends).

    🧮 Methodology

    • Wordle answers were scraped from wordfinder.yourdictionary.com
    • Commonality scores were computed using:
      • wordfreq Python library
      • SUBTLEX-US dataset (subtitle frequency, approximating spoken English)
    • Trend data was fetched from Google Trends using the pytrends library (see the sketch below)
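
    A minimal sketch of the scoring-and-trends step, assuming the wordfreq and pytrends packages; the example word, timeframe, and category are illustrative rather than the exact settings used to build the dataset:

    # Minimal sketch: commonality score and Google Trends interest for one answer.
    # The word, timeframe, and category below are illustrative assumptions.
    from wordfreq import zipf_frequency
    from pytrends.request import TrendReq

    word = "crane"  # example Wordle answer, not taken from the dataset

    # wordfreq commonality: Zipf frequency (roughly 1 = very rare, 7 = very common)
    commonality = zipf_frequency(word, "en")

    # Relative search interest (0-100) around the puzzle date
    pytrends = TrendReq(hl="en-US", tz=0)
    pytrends.build_payload([word], cat=0, timeframe="2022-01-01 2022-07-20")
    trend = pytrends.interest_over_time()

    print(commonality)
    print(trend[word].tail())

    Looping this over every puzzle date and averaging the preceding 200 days would, in principle, reproduce the trend columns described above.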

    📊 Analysis

    Analysis based on this data can be found in the accompanying blog post.

  2. Drive_Stats

    • huggingface.co
    Updated Apr 10, 2013
    Cite
    Backblaze (2013). Drive_Stats [Dataset]. https://huggingface.co/datasets/backblaze/Drive_Stats
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 10, 2013
    Dataset provided by
    Backblaze (http://backblaze.com/)
    Authors
    Backblaze
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    Drive Stats

    Drive Stats is a public dataset of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure, which Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records and grows by over 240,000 records per day. It is an append-only dataset: daily statistics, once written, are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
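
    A minimal sketch of loading the dataset with the Hugging Face datasets library; the "train" split and the use of streaming are assumptions made here to avoid downloading the full archive:

    # Minimal sketch: stream records from the Drive Stats dataset on Hugging Face.
    # The repo id comes from the dataset page; the "train" split is an assumption.
    from datasets import load_dataset

    ds = load_dataset("backblaze/Drive_Stats", split="train", streaming=True)

    # Inspect a single daily drive record without materializing the whole dataset
    first_record = next(iter(ds))
    print(first_record)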

  3. MLP-based Learnable Window Size Dataset for Bitcoin Market Price

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    + more versions
    Cite
    Rajabi, Shahab (2023). MLP-based Learnable Window Size Dataset for Bitcoin Market Price [Dataset]. http://doi.org/10.7910/DVN/5YBLKV
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Rajabi, Shahab
    Description

    The dataset of this paper was collected from Google, Blockchain, and the Bitcoin market. There are 26 features in total; however, any feature whose variations correlate with the price variations at a rate lower than 0.3 was eliminated. Hence, a total of 21 practical features were selected, including Market capitalization, Trade-volume, Transaction-fees USD, Average confirmation time, Difficulty, High price, Low price, Total hash rate, Block-size, Miners-revenue, N-transactions-total, Google searches, Open price, N-payments-per Block, Total circulating Bitcoin, Cost-per-transaction percent, Fees-USD-per transaction, N-unique-addresses, N-transactions-per block, and Output-volume. In addition to the values of these features, for each feature a new one is created that contains the difference between the previous day and the day before the previous day, as a supportive feature. In terms of dataset size and history, a total of 1,275 training samples, collected from 12 Nov 2018 to 4 Jun 2021, were used in the proposed model to extract patterns of the Bitcoin price.
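
    A minimal pandas sketch of the two preprocessing steps described above, i.e. dropping features whose variations correlate with price variations below 0.3 and adding the day-over-day difference features; the input file and column names are hypothetical placeholders rather than the paper's identifiers:

    # Minimal sketch: correlation-based feature elimination plus difference features.
    # "bitcoin_features.csv" and the "open_price" column are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("bitcoin_features.csv", parse_dates=["date"]).set_index("date")

    # Day-over-day variations of the price and of every candidate feature
    changes = df.diff().dropna()

    # Keep features whose correlation with price variations is at least 0.3
    corr = changes.corr()["open_price"].abs()
    selected = df[corr[corr >= 0.3].index]

    # Supportive features: difference between the previous day and the day before it
    diffs = (selected.shift(1) - selected.shift(2)).add_suffix("_prev_diff")
    dataset = selected.join(diffs).dropna()
    print(dataset.shape)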

  4. COVID19 - The New York Times

    • kaggle.com
    zip
    Updated May 18, 2020
    + more versions
    Cite
    Google BigQuery (2020). COVID19 - The New York Times [Dataset]. https://www.kaggle.com/bigquery/covid19-nyt
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    May 18, 2020
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    Description

    Context

    This is the US coronavirus data repository from The New York Times. This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here. For additional reporting and data visualizations, see The New York Times’ U.S. coronavirus interactive site.

    Sample Queries

    Query 1

    Which US counties have the most confirmed cases per capita? This query determines which counties have the most cases per 100,000 residents. Note that this may differ from similar queries of other datasets because of differences in reporting lag, methodologies, or other dataset differences.

    SELECT
      covid19.county,
      covid19.state_name,
      total_pop AS county_population,
      confirmed_cases,
      ROUND(confirmed_cases / total_pop * 100000, 2) AS confirmed_cases_per_100000,
      deaths,
      ROUND(deaths / total_pop * 100000, 2) AS deaths_per_100000
    FROM `bigquery-public-data.covid19_nyt.us_counties` covid19
    JOIN `bigquery-public-data.census_bureau_acs.county_2017_5yr` acs
      ON covid19.county_fips_code = acs.geo_id
    WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND covid19.county_fips_code != "00000"
    ORDER BY confirmed_cases_per_100000 DESC

    Query 2

    How do I calculate the number of new COVID-19 cases per day? This query determines the total number of new cases in each state for each day available in the dataset.

    SELECT
      b.state_name,
      b.date,
      MAX(b.confirmed_cases - a.confirmed_cases) AS daily_confirmed_cases
    FROM (
      SELECT
        state_name AS state,
        state_fips_code,
        confirmed_cases,
        DATE_ADD(date, INTERVAL 1 DAY) AS date_shift
      FROM `bigquery-public-data.covid19_nyt.us_states`
      WHERE confirmed_cases + deaths > 0
    ) a
    JOIN `bigquery-public-data.covid19_nyt.us_states` b
      ON a.state_fips_code = b.state_fips_code
      AND a.date_shift = b.date
    GROUP BY b.state_name, date
    ORDER BY date DESC

  5. Data (i.e., evidence) about evidence based medicine

    • figshare.com
    • search.datacite.org
    png
    Updated May 30, 2023
    Cite
    Jorge H Ramirez (2023). Data (i.e., evidence) about evidence based medicine [Dataset]. http://doi.org/10.6084/m9.figshare.1093997.v24
    Explore at:
    Available download formats: png
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jorge H Ramirez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update — December 7, 2014. – Evidence-based medicine (EBM) is not working for many reasons, for example: 1. Incorrect in their foundations (paradox): hierarchical levels of evidence are supported by opinions (i.e., lowest strength of evidence according to EBM) instead of real data collected from different types of study designs (i.e., evidence). http://dx.doi.org/10.6084/m9.figshare.1122534 2. The effect of criminal practices by pharmaceutical companies is only possible because of the complicity of others: healthcare systems, professional associations, governmental and academic institutions. Pharmaceutical companies also corrupt at the personal level, politicians and political parties are on their payroll, medical professionals seduced by different types of gifts in exchange of prescriptions (i.e., bribery) which very likely results in patients not receiving the proper treatment for their disease, many times there is no such thing: healthy persons not needing pharmacological treatments of any kind are constantly misdiagnosed and treated with unnecessary drugs. Some medical professionals are converted in K.O.L. which is only a puppet appearing on stage to spread lies to their peers, a person supposedly trained to improve the well-being of others, now deceits on behalf of pharmaceutical companies. Probably the saddest thing is that many honest doctors are being misled by these lies created by the rules of pharmaceutical marketing instead of scientific, medical, and ethical principles. Interpretation of EBM in this context was not anticipated by their creators. “The main reason we take so many drugs is that drug companies don’t sell drugs, they sell lies about drugs.” ―Peter C. Gøtzsche “doctors and their organisations should recognise that it is unethical to receive money that has been earned in part through crimes that have harmed those people whose interests doctors are expected to take care of. Many crimes would be impossible to carry out if doctors weren’t willing to participate in them.” —Peter C Gøtzsche, The BMJ, 2012, Big pharma often commits corporate crime, and this must be stopped. Pending (Colombia): Health Promoter Entities (In Spanish: EPS ―Empresas Promotoras de Salud).

    1. Misinterpretations New technologies or concepts are difficult to understand in the beginning, it doesn’t matter their simplicity, we need to get used to new tools aimed to improve our professional practice. Probably the best explanation is here in these videos (credits to Antonio Villafaina for sharing these videos with me). English https://www.youtube.com/watch?v=pQHX-SjgQvQ&w=420&h=315 Spanish https://www.youtube.com/watch?v=DApozQBrlhU&w=420&h=315 ----------------------- Hypothesis: hierarchical levels of evidence based medicine are wrong Dear Editor, I have data to support the hypothesis described in the title of this letter. Before rejecting the null hypothesis I would like to ask the following open question:Could you support with data that hierarchical levels of evidence based medicine are correct? (1,2) Additional explanation to this question: – Only respond to this question attaching publicly available raw data.– Be aware that more than a question this is a challenge: I have data (i.e., evidence) which is contrary to classic (i.e., McMaster) or current (i.e., Oxford) hierarchical levels of evidence based medicine. An important part of this data (but not all) is publicly available. References
    2. Ramirez, Jorge H (2014): The EBM challenge. figshare. http://dx.doi.org/10.6084/m9.figshare.1135873
    3. The EBM Challenge Day 1: No Answers. Competing interests: I endorse the principles of open data in human biomedical research. Read this letter on The BMJ – August 13, 2014. http://www.bmj.com/content/348/bmj.g3725/rr/762595 Re: Greenhalgh T, et al. Evidence based medicine: a movement in crisis? BMJ 2014; 348: g3725. Fileset contents. Raw data: Excel archive: Raw data, interactive figures, and PubMed search terms. Google Spreadsheet is also available (URL below the article description). Figure 1. Unadjusted (Fig 1A) and adjusted (Fig 1B) PubMed publication trends (01/01/1992 to 30/06/2014). Figure 2. Adjusted PubMed publication trends (07/01/2008 to 29/06/2014). Figure 3. Google search trends: Jan 2004 to Jun 2014 / 1-week periods. Figure 4. PubMed publication trends (1962-2013): systematic reviews and meta-analysis, clinical trials, and observational studies.
      Figure 5. Ramirez, Jorge H (2014): Infographics: Unpublished US phase 3 clinical trials (2002-2014) completed before Jan 2011 = 50.8%. figshare. http://dx.doi.org/10.6084/m9.figshare.1121675 Raw data: "13377 studies found for: Completed | Interventional Studies | Phase 3 | received from 01/01/2002 to 01/01/2014 | Worldwide". This database complies with the terms and conditions of ClinicalTrials.gov: http://clinicaltrials.gov/ct2/about-site/terms-conditions Supplementary Figures (S1-S6). PubMed publication delay in the indexation processes does not explain the descending trends in the scientific output of evidence-based medicine. Acknowledgments: I would like to acknowledge the following persons for providing valuable concepts in data visualization and infographics:
    4. Maria Fernanda Ramírez. Professor of graphic design. Universidad del Valle. Cali, Colombia.
    5. Lorena Franco. Graphic design student. Universidad del Valle. Cali, Colombia. Related articles by this author (Jorge H. Ramírez)
    6. Ramirez JH. Lack of transparency in clinical trials: a call for action. Colomb Med (Cali) 2013;44(4):243-6. URL: http://www.ncbi.nlm.nih.gov/pubmed/24892242
    7. Ramirez JH. Re: Evidence based medicine is broken (17 June 2014). http://www.bmj.com/node/759181
    8. Ramirez JH. Re: Global rules for global health: why we need an independent, impartial WHO (19 June 2014). http://www.bmj.com/node/759151
    9. Ramirez JH. PubMed publication trends (1992 to 2014): evidence based medicine and clinical practice guidelines (04 July 2014). http://www.bmj.com/content/348/bmj.g3725/rr/759895 Recommended articles
    10. Greenhalgh Trisha, Howick Jeremy, Maskrey Neal. Evidence based medicine: a movement in crisis? BMJ 2014;348:g3725
    11. Spence Des. Evidence based medicine is broken BMJ 2014; 348:g22
    12. Schünemann Holger J, Oxman Andrew D, Brozek Jan, Glasziou Paul, Jaeschke Roman, Vist Gunn E, et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ 2008; 336:1106
    13. Lau Joseph, Ioannidis John P A, Terrin Norma, Schmid Christopher H, Olkin Ingram. The case of the misleading funnel plot. BMJ 2006; 333:597
    14. Moynihan R, Henry D, Moons KGM (2014) Using Evidence to Combat Overdiagnosis and Overtreatment: Evaluating Treatments, Tests, and Disease Definitions in the Time of Too Much. PLoS Med 11(7): e1001655. doi:10.1371/journal.pmed.1001655
    15. Katz D. A holistic view of evidence based medicine. http://thehealthcareblog.com/blog/2014/05/02/a-holistic-view-of-evidence-based-medicine/
  6. Google energy consumption 2011-2023

    • statista.com
    • ai-chatbox.pro
    Updated Oct 11, 2024
    Cite
    Statista (2024). Google energy consumption 2011-2023 [Dataset]. https://www.statista.com/statistics/788540/energy-consumption-of-google/
    Explore at:
    Dataset updated
    Oct 11, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    Google’s energy consumption has increased over the last few years, reaching 25.9 terawatt hours in 2023, up from 12.8 terawatt hours in 2019. The company has made efforts to make its data centers more efficient through customized high-performance servers, smart temperature and lighting controls, advanced cooling techniques, and machine learning.

    Data centers and energy

    Through its operations, Google pursues a more sustainable impact on the environment by creating efficient data centers that use less energy than average, transitioning towards renewable energy, creating sustainable workplaces, and providing its users with the technological means towards a cleaner future for coming generations. Through its efficient data centers, Google has also managed to divert waste from its operations away from landfills.

    Reducing Google’s carbon footprint

    Google’s clean energy efforts are also related to its efforts to reduce its carbon footprint. Since its commitment to using 100 percent renewable energy, the company has met its targets largely through solar and wind energy power purchase agreements and by buying renewable power from utilities. Google is one of the largest corporate purchasers of renewable energy in the world.

  7. Capturing the Aftermath of the Dobbs v. Jackson Decision in the Google Search Results across 65 U.S. Locations

    • dataverse.harvard.edu
    Updated Jan 17, 2023
    Cite
    Brooke Perreault; Anya Wintner; Lan Dau; Eni Mustafaraj (2023). Capturing the Aftermath of the Dobbs v. Jackson Decision in the Google Search Results across 65 U.S. Locations [Dataset]. http://doi.org/10.7910/DVN/YFAH9X
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Brooke Perreault; Anya Wintner; Lan Dau; Eni Mustafaraj
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Dataset for the paper "Capturing the Aftermath of the Dobbs v. Jackson Decision in the Google Search Results across 65 U.S. Locations", to appear in the proceedings of ICWSM 2023. Starting on the day of the U.S. Supreme Court decision to overturn Roe v. Wade, we collected Google Search result pages for 21 days in 65 U.S. locations for a set of almost 1,700 queries. We stored all the SERPs generated by Google. Because the archives containing these SERPs are much larger than the file limits of the Harvard Dataverse, you can find them at this address: https://cs.wellesley.edu/~credlab/icwsm2023/. Instead, in this repository we will share all the files that were created by parsing some of the information in the SERPs: organic search results, top stories, and embedded tweets. We also provide aggregated statistics for the domains appearing in the organic results and the top stories. This dataset can be useful for answering questions about Google Search's algorithms with respect to shaping access to information related to important news events.

  8. Ethereum Cryptocurrency

    • console.cloud.google.com
    Updated Apr 22, 2023
    + more versions
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Ethereum&hl=ES&inv=1&invt=Ab2tpQ (2023). Ethereum Cryptocurrency [Dataset]. https://console.cloud.google.com/marketplace/product/ethereum/crypto-ethereum-blockchain?hl=ES
    Explore at:
    Dataset updated
    Apr 22, 2023
    Dataset provided by
    Google (http://google.com/)
    Description

    Ethereum is a cryptocurrency which leverages blockchain technology to store transactions in a distributed ledger. A blockchain is an ever-growing "tree" of blocks, where each block contains a number of transactions. To learn more, read the "Ethereum in BigQuery: a Public Dataset for smart contract analytics" blog post by Google Developer Advocate Allen Day. This dataset is part of a larger effort to make cryptocurrency data available in BigQuery through the Google Cloud Public Datasets program. The program is hosting several cryptocurrency datasets, with plans to both expand offerings to include additional cryptocurrencies and reduce the latency of updates. You can find these datasets by searching "cryptocurrency" in GCP Marketplace. For analytics interoperability, we designed a unified schema that allows all Bitcoin-like datasets to share queries. Interested in learning more about how the data from these blockchains were brought into BigQuery? Looking for more ways to analyze the data? Check out the Google Cloud Big Data blog post and try the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
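
    A minimal sketch of querying the dataset from Python with the BigQuery client library; it assumes configured Google Cloud credentials, and the table and column names should be checked against the current crypto_ethereum schema:

    # Minimal sketch: daily transaction counts from the public Ethereum dataset.
    # Assumes Google Cloud credentials are set up; verify table/column names
    # against the current bigquery-public-data.crypto_ethereum schema.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT
          DATE(block_timestamp) AS day,
          COUNT(*) AS tx_count
        FROM `bigquery-public-data.crypto_ethereum.transactions`
        WHERE block_timestamp >= TIMESTAMP("2023-01-01")
        GROUP BY day
        ORDER BY day
    """

    df = client.query(query).to_dataframe()
    print(df.head())

    Bytes scanned by each query count toward the 1TB/month free tier mentioned above, so it is worth restricting the date range while exploring.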

  9. Data from: Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar

    • zenodo.org
    bin, csv, zip
    Updated Feb 8, 2024
    Cite
    Ulloa Roberto (2024). Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar [Dataset]. http://doi.org/10.5281/zenodo.10636247
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ulloa Roberto
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Main dataset (main.csv)

    The main file contains one entry per search result across all collected pages (N=28530). It comprises the following columns (a short loading sketch follows the list):

    1. id: Unique identifier of the file (corresponds to the last part of the filename)
    2. filename: Name of the file associated with the row (the file is in serp_html.zip)
    3. engine: The search engine used (Google Scholar or Semantic Scholar).
    4. browser: The web browser used for the search (Firefox or Chrome)
    5. region: The geographical region where the search was made.
    6. year: The year when the search was made
    7. month: The month when the search was made
    8. day: The day when the search was made
    9. query: The full search query that was used
    10. query_type: The type of the search query (health or technology)
    11. topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
    12. trt: Treatment variable associated with the search (benefits or risks).
    13. url: The URL of the (article) search result
    14. title: The title of the (article) search result.
    15. authorship: The author(s) of the (article) search result.
    16. abstract_id: Unique identifier for the abstract of the (article) search result which connects with annotated-abstracts_v0.6.xlsx
    17. abstract_hash: Hash value of the abstract for data integrity
    18. link_n: The total number of results in the search page
    19. rank: The rank of the search result on the search engine results page.
    20. annotation: Any annotations associated with the (article's abstract) search result. One of: '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '1. Confirms benefits', '2. Confirms risks', '5. Abstract not related to {topic}')
    21. valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts only containing benefits
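
    A minimal pandas sketch of loading main.csv and summarising valence by engine and treatment; the aggregation is illustrative and is not the paper's analysis:

    # Minimal sketch: mean valence of results per engine, query type and treatment.
    # Column names come from the list above; the aggregation itself is illustrative.
    import pandas as pd

    main = pd.read_csv("main.csv")

    # valence: -1 risk-only abstracts, 0 neutral, +1 benefit-only abstracts
    summary = (
        main.groupby(["engine", "query_type", "trt"])["valence"]
            .mean()
            .unstack("trt")
    )
    print(summary)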

    Annotated abstracts (annotated-abstracts_v0.6.xlsx)

    Manually annotated abstracts resulting from the searches.

    Raw search engine result pages (serp_html.zip)

    The zip archive contains one HTML file per collected search engine result page (N=2853). See the filename column in the main dataset.

  10. Bitcoin Blockchain Historical Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.

    Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

    Content

    In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
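
    A minimal sketch of such a query issued from the Python client library, estimating how many bitcoins are sent each day; it assumes configured Google Cloud credentials, and column names should be checked against the current crypto_bitcoin schema:

    # Minimal sketch: total bitcoin sent per day from the public crypto_bitcoin
    # dataset. output_value is assumed to be denominated in satoshi; verify the
    # column names against the current table schema before relying on them.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT
          DATE(block_timestamp) AS day,
          SUM(output_value) / 1e8 AS btc_sent
        FROM `bigquery-public-data.crypto_bitcoin.transactions`
        WHERE block_timestamp >= TIMESTAMP("2019-01-01")
        GROUP BY day
        ORDER BY day
    """

    print(client.query(query).to_dataframe().head())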

    Method & Acknowledgements

    Allen Day (Twitter | Medium), Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables stay fresh, as new data is appended to them when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

    Photo by Andre Francois on Unsplash.

    Inspiration

    • How many bitcoins are sent each day?
    • How many addresses receive bitcoin each day?
    • Compare transaction volume to historical prices by joining with other available data sources
  11. Zilliqa Cryptocurrency

    • console.cloud.google.com
    Updated Mar 5, 2021
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Cloud%20Public%20Datasets%20-%20Finance&inv=1&invt=Ab3jKQ (2021). Zilliqa Cryptocurrency [Dataset]. https://console.cloud.google.com/marketplace/product/public-data-finance/crypto-zilliqa-dataset
    Explore at:
    Dataset updated
    Mar 5, 2021
    Dataset provided by
    Google (http://google.com/)
    Description

    Zilliqa is a blockchain platform designed around the concept of sharding. Sharding means dividing the network into several smaller component networks that are able to process transactions in parallel. Cryptocurrency markets are becoming more accessible and analysis is increasing by the day. Gone will be the days when investors jump into crypto to become overnight millionaires, and this is the reason some ICOs are doing well while others have been losing value since January. The Zilliqa platform is among the few projects that seem to be gaining favor from different facets of the financial services sector. Zilliqa aims at maximizing scalability within blockchain tech. The platform has been developed using sharding tech in order to interlink more networks. This dataset is part of a larger effort to make cryptocurrency data available in BigQuery through the Google Cloud Public Datasets program. The program is hosting several cryptocurrency datasets, with plans to both expand offerings to include additional cryptocurrencies and reduce the latency of updates. You can find these datasets by searching "cryptocurrency" in GCP Marketplace. For analytics interoperability, we designed a unified schema that allows all Bitcoin-like datasets to share queries. Interested in learning more about how the data from these blockchains were brought into BigQuery? Looking for more ways to analyze the data? Check out the Google Cloud Big Data blog post and try the sample queries below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using

  12. Lead Scoring Dataset

    • kaggle.com
    zip
    Updated Aug 17, 2020
    Cite
    Amrita Chatterjee (2020). Lead Scoring Dataset [Dataset]. https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset
    Explore at:
    Available download formats: zip (411,028 bytes)
    Dataset updated
    Aug 17, 2020
    Authors
    Amrita Chatterjee
    Description

    Context

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

    Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will then focus on communicating with the promising leads rather than making calls to everyone.

    There are a lot of leads generated in the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, you need to nurture the potential leads well (i.e., educating the leads about the product, communicating constantly, etc.) in order to get a higher lead conversion.

    X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
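
    A minimal scikit-learn sketch of the kind of lead-scoring model described above; the file name is hypothetical, the features are a small subset of the variables listed under Content, and logistic regression is just one illustrative choice:

    # Minimal sketch: score leads by predicted conversion probability.
    # "Leads.csv" is a hypothetical file name; the features are a small subset of
    # the variables described below, and the model choice is illustrative only.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Leads.csv")
    features = ["TotalVisits", "Total Time Spent on Website", "Page Views Per Visit"]
    X = df[features].fillna(0)
    y = df["Converted"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Lead score: predicted conversion probability scaled to 0-100
    lead_scores = (model.predict_proba(X_test)[:, 1] * 100).round().astype(int)
    print(lead_scores[:10])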

    Content

    Variables Description
    • Prospect ID - A unique ID with which the customer is identified.
    • Lead Number - A lead number assigned to each lead procured.
    • Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
    • Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
    • Do Not Email - An indicator variable selected by the customer indicating whether or not they want to be emailed about the course.
    • Do Not Call - An indicator variable selected by the customer indicating whether or not they want to be called about the course.
    • Converted - The target variable. Indicates whether a lead has been successfully converted or not.
    • TotalVisits - The total number of visits made by the customer on the website.
    • Total Time Spent on Website - The total time spent by the customer on the website.
    • Page Views Per Visit - Average number of pages on the website viewed during the visits.
    • Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
    • Country - The country of the customer.
    • Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization', which means the customer had not selected this option while filling the form.
    • How did you hear about X Education - The source from which the customer heard about X Education.
    • What is your current occupation - Indicates whether the customer is a student, unemployed, or employed.
    • What matters most to you in choosing this course - An option selected by the customer indicating their main motive behind taking the course.
    • Search - Indicates whether the customer had seen the ad in any of the listed items.
    • Magazine
    • Newspaper Article
    • X Education Forums
    • Newspaper
    • Digital Advertisement
    • Through Recommendations - Indicates whether the customer came in through recommendations.
    • Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
    • Tags - Tags assigned to customers indicating the current status of the lead.
    • Lead Quality - Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead.
    • Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
    • Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
    • Lead Profile - A lead level assigned to each customer based on their profile.
    • City - The city of the customer.
    • Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile.
    • Asymmetric Profile Index
    • Asymmetric Activity Score
    • Asymmetric Profile Score
    • I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not.
    • A free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
    • Last Notable Activity - The last notable activity performed by the student.

    Acknowledgements

    UpGrad Case Study

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  13. Keywords for Lawyers

    • link-assistant.com
    xlsx
    Updated May 13, 2023
    Cite
    SEO PowerSuite (2023). Keywords for Lawyers [Dataset]. https://www.link-assistant.com/news/keywords-for-lawyers.html
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 13, 2023
    Authors
    SEO PowerSuite
    Description

    A dataset of keywords that are relevant to lawyers, including their definitions, synonyms, antonyms, search volume and costs.

  14. Dataset for Multivariate Bitcoin Price Forecasting

    • figshare.com
    txt
    Updated Apr 22, 2023
    Cite
    Anny Mardjo; Chidchanok Choksuchat (2023). Dataset for Multivariate Bitcoin Price Forecasting. [Dataset]. http://doi.org/10.6084/m9.figshare.22678540.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 22, 2023
    Dataset provided by
    figshare
    Authors
    Anny Mardjo; Chidchanok Choksuchat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected for the period from 01/07/2019 to 31/12/2022. Historical Twitter volume was retrieved from bitinfocharts.com using "Bitcoin" (case-insensitive) as the keyword. Google search volume was retrieved using the Gtrends library. 2,000 tweets per day, sampled at four time intervals, were crawled using the Twitter API with the keyword "Bitcoin". The daily closing prices of Bitcoin, oil, gold, and U.S. stock market indexes (S&P 500, NASDAQ, and Dow Jones Industrial Average) were collected using either the Quantmod or Quandl R libraries.
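
    A rough Python-equivalent sketch of assembling the market-price portion of such a dataset (the paper used R libraries and the Twitter API); the yfinance ticker symbols are assumptions, and the Twitter and Google Trends components are omitted:

    # Rough Python-equivalent sketch (the paper used Quantmod/Quandl in R): daily
    # closing prices for Bitcoin, oil, gold, and U.S. indexes over the study period.
    # The ticker symbols are assumptions, not taken from the paper.
    import yfinance as yf

    tickers = {
        "BTC-USD": "bitcoin",
        "CL=F": "oil",
        "GC=F": "gold",
        "^GSPC": "sp500",
        "^IXIC": "nasdaq",
        "^DJI": "dow_jones",
    }

    prices = yf.download(list(tickers), start="2019-07-01", end="2022-12-31")["Close"]
    prices = prices.rename(columns=tickers).ffill()
    print(prices.tail())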

