100+ datasets found
  1. SEO-Data

    • kaggle.com
    zip
    Updated Mar 4, 2025
    Cite
    Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data
    Explore at:
    Available download formats: zip (22686543 bytes)
    Dataset updated
    Mar 4, 2025
    Authors
    Gerome
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📊 SEO Search Results Dataset (SERP Data)

    Filename: SEO_data.csv
    Size: 56.63 MB
    Rows: ~100,000+
    Columns: 7
    Language: Primarily English (may contain multilingual snippets)

    🔍 Dataset Overview

    This dataset contains structured data scraped from Google Search Engine Results Pages (SERPs), specifically curated for SEO and machine learning research. It includes search rankings and metadata for various keywords, capturing how websites rank and present their content on search engines.

    🧾 Columns Description

    Column Name    Description
    words          The search keyword or query entered into Google
    rank           The result's position on the search engine results page (1 = top)
    title          The meta title of the page
    h1             The primary <h1> tag from the page (if available)
    snippet        The search result snippet/description shown on Google
    links          The URL of the ranked result
    total_result   The total number of search results Google reports for the query
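
    As a minimal sketch of how the seven columns above might be read, the snippet below groups rows by keyword and sorts each group by rank. The sample rows, the example.com URLs, and the `load_serp` helper are hypothetical; it also assumes that `total_result` is stored as a comma-grouped string and that a missing `h1` appears as an empty field, neither of which has been checked against the real SEO_data.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the seven documented columns; the real
# SEO_data.csv is not reproduced here.
SAMPLE = """words,rank,title,h1,snippet,links,total_result
python tutorial,2,Learn Python,,A beginner guide,https://example.com/b,"1,200,000"
python tutorial,1,Python Tutorial,Python Tutorial,Official tutorial,https://example.com/a,"1,200,000"
seo basics,1,SEO Basics,SEO 101,Intro to SEO,https://example.com/c,"350,000"
"""


def load_serp(text):
    """Group rows by keyword and sort each group by rank (1 = top)."""
    by_keyword = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        row["rank"] = int(row["rank"])
        # Assumption: total_result is a comma-grouped string such as "776,000,000".
        row["total_result"] = int(row["total_result"].replace(",", ""))
        # Assumption: a missing h1 is an empty field; store it as None.
        row["h1"] = row["h1"] or None
        by_keyword[row["words"]].append(row)
    for rows in by_keyword.values():
        rows.sort(key=lambda r: r["rank"])
    return dict(by_keyword)


serp = load_serp(SAMPLE)
# URL of the top-ranked result for each keyword.
top = {keyword: rows[0]["links"] for keyword, rows in serp.items()}
print(top)
```

    For the full file, the same function could be fed the text of SEO_data.csv instead of the in-memory sample.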

    📌 Use Cases

    • Keyword ranking analysis
    • SERP feature extraction
    • SEO optimization insights
    • Natural language processing (NLP) tasks on snippets, titles, and headings
    • Predictive modeling for search rankings
    • Trend analysis on keyword frequency and ranking shifts

    📁 Example Record

    words          Artificial intelligence
    rank           1
    title          Beginning Your Journey to Implementing Artificial Intelligence
    h1             Beginning Your Journey...
    snippet        Gérer les éditeurs grâce à des services...
    links          https://www.softwareone.com/...
    total_result   776,000,000

    📎 Notes

    • Multiple rows may exist for the same keyword due to multiple ranked results.
    • Some values (like H1 or snippets) may occasionally be missing or partial due to scraping limitations.
    • Useful for benchmarking search trends or training LLMs on SEO-related text features.

    Enjoy

  2. Google autocomplete data for 'machine learning'

    • kaggle.com
    zip
    Updated Mar 8, 2023
    Cite
    Zaid Qureshi (2023). Google autocomplete data for 'machine learning' [Dataset]. https://www.kaggle.com/datasets/zq1200/google-autocomplete-data-for-machine-learning
    Explore at:
    Available download formats: zip (19329 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Zaid Qureshi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the current suggestions that Google gives for the term "machine learning" when users type it into its search engine on desktop. The data covers searches in the US, Canada, and the UK, in English only.

  3. SERP data from controversial queries on Google and Bing

    • zenodo.org
    bin, csv, zip
    Updated Apr 28, 2025
    Cite
    Sal Hagen; Guillén Torres (2025). SERP data from controversial queries on Google and Bing [Dataset]. http://doi.org/10.5281/zenodo.14919504
    Explore at:
    Available download formats: zip, bin, csv
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sal Hagen; Guillén Torres
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 24, 2024
    Description

    Data for the forthcoming publication 'Contested Components: Studying Interface Enrichment as a Form of Content Moderation on Google and Bing'.

    The datasets contain information on SERP components for Google and Bing when querying 2000 controversial and 914 non-controversial questions.

    Files include:

    • question_data.csv: Information on questions sourced from 4chan and leftychan boards in November 2024. Columns include the counts per board (/fit/, /b/, /pol/, /int/, /k/, /lgbt/, and /leftypol/), categorization as controversial/non-controversial, and toxicity scores determined by Perspective API.
    • serp_components.csv: Information on the SERP data gathered using Zoekplaatje. Collected on 24 November 2024.
    • screenshots.zip: Screenshots of all SERPs. Note that at times, expanding the AI Overview box on Google resulted in the search bar overlaying the generated text.
    • component_analysis.ipynb: Code for analyzing the data.

    Component taxonomy and screenshots

    Search engine   Component name             Count    Notes
    Google          organic                    19,470
    Bing            organic                    18,677
    Google          related-questions          1,534
    Bing            related-queries            1,425
    Bing            info-card                  1,320    each card is its own info-card component
    Google          related-queries            1,289
    Bing            organic-answer             1,140    often summarised through AI-assisted means
    Bing            video-widget               776
    Bing            organic-showcase           752
    Bing            related-questions          725
    Google          ai-overview                499
    Bing            organic-wiki-widget        271
    Google          did-you-mean               223
    Bing            related-queries-carousel   219
    Bing            info-card-image            136
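
    To compare the two engines, the component counts listed above can be summed per engine. The snippet below is a small sketch that copies only the rows shown in this listing (the full taxonomy in the dataset may contain more components) and reports each engine's total along with the share taken by organic results. The variable names are illustrative and are not part of the dataset or its notebook.

```python
from collections import Counter

# (engine, component, count) rows copied from the taxonomy listed above.
ROWS = [
    ("Google", "organic", 19470),
    ("Bing", "organic", 18677),
    ("Google", "related-questions", 1534),
    ("Bing", "related-queries", 1425),
    ("Bing", "info-card", 1320),
    ("Google", "related-queries", 1289),
    ("Bing", "organic-answer", 1140),
    ("Bing", "video-widget", 776),
    ("Bing", "organic-showcase", 752),
    ("Bing", "related-questions", 725),
    ("Google", "ai-overview", 499),
    ("Bing", "organic-wiki-widget", 271),
    ("Google", "did-you-mean", 223),
    ("Bing", "related-queries-carousel", 219),
    ("Bing", "info-card-image", 136),
]

# Total number of listed components per engine.
totals = Counter()
for engine, _component, count in ROWS:
    totals[engine] += count

# Fraction of each engine's listed components that are plain organic results.
organic_share = {
    engine: count / totals[engine]
    for engine, component, count in ROWS
    if component == "organic"
}

for engine in sorted(totals):
    print(engine, totals[engine], f"{organic_share[engine]:.1%}")
```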

  4. Google Search autocomplete data for 'ChatGPT'

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Zaid Qureshi (2023). Google Search autocomplete data for 'ChatGPT' [Dataset]. https://www.kaggle.com/datasets/zq1200/google-search-autocomplete-data-for-chatgpt
    Explore at:
    Available download formats: zip (17318 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Zaid Qureshi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the current suggestions that Google gives for the term "ChatGPT" when users type it into its search engine on desktop. The data covers searches in the US, Canada, and the UK, in English only.

  5. DataForSEO Google Keyword Database, historical and current

    • datarade.ai
    .json, .csv
    Updated Mar 14, 2023
    Cite
    DataForSEO (2023). DataForSEO Google Keyword Database, historical and current [Dataset]. https://datarade.ai/data-products/dataforseo-google-keyword-database-historical-and-current-dataforseo
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    DataForSEO
    Area covered
    Cyprus, Bolivia (Plurinational State of), Bahrain, Uruguay, Turkey, Singapore, Spain, Bangladesh, El Salvador, Canada
    Description

    You can check the field descriptions in the documentation: current Keyword database: https://docs.dataforseo.com/v3/databases/google/keywords/?bash; Historical Keyword database: https://docs.dataforseo.com/v3/databases/google/history/keywords/?bash. You don’t have to download fresh data dumps in JSON or CSV – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.

  6. Search Engine — GitHub Repository Rankings

    • ossinsight.io
    html
    Updated Jun 23, 2022
    Cite
    OSSInsight (2022). Search Engine — GitHub Repository Rankings [Dataset]. https://ossinsight.io/collections/search-engine
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 23, 2022
    Dataset authored and provided by
    OSSInsight
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2011 - Present
    Variables measured
    Forks, Issues, Commits, Contributors, GitHub stars, Pull requests
    Description

    Open source ranking dataset for Search Engine. Top open source search engines on GitHub — Meilisearch, Typesense, Elasticsearch alternatives. Ranked by stars and developer activity.

  7. DataForSEO Google Full (Keywords+SERP) database, historical data available

    • datarade.ai
    .json, .csv
    Updated Aug 17, 2023
    Cite
    DataForSEO (2023). DataForSEO Google Full (Keywords+SERP) database, historical data available [Dataset]. https://datarade.ai/data-products/dataforseo-google-full-keywords-serp-database-historical-d-dataforseo
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    Aug 17, 2023
    Dataset authored and provided by
    DataForSEO
    Area covered
    Burkina Faso, Paraguay, Portugal, United Kingdom, Côte d'Ivoire, Cyprus, Bolivia (Plurinational State of), Costa Rica, South Africa, Sweden
    Description

    You can check the field descriptions in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.

    Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.

    Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.

    Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.

    This database is available in JSON format only.

    You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.

  8. Google autocomplete data for 'dataset'

    • kaggle.com
    zip
    Updated Mar 8, 2023
    + more versions
    Cite
    Zaid Qureshi (2023). Google autocomplete data for 'dataset' [Dataset]. https://www.kaggle.com/datasets/zq1200/google-autocomplete-data-for-dataset
    Explore at:
    Available download formats: zip (32197 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Zaid Qureshi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains the current suggestions that Google gives for the term "dataset" when users type it into its search engine on desktop. The data covers searches in the US, Canada, and the UK, in English only.

  9. Data for study "Direct Answers in Google Search Results"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 9, 2020
    Cite
    Strzelecki, Artur; Rutecka, Paulina (2020). Data for study "Direct Answers in Google Search Results" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3541091
    Explore at:
    Dataset updated
    Jun 9, 2020
    Dataset provided by
    University of Economics in Katowice
    Authors
    Strzelecki, Artur; Rutecka, Paulina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of this research is to examine direct answers in the Google web search engine. The dataset was collected using Senuto (https://www.senuto.com/), an online tool that extracts data on website visibility from the Google search engine.

    The dataset contains the following elements:

    • keyword
    • number of monthly searches
    • featured domain
    • featured main domain
    • featured position
    • featured type
    • featured url
    • content
    • content length

    The dataset with visibility structure has 743,798 keywords that resulted in SERPs with a direct answer.

  10. Search Engineing Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Search Engineing Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/search-engine-marketing-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2025 - 2034
    Area covered
    Global
    Description

    Search Engine Market Outlook

    The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies into search engines is transforming the way search engines operate, offering more personalized and efficient search results, thereby further propelling market growth.
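
    The growth figures quoted above can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes nine compounding years between the 2023 valuation and the 2032 projection; the `implied_cagr` helper is illustrative and does not come from the report.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# USD 124 billion in 2023 and USD 258 billion by 2032, as quoted above.
# Assumption: nine compounding years separate the two figures.
rate = implied_cagr(124, 258, 2032 - 2023)
print(f"Implied CAGR: {rate:.2%}")
```

    Under that assumption the implied rate comes out at roughly 8.5%, in line with the CAGR stated in the report.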

    One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.

    Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.

    The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.

    The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.

    Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to

  11. Quantum-Enhanced Neural Search Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Cite
    Growth Market Reports (2025). Quantum-Enhanced Neural Search Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/quantum-enhanced-neural-search-engine-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Quantum-Enhanced Neural Search Engine Market Outlook

    According to our latest research, the Quantum-Enhanced Neural Search Engine market size reached USD 1.82 billion globally in 2024, reflecting the rapid adoption of quantum computing and advanced neural network architectures in enterprise search solutions. The market is projected to grow at a robust CAGR of 28.7% from 2025 to 2033, culminating in a forecasted market size of USD 15.46 billion by the end of 2033. This remarkable trajectory is primarily driven by the demand for highly efficient, accurate, and context-aware search engines capable of processing vast and complex datasets across industries.

    Several key growth factors are propelling the quantum-enhanced neural search engine market forward. The exponential increase in unstructured data, combined with the limitations of classical search algorithms, has created a significant need for more sophisticated search technologies. Quantum computing, when integrated with neural search algorithms, delivers unparalleled computational power and speed, enabling real-time semantic understanding and contextual relevance in search results. Organizations across sectors such as healthcare, finance, and e-commerce are investing heavily in these technologies to improve data-driven decision-making, enhance user experiences, and maintain a competitive edge in the digital era. The synergy between quantum computing and neural networks is unlocking new possibilities for natural language processing, image recognition, and predictive analytics, further fueling market growth.

    Another significant driver is the growing adoption of artificial intelligence and machine learning across enterprise operations. As businesses transition towards digital transformation, the need for intelligent search capabilities that can extract actionable insights from massive datasets becomes increasingly critical. Quantum-enhanced neural search engines offer a transformative leap in search efficiency, delivering faster and more accurate results than traditional systems. This is particularly valuable for industries dealing with sensitive or time-critical information, such as BFSI and healthcare, where the ability to retrieve relevant data instantaneously can have a direct impact on operational efficiency and customer satisfaction. Additionally, the scalability and adaptability of these solutions make them attractive to both large enterprises and SMEs, supporting widespread market penetration.

    The ongoing advancements in quantum hardware and software ecosystems are also contributing to the market’s expansion. Major technology players and startups alike are investing in the development of quantum processors, quantum-safe algorithms, and hybrid quantum-classical architectures tailored for search applications. As quantum computing becomes more accessible through cloud-based platforms, organizations of all sizes can leverage its power without the need for significant upfront infrastructure investments. This democratization of quantum technology is expected to accelerate adoption rates, drive innovation in search engine design, and lower barriers to entry for new market participants. Furthermore, collaborative efforts between academia, industry, and government agencies are fostering a vibrant ecosystem that supports research, standardization, and commercialization of quantum-enhanced neural search solutions.

    From a regional perspective, North America currently leads the quantum-enhanced neural search engine market, accounting for the largest share in 2024, primarily due to its advanced technological infrastructure, significant R&D investments, and early adoption by key industry players. Europe follows closely, supported by robust governmental initiatives and a strong presence of quantum research institutions. The Asia Pacific region is witnessing the fastest growth, driven by increasing digitalization, expanding tech startups, and supportive regulatory frameworks, particularly in countries like China, Japan, and South Korea. Latin America and the Middle East & Africa are also emerging as promising markets, with growing interest in quantum technologies and AI-driven solutions to address local industry challenges. Each region presents unique opportunities and challenges, shaping the competitive landscape and influencing market dynamics over the forecast period.

  12. Local SEO Strategy Dataset for Pueblo, Colorado Business Optimization and...

    • caseysseo.com
    Updated Jan 11, 2025
    Cite
    Casey Miller (2025). Local SEO Strategy Dataset for Pueblo, Colorado Business Optimization and Market Dominance in 2026 [Dataset]. https://caseysseo.com/local-seo-for-pueblo-businesses-become-the-top-choice-in-2026/
    Explore at:
    Dataset updated
    Jan 11, 2025
    Dataset provided by
    Casey's SEO
    Authors
    Casey Miller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2026
    Area covered
    Pueblo, Belmont District, Pueblo, Pueblo
    Variables measured
    Content Word Count, NAP Inconsistency Impact, Local Search Market Share, Local Organic Traffic Increase, Top Three Google Maps Results Click Rate, Google Maps Ranking Improvement Timeframe, Hyper-Local Keyword Targeting Effectiveness, Local Search Visibility Loss from Technical Issues
    Measurement technique
    Google Business Profile performance monitoring and optimization tracking, Local citation consistency verification across directory platforms, Mobile-first website performance testing and optimization assessment, Customer acquisition analysis through local search channels, Technical SEO auditing using professional tools and manual review, Neighborhood-specific keyword research and search volume analysis, Local search ranking analysis and competitive benchmarking
    Description

    Comprehensive dataset analyzing local search engine optimization strategies, market challenges, and implementation methodologies specifically designed for Pueblo, Colorado businesses seeking to achieve top rankings in local search results and Google Maps positioning during 2026. This dataset encompasses neighborhood-specific optimization techniques, technical implementation guidelines, Google Business Profile optimization protocols, and proven methodologies for building complete local search ecosystems that drive consistent customer acquisition.

  13. Local SEO Strategy Dataset for Pueblo West Businesses 2026

    • caseysseo.com
    Updated Jan 11, 2025
    Cite
    Casey Miller (2025). Local SEO Strategy Dataset for Pueblo West Businesses 2026 [Dataset]. https://caseysseo.com/local-seo-for-pueblo-west-businesses-secure-your-top-spot-in-2026/
    Explore at:
    Dataset updated
    Jan 11, 2025
    Dataset provided by
    Casey's SEO
    Authors
    Casey Miller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025 - 2026
    Area covered
    Sunnequa Area, Colorado, Pueblo West, Belmont Neighborhood
    Variables measured
    Content Word Count, Mobile Experience Priority, Local SEO System Components, Neighborhood Coverage Areas, Local Search Intent Percentage, Service Business Local Intent Rate, Local Visibility Improvement Factor, Mobile Local Search Conversion Rate
    Measurement technique
    Customer behavior tracking and mobile search pattern analysis, Hyper-local keyword research using location-specific search data, Competitive analysis of local search rankings and map pack positions, NAP consistency verification across multiple online platforms, Local search performance analysis through Google Search Console and Google Analytics, Google Business Profile performance monitoring and optimization testing, Local citation audit and directory presence assessment, Mobile user experience testing and local search conversion tracking
    Description

    Comprehensive dataset analyzing local search engine optimization strategies, market insights, and implementation techniques specifically designed for Pueblo West, Colorado businesses. This dataset includes local search behavior patterns, competitive analysis, neighborhood-specific optimization tactics, and proven methodologies for achieving top rankings in Google's map pack and local search results. The data encompasses hyper-local keyword research, citation building strategies, Google Business Profile optimization techniques, and technical SEO requirements tailored for the Pueblo West market landscape.

  14. Data from: Examining bias perpetuation in academic search engines: an...

    • zenodo.org
    bin, csv, zip
    Updated Feb 8, 2024
    Cite
    Ulloa Roberto (2024). Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar [Dataset]. http://doi.org/10.5281/zenodo.10636247
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ulloa Roberto
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Main dataset (main.csv)

    The main file contains an entry (N=28530) per search result in all collected pages. It comprises the following columns:

    1. id: Unique identifier of the file (corresponds to the last part of the filename)
    2. filename: Name of the file associated with the row (the file is in serp_html.zip)
    3. engine: The search engine used (Google Scholar or Semantic Scholar).
    4. browser: The web browser used for the search (Firefox or Chrome)
    5. region: The geographical region where the search was made.
    6. year: The year when the search was made
    7. month: The month when the search was made
    8. day: The day when the search was made
    9. query: The full search query that was used
    10. query_type: The type of the search query (health or technology)
    11. topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
    12. trt: Treatment variable associated with the search (benefits or risks).
    13. url: The URL of the (article) search result
    14. title: The title of the (article) search result.
    15. authorship: The author(s) of the (article) search result.
    16. abstract_id: Unique identifier for the abstract of the (article) search result which connects with annotated-abstracts_v0.6.xlsx
    17. abstract_hash: Hash value of the abstract for data integrity
    18. link_n: The total number of results in the search page
    19. rank: The rank of the search result on the search engine results page.
    20. annotation: Any annotations associated with the (article's abstract) search result. One of: '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '1. Confirms benefits', '2. Confirms risks', '5. Abstract not related to {topic}'
    21. valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts only containing benefits
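
    As a rough illustration of how these columns could be used, the sketch below averages `valence` for each combination of `engine` and `trt`. The sample rows are hypothetical, only four of the 21 columns are included, and the exact strings stored in `engine` and `trt` are assumed from the column descriptions rather than checked against main.csv.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical rows with a subset of the documented columns.
SAMPLE = """engine,trt,rank,valence
Google Scholar,benefits,1,1
Google Scholar,benefits,2,0
Google Scholar,risks,1,-1
Semantic Scholar,benefits,1,1
Semantic Scholar,risks,1,-1
Semantic Scholar,risks,2,0
"""


def mean_valence(text):
    """Average valence per (engine, treatment) pair."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[(row["engine"], row["trt"])].append(int(row["valence"]))
    return {key: mean(values) for key, values in groups.items()}


result = mean_valence(SAMPLE)
for (engine, trt), value in sorted(result.items()):
    print(engine, trt, value)
```

    Rows whose abstracts were not annotated may need to be filtered out first; how the file encodes those is not stated in the description.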

    Annotated abstracts (annotated-abstracts_v0.6.xlsx)

    Manually annotated abstracts resulting from the searches.

    Raw search engine result pages (serp_html.zip)

    The zip contains one HTML file per search engine result page collected (N=2853). See the filename column of the main dataset.

  15. International SEO Strategy: Optimizing for Baidu, Yandex, and Regional...

    • myseosites.blob.core.windows.net
    • caseysseo.com
    json
    Updated Jan 11, 2025
    Cite
    Casey Miller (2025). International SEO Strategy: Optimizing for Baidu, Yandex, and Regional Search Engines [Dataset]. https://myseosites.blob.core.windows.net/developing-an-international-seo-approach-to-5-20251209/index.html
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 11, 2025
    Dataset provided by
    Casey's SEO
    Authors
    Casey Miller
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Area covered
    South Korea, Eastern Europe, Czech Republic, Russia, China
    Variables measured
    Baidu Market Share, Content Word Count, Chinese Internet Users, Behavioral Signal Impact, Yandex Market Share Russia, Mobile Usage Dominance China, Regional Search Engine Coverage, Technical Infrastructure Requirements
    Measurement technique
    Regulatory compliance verification and documentation review, Regional search engine market share analysis, Behavioral signal monitoring and user engagement tracking, Cultural content analysis and localization assessment, Competitor performance benchmarking across multiple search platforms, Technical audit methodologies for regional search engines, Cross-platform keyword research and semantic analysis
    Description

    Comprehensive dataset covering international SEO strategies for regional search engines including Baidu, Yandex, and other regional platforms. This dataset provides detailed optimization techniques, technical requirements, cultural considerations, and market-specific approaches for businesses expanding globally beyond Google's ecosystem. Includes analysis of market share data, technical infrastructure requirements, content localization strategies, and performance metrics for major regional search engines across China, Russia, Eastern Europe, South Korea, and Czech Republic markets.

  16. Data from: Inventory of online public databases and repositories holding...

    • agdatacommons.nal.usda.gov
    • data.wu.ac.at
    txt
    Updated Feb 8, 2024
    + more versions
    Cite
    Erin Antognoli; Jonathan Sears; Cynthia Parr (2024). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. http://doi.org/10.15482/USDA.ADC/1389839
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Erin Antognoli; Jonathan Sears; Cynthia Parr
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.

    Purpose

    As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
    • compare how much data is in institutional vs. domain-specific vs. federal platforms
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.

    We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.

    Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo.

    Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation

    We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results

    A summary of the major findings from our data review:

    • Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
    • There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    • Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
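
    The evaluation thresholds described above (1% and 5% of a collection, more than 100 and more than 500 results) can be sketched as a small helper. This is an illustration only: the function name, field names and the example counts in the usage note are hypothetical and do not come from the study's data files.

```python
def evaluate_term(term_hits: int, total_datasets: int) -> dict:
    """Flag a search term against the share and count thresholds.

    term_hits      -- number of results the term returned in a repository
    total_datasets -- total number of datasets in that repository
    """
    share = term_hits / total_datasets if total_datasets else 0.0
    return {
        "share_pct": round(100 * share, 2),
        "at_least_1_pct": share >= 0.01,
        "at_least_5_pct": share >= 0.05,
        "over_100_results": term_hits > 100,
        "over_500_results": term_hits > 500,
    }
```

    For instance, a made-up repository of 10,000 datasets where a term returns 600 results would pass both the 5% and the 500-result thresholds, the combination the findings above highlight.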

    See included README file for descriptions of each individual data file in this dataset.

    Resources in this dataset:

    • Resource Title: Journals. File Name: Journals.csv
    • Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
    • Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
    • Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt

  17. Grepsr| Trip Advisor Property Address and Reviews | Global Coverage with...

    • datarade.ai
    Updated Jan 1, 2023
    Cite
    Grepsr (2023). Grepsr| Trip Advisor Property Address and Reviews | Global Coverage with Custom and On-demand Datasets [Dataset]. https://datarade.ai/data-products/grepsr-trip-advisor-property-address-and-reviews-global-co-grepsr
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    Grepsr
    Area covered
    Cuba, Italy, Turkey, Andorra, Myanmar, Greece, Benin, Holy See, Croatia, Sao Tome and Principe
    Description

    A. Market Research and Analysis: Utilize the Tripadvisor dataset to conduct in-depth market research and analysis in the travel and hospitality industry. Identify emerging trends, popular destinations, and customer preferences. Gain a competitive edge by understanding your target audience's needs and expectations.

    B. Competitor Analysis: Compare and contrast your hotel or travel services with competitors on Tripadvisor. Analyze their ratings, customer reviews, and performance metrics to identify strengths and weaknesses. Use these insights to enhance your offerings and stand out in the market.

    C. Reputation Management: Monitor and manage your hotel's online reputation effectively. Track and analyze customer reviews and ratings on Tripadvisor to identify improvement areas and promptly address negative feedback. Positive reviews can be leveraged for marketing and branding purposes.

    D. Pricing and Revenue Optimization: Leverage the Tripadvisor dataset to analyze pricing strategies and revenue trends in the hospitality sector. Understand seasonal demand fluctuations, pricing patterns, and revenue optimization opportunities to maximize your hotel's profitability.

    E. Customer Sentiment Analysis: Conduct sentiment analysis on Tripadvisor reviews to gauge customer satisfaction and sentiment towards your hotel or travel service. Use this information to improve guest experiences, address pain points, and enhance overall customer satisfaction.

    F. Content Marketing and SEO: Create compelling content for your hotel or travel website based on the popular keywords, topics, and interests identified in the Tripadvisor dataset. Optimize your content to improve search engine rankings and attract more potential guests.

    G. Personalized Marketing Campaigns: Use the data to segment your target audience based on preferences, travel habits, and demographics. Develop personalized marketing campaigns that resonate with different customer segments, resulting in higher engagement and conversions.

    H. Investment and Expansion Decisions: Access historical and real-time data on hotel performance and market dynamics from Tripadvisor. Utilize this information to make data-driven investment decisions, identify potential areas for expansion, and assess the feasibility of new ventures.

    I. Predictive Analytics: Utilize the dataset to build predictive models that forecast future trends in the travel industry. Anticipate demand fluctuations, understand customer behavior, and make proactive decisions to stay ahead of the competition.

    J. Business Intelligence Dashboards: Create interactive and insightful dashboards that visualize key performance metrics from the Tripadvisor dataset. These dashboards can help executives and stakeholders get a quick overview of the hotel's performance and make data-driven decisions.

    Incorporating the Tripadvisor dataset into your business processes will enhance your understanding of the travel market, facilitate data-driven decision-making, and provide valuable insights to drive success in the competitive hospitality industry.

  18. Data from: Semantic Query Analysis from the Global Science Gateway

    • datasearch.gesis.org
    • ssh.datastations.nl
    Updated Jan 23, 2020
    Cite
    Carlesi, Dr. C. (Istituto di Scienze e Tecnologie dell’informazione “A. Faedo”, CNR-ISTI, Italy), DataCollector (2020). Semantic Query Analysis from the Global Science Gateway [Dataset]. http://doi.org/10.17026/dans-25m-fhe2
    Explore at:
    Dataset updated
    Jan 23, 2020
    Dataset provided by
    DANS (Data Archiving and Networked Services)
    Authors
    Carlesi, Dr. C. (Istituto di Scienze e Tecnologie dell’informazione “A. Faedo”, CNR-ISTI, Italy), DataCollector
    Description

    Nowadays web portals play an essential role in searching and retrieving information across many fields of knowledge: they are ever more technologically advanced and designed to support the storage of a huge amount of natural-language information originating from the queries launched by users worldwide. A good example is the WorldWideScience search engine:

    The database is available at http://worldwidescience.org/. It is based on a similar gateway, Science.gov, which is the major path to U.S. government science information, as it pulls together Web-based resources from various agencies. The information in the database is intended to be of high quality and authority, as well as the most current available from the participating countries in the Alliance, so users will find that the results will be more refined than those from a general search of Google. It covers the fields of medicine, agriculture, the environment, and energy, as well as basic sciences. Most of the information may be obtained free of charge (the database itself may be used free of charge) and is considered “open domain.” As of this writing, there are about 60 countries participating in WorldWideScience.org, providing access to 50+ databases and information portals. Not all content is in English. (Bronson, 2009)

    Given this scenario, we focused on building a corpus from the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal. The aim is to retrieve information related to social media, which today represent a considerable source of data that is more and more widely used for research purposes. This project includes eight months of query logs registered between July 2017 and February 2018, for a total of 445,827 queries. The analysis mainly concentrates on the semantics of the queries received from the portal clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic, is evolving and follows – as well as reflects – the cultural changes of our modern society.

  19. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • gimi9.com
    • +1 more
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology http://www.nist.gov/
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
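
    The description mentions evaluating search engines by early precision. As a rough illustration of that idea, precision at a cutoff k can be computed from a ranked list and a set of judged-relevant identifiers. This is a generic sketch, not the track's official evaluation tooling; the function name and the identifiers in the usage note are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are judged relevant.

    ranked_ids   -- document or passage ids, best first
    relevant_ids -- set of ids judged relevant for the query
    k            -- cutoff; must be a positive integer
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalises runs shorter than the cutoff.
    return hits / k
```

    With a hypothetical ranking ["d3", "d1", "d7", "d2"] and relevant set {"d1", "d2"}, one of the top two results is relevant, so precision at 2 is one half.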

  20. Chatbot Training Dataset | QuantLens OpenChat Corpus | 10M+ Real User–AI...

    • datarade.ai
    .parquet
    Updated Jan 28, 2026
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    QuantLens (2026). Chatbot Training Dataset | QuantLens OpenChat Corpus | 10M+ Real User–AI Multi-Turn Conversations | ChatGPT, Claude, Gemini | Training Dialogue Corpus [Dataset]. https://datarade.ai/data-products/chatbot-training-dataset-quantlens-openchat-corpus-10m-quantlens
    Explore at:
    Available download formats: .parquet
    Dataset updated
    Jan 28, 2026
    Dataset authored and provided by
    QuantLens
    Area covered
    South Africa, Somalia, Greenland, Dominica, Finland, Macao, Syrian Arab Republic, Hong Kong, Jersey, Spain
    Description

    QuantLens OpenChat Corpus is a curated collection of 44.9 million conversational turns between real users and leading AI assistants. Unlike raw scrapes, the corpus is processed through QuantLens Active Redaction to remove PII and standardize structure—so teams can train, evaluate, and analyze at enterprise scale.

    Why OpenChat Corpus?

    Real-world conversation data is messy (multiple languages, diverse intent, adversarial prompts) and often unsafe (emails, phone numbers, IPs, identifiers). This dataset preserves real usage patterns while delivering commercial-grade safety and consistency.

    Key Features

    Massive Scale: 6.8M conversations and 44.9M turns (≈45M).

    PII Redaction: Emails, phone numbers, IP addresses, and identifiers scrubbed via semantic tagging/redaction.

    Analytics-Ready Parquet: Snappy-compressed Apache Parquet, optimized for fast queries and ML pipelines.

    Hive Partitioning: Organized for zero-ETL ingestion (e.g., source/split/lang).

    Multi-Source Diversity: Harmonized from 10+ major open conversation datasets, including WildChat (4.8M), UltraChat, LMSYS Chat 1M, and Chatbot Arena.

    Rich Metadata: Language detection, model identifiers, toxicity signals, and role labels (user/assistant).
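
    Hive partitioning encodes partition columns as key=value directory names in each file path. A minimal sketch of recovering those columns from a path follows, assuming the source/split/lang layout mentioned above; the concrete path, partition values and file name in the usage note are hypothetical, since the listing does not show real paths.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict:
    """Extract key=value partition segments from a Hive-style path."""
    partitions = {}
    for segment in PurePosixPath(path).parts:
        # Only directory names of the form key=value carry partition data.
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions
```

    For a made-up path such as corpus/source=wildchat/split=train/lang=en/part-0000.parquet, this returns the source, split and lang values as a dictionary. Parquet readers that understand Hive layouts typically expose these as ordinary columns, so a helper like this is mainly useful for listing or filtering files before loading.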

    Technical Specifications

    File Format: Apache Parquet (Snappy)

    Text Encoding: UTF-8-SIG

    Core Schema: conversation_id, role, text, model, pii_detected, timestamp

    License: QuantLens Commercial Data License (v1)

    Ideal Use Cases

    LLM Fine-Tuning / Instruction Tuning: Train chat models on real prompt/response behavior.

    RLHF & Reward Modeling: Learn preference signals from large-scale conversational patterns.

    Prompt Intelligence: Discover high-performing prompt templates across domains/languages.

    Safety & Alignment: Analyze jailbreak attempts and adversarial prompts in a controlled, redacted corpus.

    Enterprise Analytics: Query conversational trends in Databricks/Snowflake/BigQuery/Athena without custom ETL.

    Target SEO Keywords:

    conversational ai dataset, llm training data, chat dataset parquet, pii redacted dataset, rlhf dataset, instruction tuning dataset, chatbot conversation corpus, openchat corpus, wildchat dataset, ultrachat dataset, lmsys chat dataset, chatbot arena dataset, enterprise llm dataset, multilingual chat data, safety aligned training data

    LLM Training • Conversational Data • Chatbot Logs • Parquet • PII-Redacted • Multilingual • RLHF • Prompt Engineering • Safety/Alignment • Databricks/Snowflake Ready

    FAQ:

    What is the QuantLens OpenChat Corpus? A curated enterprise conversational AI dataset with 44.9M PII-redacted user/assistant turns across 6.8M conversations, delivered in Apache Parquet.

    Is this dataset safe for enterprise use? It is processed through QuantLens Active Redaction with extensive PII scrubbing (emails/phones/IPs/identifiers) and includes metadata such as pii_detected.

    What format is the data delivered in? Snappy-compressed Apache Parquet, Hive-partitioned for fast querying and scalable ingestion.

    • PII-Free: Automated regex and semantic filtering applied to redact sensitive entities.

    • Harmonized Schema: All 10+ source datasets mapped to a unified, consistent column structure.

    • Technical Integrity: Verified via SHA-256 checksums and full-scan auditing (zero corrupt files).
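
    A buyer could repeat the SHA-256 check on downloaded files with the standard library. The sketch below assumes the vendor supplies an expected hex digest per file; how those digests are actually distributed is not stated in the listing, and the function names are illustrative.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in fixed-size chunks keeps memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```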

Cite
Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data

SEO-Data

SEO Titles & Keywords Data From High Ranking Websites In All Industries

Explore at:
Available download formats: zip (22686543 bytes)
Dataset updated
Mar 4, 2025
Authors
Gerome
License

MIT License https://opensource.org/licenses/MIT
License information was derived automatically

Description

📊 SEO Search Results Dataset (SERP Data)

Filename: SEO_data.csv
Size: 56.63 MB
Rows: ~100,000+
Columns: 7
Language: Primarily English (may contain multilingual snippets)

🔍 Dataset Overview

This dataset contains structured data scraped from Google Search Engine Results Pages (SERPs), specifically curated for SEO and machine learning research. It includes search rankings and metadata for various keywords, capturing how websites rank and present their content on search engines.

🧾 Columns Description

Column name: Description
words: The search keyword or query entered into Google
rank: The result's position on the search engine results page (1 = top)
title: The meta title of the page
h1: The primary <h1> tag from the page (if available)
snippet: The search result snippet/description shown on Google
links: The URL of the ranked result
total_result: The total number of search results Google reports for the query

📌 Use Cases

  • Keyword ranking analysis
  • SERP feature extraction
  • SEO optimization insights
  • Natural language processing (NLP) tasks on snippets, titles, and headings
  • Predictive modeling for search rankings
  • Trend analysis on keyword frequency and ranking shifts

📁 Example Record

words: Artificial intelligence
rank: 1
title: Beginning Your Journey to Implementing Artificial Intelligence
h1: Beginning Your Journey...
snippet: Gérer les éditeurs grâce à des services...
links: https://www.softwareone.com/...
total_result: 776,000,000

📎 Notes

  • Multiple rows may exist for the same keyword due to multiple ranked results.
  • Some values (like H1 or snippets) may occasionally be missing or partial due to scraping limitations.
  • Useful for benchmarking search trends or training LLMs on SEO-related text features.
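
The notes above suggest two things to handle when loading the file: several rows per keyword, and occasionally missing h1 or snippet values. A minimal standard-library sketch follows. It assumes the CSV uses the seven column names listed earlier and that total_result may contain thousands separators, as in the example record; the inline sample rows are invented for illustration and are not taken from SEO_data.csv.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using the documented column names.
SAMPLE = """words,rank,title,h1,snippet,links,total_result
coffee,2,Title B,,Snippet B,https://example.com/b,"1,200,000"
coffee,1,Title A,Heading A,Snippet A,https://example.com/a,"1,200,000"
tea,1,Title C,Heading C,,https://example.com/c,900
"""


def load_serp_rows(handle):
    """Group rows by keyword, sorted by rank, with empty fields set to None."""
    by_keyword = defaultdict(list)
    for row in csv.DictReader(handle):
        row["rank"] = int(row["rank"])
        # Strip thousands separators before converting (an assumption
        # about how the real file stores this column).
        row["total_result"] = int(row["total_result"].replace(",", ""))
        row["h1"] = row["h1"] or None
        row["snippet"] = row["snippet"] or None
        by_keyword[row["words"]].append(row)
    for rows in by_keyword.values():
        rows.sort(key=lambda r: r["rank"])
    return dict(by_keyword)


results = load_serp_rows(io.StringIO(SAMPLE))
```

To run this against the real file, the same function could be passed an open file handle for SEO_data.csv in place of the in-memory sample.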

Enjoy
