100+ datasets found
  1. SEO-Data

    • kaggle.com
    zip
    Updated Mar 4, 2025
    Cite
    Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data
    Explore at:
    Available download formats: zip (22686543 bytes)
    Dataset updated
    Mar 4, 2025
    Authors
    Gerome
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📊 SEO Search Results Dataset (SERP Data)

    Filename: SEO_data.csv
    Size: 56.63 MB
    Rows: ~100,000+
    Columns: 7
    Language: Primarily English (may contain multilingual snippets)

    🔍 Dataset Overview

    This dataset contains structured data scraped from Google Search Engine Results Pages (SERPs), specifically curated for SEO and machine learning research. It includes search rankings and metadata for various keywords, capturing how websites rank and present their content on search engines.

    🧾 Columns Description

    Column Name | Description
    words | The search keyword or query entered into Google
    rank | The result's position on the search engine results page (1 = top)
    title | The meta title of the page
    h1 | The primary <h1> tag from the page (if available)
    snippet | The search result snippet/description shown on Google
    links | The URL of the ranked result
    total_result | The total number of search results Google reports for the query

    📌 Use Cases

    • Keyword ranking analysis
    • SERP feature extraction
    • SEO optimization insights
    • Natural language processing (NLP) tasks on snippets, titles, and headings
    • Predictive modeling for search rankings
    • Trend analysis on keyword frequency and ranking shifts

    📁 Example Record

    words: Artificial intelligence
    rank: 1
    title: Beginning Your Journey to Implementing Artificial Intelligence
    h1: Beginning Your Journey...
    snippet: Gérer les éditeurs grâce à des services...
    links: https://www.softwareone.com/...
    total_result: 776,000,000

    📎 Notes

    • Multiple rows may exist for the same keyword due to multiple ranked results.
    • Some values (like H1 or snippets) may occasionally be missing or partial due to scraping limitations.
    • Useful for benchmarking search trends or training LLMs on SEO-related text features.
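
A minimal sketch of reading records shaped like the columns described above, using only Python's standard library. The inline sample is hypothetical: its first row echoes the example record, and its second row is a made-up placeholder. Storing total_result as a comma-grouped string is an assumption based on that single example, not a documented property of SEO_data.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape of SEO_data.csv; the first row echoes the
# example record, the second row is a made-up placeholder.
SAMPLE = '''words,rank,title,h1,snippet,links,total_result
Artificial intelligence,1,Beginning Your Journey to Implementing Artificial Intelligence,Beginning Your Journey...,Gérer les éditeurs grâce à des services...,https://www.softwareone.com/...,"776,000,000"
Artificial intelligence,2,Placeholder title,,Placeholder snippet,https://example.com/,"776,000,000"
'''


def parse_total(value):
    """Convert a comma-grouped count such as '776,000,000' to an int (None if blank)."""
    digits = value.replace(",", "").strip()
    return int(digits) if digits else None


def load_serp_rows(handle):
    """Group SERP rows by keyword, sorted by rank; blank h1/snippet become None."""
    by_keyword = defaultdict(list)
    for row in csv.DictReader(handle):
        row["rank"] = int(row["rank"])
        row["total_result"] = parse_total(row["total_result"])
        for optional in ("h1", "snippet"):
            row[optional] = row[optional] or None
        by_keyword[row["words"]].append(row)
    for rows in by_keyword.values():
        rows.sort(key=lambda r: r["rank"])
    return dict(by_keyword)


results = load_serp_rows(io.StringIO(SAMPLE))
```

Grouping by the words column reflects the note that one keyword can have several ranked rows; mapping blank h1 and snippet values to None keeps missing data distinguishable from real text.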

    Enjoy

  2. Global most searched keywords on Google in 2025, by monthly search volume

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Global most searched keywords on Google in 2025, by monthly search volume [Dataset]. https://www.statista.com/statistics/1366210/most-searched-google-keywords/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2025 - Mar 2025
    Area covered
    Worldwide
    Description

    "*******" was the most frequently searched keyword on Google worldwide, with over ***** million monthly online searches during the analyzed period of January to March in 2025. Furthermore, the search resulted in more than ***** million website visits, or more than **** percent of all traffic. With *** million monthly searches, "***" was the second most popular keyword, and "***********" came in third place with about ****** million searches per month.

  3. company keywords data

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Harshii_Sharma (2025). company keywords data [Dataset]. https://www.kaggle.com/datasets/harshiisharma/company-keywords-data
    Explore at:
    Available download formats: zip (47497 bytes)
    Dataset updated
    Jun 12, 2025
    Authors
    Harshii_Sharma
    Description

    Dataset

    This dataset was created by Harshii_Sharma

    Contents

  4. Global search volume for "AI" keyword 2022-2023

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Global search volume for "AI" keyword 2022-2023 [Dataset]. https://www.statista.com/statistics/1398211/ai-keyword-traffic-volume/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2022 - Mar 2023
    Area covered
    Worldwide
    Description

    Between June 2022 and March 2023, the traffic volume for the keyword "AI" more than tripled, going from around 7.9 million monthly searches to more than 30.4 million during the last month of the measured period. General interest in artificial intelligence (AI) had exploded in markets like the United States by the end of 2022. Likewise, interest in the application programming interfaces (APIs) and plugins of artificial intelligence solutions, especially those of ChatGPT, has seen a major increase since the release of the tool in November 2022.
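
The growth implied by the two quoted volumes can be checked with a quick calculation. This is only a sketch: both inputs are the rounded figures from the description ("around 7.9 million" and "more than 30.4 million"), so the result is approximate.

```python
# Monthly search volumes for "AI" as quoted in the description, in millions.
start_volume = 7.9   # June 2022
end_volume = 30.4    # March 2023

growth_factor = end_volume / start_volume
percent_increase = (growth_factor - 1) * 100

print(f"growth factor: {growth_factor:.2f}x")
print(f"increase: {percent_increase:.0f}%")
```

With these inputs the factor comes out at roughly 3.85, or an increase of about 285%.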

    The artificial intelligence market

    Valued at around 142.3 billion U.S. dollars in 2022, the artificial intelligence market is one of the most promising tech segments for the rest of the decade, with more than five billion U.S. dollars invested in startups, the most notable being the Californian company OpenAI and its flagship application ChatGPT. Disruptive as it is, the adoption of AI has already raised alarm in several industries: it is likely to affect job markets, and it raises concerns about cybercrime and other online misdeeds.

    The future of online search?

    Of all industries, the online search market may feel the impact of OpenAI's new tool most strongly, like a global earthquake. With chatbots providing search results in a dialogue format, the trend of AI-powered search engines unleashed by ChatGPT threw giant companies like Google and Microsoft into a race with startups and other competitors to present the best candidate for this disruptive (and experimental) online solution.

  5. Amazon Electronics Keywords from Hellium10

    • kaggle.com
    zip
    Updated Feb 19, 2024
    Cite
    Burak Gurhan (2024). Amazon Electronics Keywords from Hellium10 [Dataset]. https://www.kaggle.com/datasets/burakgurhan/h10-keywords
    Explore at:
    Available download formats: zip (207596 bytes)
    Dataset updated
    Feb 19, 2024
    Authors
    Burak Gurhan
    Description

    The data was obtained from Helium 10, a popular tool among Amazon sellers that supplies a lot of data for analyzing products, keywords, and markets.

    Amazon does not share its data with third parties, so the figures do not reflect Amazon's real values.

    The data covers the most searched keywords in the Amazon Electronics category. It was created in January 2024, so it reflects data from before that time.

    The data contains 21 columns and more than 4,000 phrases, with details such as Search Volume, Fulfillment Type, Size Tier, and Variation.

    The data is suitable for educational purposes: you can use it to practice exploratory data analysis, data visualization, and data manipulation.

  6. Data from: Search strategies and keyword searches used in 78 residential...

    • osf.io
    Updated Jun 10, 2021
    Cite
    Martin Vuillème (2021). Search strategies and keyword searches used in 78 residential care reviews: a cross-sectional analysis [Dataset]. http://doi.org/10.17605/OSF.IO/7DQKP
    Explore at:
    Dataset updated
    Jun 10, 2021
    Dataset provided by
    Center For Open Science
    Authors
    Martin Vuillème
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    Objectives: To investigate the search strategies and keyword searches used in 95 residential care reviews.

    Study design: Methodological study (cross-sectional).

    Methods: First, I attempted to download the full-text versions of all 95 residential care reviews identified in a recently published project. I then searched the full-texts obtained to identify the database search strategies used. I extracted all residential care keywords used in the first search strategy identified in an Excel file. Keywords related to kinship care were not extracted. All keywords extracted were also added to a personal list of residential care keywords, if not included already. The titles and abstracts of all residential care reviews selected were extracted in an Excel file and analyzed using Excel’s COUNTIF function to identify the most commonly occurring keywords/strings. The sensitivity of residential care keywords found within my personal list but not within search strategies was then assessed using Excel’s COUNTIF function.

    Results: Among the 95 residential care reviews, 5 (5.26%) did not report a search strategy, a search strategy was mentioned but not found for 2 reviews (2.11%) and I could not access the search strategy for 5 reviews (5.26%). Keywords were not extracted from 4 reviews given extensive use of controlled vocabulary (MeSH) or advanced search functions (adj, near, ?, etc.). The only review that did not report searches conducted in English was excluded from analysis. This left 78 review search strategies for analysis. Review authors used from 0 to 53 residential care keywords/strings (mean = 9 keywords, median = 7.5 keywords). 288 unique keywords/strings were used by review authors. The 10 most commonly used keywords were: foster care (51.28%), residential care (47.44%), out of home care (29.49%), out-of-home care (24.36%), group home (20.51%), institutional care (19.23%), children’s home (17.95%), child welfare (16.67%), looked after (16.67%) and looked-after (14.10%). 198 keywords/strings were only found once. The keywords most commonly found within the titles and abstracts of residential care reviews were: foster, foster care, resident, residential, placement, in care, residential care, institution, out-of-home and out-of-home care. Four keywords/strings were found in more than 4% of the titles and abstracts of residential care reviews but could not be identified within residential care reviews search strategies: care setting, out of care, residential youth care and care placement.

    Funding: No funding was received for this work.

    Registration and study protocol: See https://osf.io/7dqkp

    Data and materials: See https://osf.io/7dqkp. All other data should otherwise be included within this manuscript.

    Keywords: Children’s homes, residential care, electronic searches, systematic review
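
The abstract describes counting keyword occurrences in titles and abstracts with Excel's COUNTIF function. A rough Python equivalent is sketched below, assuming COUNTIF was used with wildcard (substring) matching and without case sensitivity; the abstract does not state the exact formula. The function name keyword_share and the sample titles are hypothetical, invented for illustration only.

```python
def keyword_share(texts, keywords):
    """Return {keyword: percentage of texts containing it}, ignoring case."""
    lowered = [text.lower() for text in texts]
    shares = {}
    for keyword in keywords:
        needle = keyword.lower()
        hits = sum(1 for text in lowered if needle in text)
        shares[keyword] = 100 * hits / len(lowered) if lowered else 0.0
    return shares


# Hypothetical titles, not taken from the reviewed studies.
titles = [
    "Outcomes of foster care: a systematic review",
    "Residential care for adolescents: a scoping review",
    "Out-of-home care and education: a review",
    "Foster care and residential care compared",
]

shares = keyword_share(
    titles, ["foster care", "residential care", "out-of-home care"]
)
```

Because matching is by substring, spelling variants such as "out of home care" and "out-of-home care" are counted separately, which mirrors how the abstract reports them as distinct keywords.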

  7. Data from: keywords-data

    • kaggle.com
    zip
    Updated Dec 24, 2024
    Cite
    Quân Phạm Ngọc (2024). keywords-data [Dataset]. https://www.kaggle.com/datasets/ngocquanofficial/keywords-data/versions/2
    Explore at:
    Available download formats: zip (540319 bytes)
    Dataset updated
    Dec 24, 2024
    Authors
    Quân Phạm Ngọc
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Quân Phạm Ngọc

    Released under MIT

    Contents

  8. Key data - Organic SEO Keywords: How to Choose and Use Them (2026 Guide)

    • incremys.com
    html
    Updated Mar 15, 2026
    Cite
    Incremys (2026). Key data - Organic SEO Keywords: How to Choose and Use Them (2026 Guide) [Dataset]. https://www.incremys.com/en/resources/blog/organic-seo-keywords
    Explore at:
    Available download formats: html
    Dataset updated
    Mar 15, 2026
    Dataset authored and provided by
    Incremys
    Variables measured
    • Scaling production whilst keeping quality under control
    • E-E-A-T, reliability and usefulness: what Google actually rewards
    • Authority: when content is not enough and how to build credibility
    • Business metrics: leads, conversion, pipeline contribution and ROI
    • How SERPs dictate formats: rich results, FAQs, videos and category pages
    • Write for people, structure for engines: evidence, examples and readability
    • Prioritise opportunities: volume, competition, seasonality, effort and expected ROI
    Description

    2026 guide to selecting, prioritising and using organic SEO keywords without over-optimising, using a method focused on ROI.

  9. Keywords Studios plc Alternative Data Analytics

    • meyka.com
    Updated Sep 20, 2025
    + more versions
    Cite
    Meyka (2025). Keywords Studios plc Alternative Data Analytics [Dataset]. https://meyka.com/stock/KYYWF/alt-data/
    Explore at:
    Dataset updated
    Sep 20, 2025
    Dataset provided by
    Meyka
    Description

    Non-traditional data signals from social media and employment platforms for KYYWF stock analysis

  10. USA Purchase Intent Data | KeyWord Search | 150M Daily Hashed Emails

    • datarade.ai
    Updated Nov 23, 2025
    Cite
    BIGDBM (2025). USA Purchase Intent Data | KeyWord Search | 150M Daily Hashed Emails [Dataset]. https://datarade.ai/data-products/intent-data-usa-coverage-datamarket-bigdbm
    Explore at:
    Available download formats: .json, .csv, .parquet
    Dataset updated
    Nov 23, 2025
    Dataset authored and provided by
    BIGDBM
    Area covered
    United States
    Description

    BigDBM's Purchase Intent Data transforms how businesses understand and engage with their customers by providing a comprehensive, real-time view of buyer purchase intent across both US consumer and B2B markets. With over 20 years of expertise in building identity graphs, our platform processes more than 110 million distinct hashed email addresses daily, delivering actionable insights that drive measurable ROI.

    Our proprietary methodology combines data from multiple live streams and maps website domains and emails to IAB classification codes, giving you structured insights into market interests and purchase intent. Through advanced natural language processing and a custom five-tiered taxonomy, we extract granular keywords while maintaining broad category classification for flexible targeting.

    Our unique intent intensity scoring quantifies purchase likelihood based on frequency and consistency of interest, while timestamp tracking reveals behavioral shifts and trends over time. With robust privacy compliance and ethical sourcing practices, BigDBM enables organizations to make smarter, faster decisions across customer acquisition, retention, and engagement through applications in account-based marketing, audience expansion, data enrichment, and programmatic advertising.

    We do not provide any phone details from Colorado residents.

  11. Patent AT-E400852-T1: [Translated] SYSTEM, METHOD AND APPARATUS FOR...

    • catalog.data.gov
    • data.zh-tw.virginia.gov
    • +11 more
    Updated Sep 8, 2025
    Cite
    National Center for Biotechnology Information (NCBI) (2025). Patent AT-E400852-T1: [Translated] SYSTEM, METHOD AND APPARATUS FOR PERFORMING A KEYWORD SEARCH [Dataset]. https://catalog.data.gov/dataset/patent-at-e400852-t1-translated-system-method-and-apparatus-for-performing-a-keyword-searc
    Explore at:
    Dataset updated
    Sep 8, 2025
    Dataset provided by
    National Center for Biotechnology Information (NCBI)
    Description

    A keyterm search is a method of searching a database for subsets of the database that are relevant to an input query. First, a number of relational models of subsets of a database are provided. A query is then input. The query can include one or more keyterms. Next, a gleaning model of the query is created. The gleaning model of the query is then compared to each one of the relational models of subsets of the database. The identifiers of the relevant subsets are then output.
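
The flow in this description (model each subset, model the query, compare, output identifiers) can be illustrated with a toy sketch. It assumes that both the "relational models" and the "gleaning model" can be approximated as plain sets of lowercased terms, and that relevance means sharing a minimum number of terms. The patent abstract specifies neither, so every name and data value below is hypothetical.

```python
def build_model(terms):
    """Model a text as the set of its lowercased terms (a simplifying assumption)."""
    return {term.lower() for term in terms}


def keyterm_search(query_terms, subset_models, min_overlap=1):
    """Return identifiers of subsets sharing at least min_overlap terms with the query."""
    query_model = build_model(query_terms)
    relevant = []
    for identifier, model in subset_models.items():
        if len(query_model & model) >= min_overlap:
            relevant.append(identifier)
    return relevant


# Hypothetical database subsets, keyed by identifier.
subset_models = {
    "doc-1": build_model(["keyword", "search", "database"]),
    "doc-2": build_model(["patent", "translation"]),
    "doc-3": build_model(["query", "database", "index"]),
}

hits = keyterm_search(["Database", "query"], subset_models)
strict_hits = keyterm_search(["Database", "query"], subset_models, min_overlap=2)
```

Raising min_overlap narrows the output to subsets that match more of the query's keyterms.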

  12. Query Fan-Out Research Dataset

    • ekamoira.com
    json
    Updated Jan 27, 2026
    Cite
    Ekamoira Research Team (2026). Query Fan-Out Research Dataset [Dataset]. https://www.ekamoira.com/blog/query-fan-out-original-research-on-how-ai-search-multiplies-every-query-and-why-most-brands-are-invisible
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 27, 2026
    Dataset authored and provided by
    Ekamoira Research Team
    Area covered
    Global
    Variables measured
    URLs Analyzed, Keywords Analyzed, Fan-out Citation Lift, Fan-Out Query Stability, Optimal Passage Length Range, AI-Cited Pages Outside Top 10, AI Mode Monthly Users (US + India), Brands Missing AI Citation Opportunities, Citation Multiplier at 0.88+ Cosine Similarity, Spearman Correlation (Fan-Out Coverage vs Citations)
    Measurement technique
    Correlation analysis (Spearman 0.77), Topical coverage analysis, Semantic similarity scoring (cosine distance), Citation probability modeling, Fan-out coverage ratio measurement
    Description

    Original quantitative research dataset analyzing AI search query fan-out behavior across 173,902 URLs and 10,000 keywords with citations across Google AI Mode, ChatGPT, and Perplexity platforms
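
The listing names cosine-based semantic similarity scoring and Spearman correlation as measurement techniques. The sketch below shows textbook versions of both in plain Python; it is not the research team's pipeline, and the embedding vectors, coverage ratios, and citation counts are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical inputs: fan-out coverage ratios and citation counts per page.
coverage = [0.2, 0.5, 0.4, 0.9]
citations = [1, 4, 3, 10]
rho = spearman(coverage, citations)
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

The listing mentions both "cosine distance" and "cosine similarity"; the function above returns similarity, and distance is commonly taken as one minus that value.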

  13. Responses to "Semantic Web: Perspectives" Questionnaire

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Jul 22, 2024
    Cite
    Hogan, Aidan (2024). Responses to "Semantic Web: Perspectives" Questionnaire [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3229400
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    IMFD; DCC, Universidad de Chile
    Authors
    Hogan, Aidan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset provides material relating to a questionnaire entitled "Semantic Web: Perspectives". This questionnaire was addressed to the W3C Semantic Web mailing list (semantic-web@w3.org) and was open to responses from May 12th to May 25th, 2019. A total of 113 responses were collected in this time. The following files are provided:

    public-comments.txt: provides the public comments of respondents in plain text;

    questionnaire-form.pdf: illustrates the design of the questionnaire, including questions, types of responses permitted, etc.;

    questionnaire-responses.tsv: lists the individual responses (without private comments) as a tab-separated values file;

    success-keywords.xlsx: provides a spreadsheet mapping success story responses to a list of keywords, further providing statistics on these keywords;

    wordcloud-bw.svg: provides a word-cloud of success-story keywords in black & white;

    wordcloud-colour.svg: provides a word-cloud of success-story keywords in colour.

    The word-clouds were produced using Jason Davies' online service, copying and pasting the keywords from the success-keywords.xlsx spreadsheet (e.g., Column A, Sheet Statistics) into the text field; the following settings were selected: Orientations from 0° to 0°, Spiral: Rectangular; Scale: n; Number of words: 400; One word per line: ticked; Font: Patua One (must be installed locally beforehand). The resulting SVG files were later modified in a text editor to add a link to the font used, to tighten the bounding box, and to produce a black & white version.

    We thank the respondents for providing their input.

  14. China City Statistical Indicators, 2008

    • aura.american.edu
    Updated Feb 12, 2025
    Cite
    China Data Center (2025). China City Statistical Indicators, 2008 [Dataset]. http://doi.org/10.57912/23844804.v1
    Explore at:
    Dataset updated
    Feb 12, 2025
    Dataset authored and provided by
    China Data Center
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Area covered
    China
    Description

    2008 city statistical data integrated with administrative boundaries for prefecture and county level cities.

  15. OnAudience - Keyword feed, filtered raw data (60B daily data signals), 3.5...

    • datarade.ai
    .csv
    Updated Sep 5, 2023
    Cite
    OnAudience (2023). OnAudience - Keyword feed, filtered raw data (60B daily data signals), 3.5 years historic coverage [Dataset]. https://datarade.ai/data-products/onaudience-keyword-feed-filtered-raw-data-60b-daily-data-onaudience
    Explore at:
    Available download formats: .csv
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    OnAudience
    Area covered
    Cambodia, Equatorial Guinea, Rwanda, San Marino, Angola, Tanzania, United Kingdom, Solomon Islands, Sint Maarten (Dutch part), Swaziland
    Description

    Keyword feed is created by filtering raw data through a specified keyword configuration and allows for tracking web traffic with respect to various topics, e.g.:

    • public companies
    • brands
    • products

    By analyzing the feed, it is possible to evaluate popularity and sentiment surrounding the chosen phrase over time.

  16. MeSH2Wikidata: A set of tools for the interaction between MeSH keywords, OBO...

    • figshare.com
    txt
    Updated Oct 29, 2024
    Cite
    Houcemeddine Turki; Khalil Chebil; Bonaventure Dossou; Chris Emezue; Abraham Owodunni; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha (2024). MeSH2Wikidata: A set of tools for the interaction between MeSH keywords, OBO Foundry, and Wikidata for enriching biomedical knowledge [Dataset]. http://doi.org/10.6084/m9.figshare.24438184.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Houcemeddine Turki; Khalil Chebil; Bonaventure Dossou; Chris Emezue; Abraham Owodunni; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    The work consists of tools for the interaction between Wikidata and OBO Foundry and source codes for the use of MeSH keywords of PubMed publications for the enrichment of biomedical knowledge in Wikidata. This work is funded by the Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning Project within the framework of the Wikimedia Foundation Research Fund.

    To cite the work: Turki, H., Chebil, K., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., & Ben Aouicha, M. (2024). A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Heliyon, 10(19), e38488. doi:10.1016/j.heliyon.2024.e38448.

    Wikidata-OBO

    • tool1.py: A tool for the verification of the semantic alignment between Wikidata and OBO ontologies.
    • frame.py: The layout of Tool 1.
    • tool2.py: A tool for extracting Wikidata relations between OBO ontology items.
    • frame2.py: The layout of Tool 2.
    • tool3.py: A tool for extracting multilingual language data for OBO ontology items from Wikidata.
    • frame4.py: The layout of Tool 3.

    Wikidata-MeSH

    • correct_mesh2matrix_dataset.py: A source code for turning MeSH2Matrix into a smaller dataset for the biomedical relation classification based on the MeSH keywords of PubMed publications, named MiniMeSH2Matrix.
    • build_numpy_dataset.py: A source code for building the numpy files for MiniMeSH2Matrix (Relation type-based classification).
    • label_encoded.csv: A table for the conversion of Wikidata Property IDs into MeSH2Matrix Class IDs.
    • new_encoding.csv: A table for the conversion of Wikidata Property IDs into MiniMeSH2Matrix Class IDs.
    • super_classes_new_dataset_labels.npy: The NumPy File of the labels for the superclass-based classification.
    • new_dataset_labels.npy: The NumPy File of the labels for the relation type-based classification.
    • new_dataset_matrices.npy: The Numpy File of the MiniMeSH2Matrix matrices for biomedical relation classification.
    • first_level_new_data.json: The JSON File for the conversion of relation types to superclasses.
    • build_super_classes.py: A source code for building the numpy files for MiniMeSH2Matrix (Superclass-based classification).
    • FC_MeSH_Model_57_New_Data.ipynb: A Jupyter Notebook for training a Dense Model to perform the relation type-based classification.
    • FC_MeSH_Model_57_New_Data_SuperClasses.ipynb: A Jupyter Notebook for training a Dense Model to perform the superclass-based classification.
    • new_data_best_model_1: A stored edition of the best model for the relation type-based classification.
    • new_data_super_classes_best_model_1: A stored edition of the best model for the superclass-based classification.
    • MiniMeSH2Matrix_SuperClasses_Confusion_Matrix.ipynb: A Jupyter Notebook for generating the confusion matrix for the superclass-based supervised classification.
    • MiniMeSH2Matrix_Supervised_Classification_Agreement.ipynb: A Jupyter Notebook for generating the matrix of agreement between the accurate predictions for superclass-based classification and the ones for relation type-based classification.
    • Adding_References_to_Wikidata.ipynb: A Jupyter Notebook to identify the PubMed ID of relevant references to unsupported Wikidata statements between MeSH terms.
    • MeSH_Statistics.xlsx: Statistical data about MeSH-based items and relations in Wikidata.
    • ref_for_unsupported_statements.csv: Retrieved Relevant PubMed References for 1k unsupported Wikidata statements.
    • evaluate_pubmed_ref_assignment.ipynb: A Jupyter Notebook that generates statistics about reference assignment for a sample of 1k unsupported statements.
    • MeSH_Verification.xlsx: A list of inaccurate or duplicated MeSH IDs in Wikidata, as of August 8th, 2023.
    • WikiRelationsPMI.csv: A list of PMI values for the semantic relations between MeSH terms, as available in Wikidata.
    • WikiRelationsPMIDistribution.xlsx: Distribution of PMI values for all Wikidata relations and for specific Wikidata relation types.
    • WikiRelationsToVerify.xlsx: Wikidata relations needing attention because they involve Wikidata items with inaccurate MeSH IDs, they cannot be found in PubMed, or their PMI values are below the threshold of 2.
    • Mesh_part1.py: A Python code that verifies the accuracy of the MeSH IDs for the Wikidata items.
    • MeshWikiPart.py: A Python code that computes the pointwise mutual information values for Wikidata relations between MeSH keywords based on PubMed.
    • Demo.ipynb: A demo of the MeSH-based biomedical relation validation and classification in French.
    • Id_Term.json: A dict of Medical Subject Headings labels corresponding to MeSH Descriptor ID.
    • dict_mesh.json: Number of the occurrences of MeSH keywords in PubMed.
    • finalmatrix.xlsx: Matrix of PMI values between the 5k most common MeSH Keywords.
    • finalmatrixrev.pkl: Pickle File Edition of the PMI matrix.
    • pmi2.xlsx: List of significant PMI associations between the 5k most common MeSH Keywords reaching a threshold of 2.
    • Generate5kMatrix.py: A Python code that generates the PMI matrix.
    • clean_pmi2.py: A Python code to remove the relations already available in Wikidata from pmi.xlsx.
    • missing_rels.xlsx: The final list of the significant PMI associations that do not exist in Wikidata.
    • item_category.json: A dict for MeSH tree categories corresponding to MeSH items.
    • item_categorization.py: A Python code that generates a dict for MeSH tree categories corresponding to MeSH items.
    • classification.py: A Python code for classifying PMI-generated semantic relations between the most common MeSH Keywords.
    • results.xlsx: The output of the classification of the PMI-generated semantic relations between the most common MeSH Keywords.
    • ClassificationStats.ipynb: A Jupyter Notebook for generating statistical data about the classification.
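
Several of the files above rely on pointwise mutual information (PMI) between MeSH keywords, with a threshold of 2. A minimal sketch of that calculation follows. It assumes base-2 logarithms and probabilities estimated from co-occurrence counts over a corpus of publications; the description states neither detail, and this is not the code from MeshWikiPart.py or Generate5kMatrix.py. The terms and counts are invented.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI = log2( p(x, y) / (p(x) * p(y)) ), estimated from document counts."""
    if min(count_xy, count_x, count_y, total) <= 0:
        return float("-inf")
    p_xy = count_xy / total
    return math.log2(p_xy / ((count_x / total) * (count_y / total)))


def significant_pairs(pair_counts, term_counts, total, threshold=2.0):
    """Keep the term pairs whose PMI reaches the threshold."""
    kept = {}
    for (x, y), count_xy in pair_counts.items():
        score = pmi(count_xy, term_counts[x], term_counts[y], total)
        if score >= threshold:
            kept[(x, y)] = score
    return kept


# Hypothetical counts over a corpus of 1,000 publications.
term_counts = {"term_a": 100, "term_b": 50, "term_c": 50}
pair_counts = {("term_a", "term_b"): 40, ("term_a", "term_c"): 5}
kept = significant_pairs(pair_counts, term_counts, total=1000)
```

In this toy data the first pair co-occurs eight times more often than independence would predict (a PMI of about 3), so it passes the threshold; the second pair co-occurs at roughly the chance rate and is dropped.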

  17. High frequency keywords classified as cultural value perception (Top 20).

    • plos.figshare.com
    xls
    Updated Dec 19, 2024
    Cite
    Qiaoyun Xu; Yan Xu; Chao Ma (2024). High frequency keywords classified as cultural value perception (Top 20). [Dataset]. http://doi.org/10.1371/journal.pone.0315805.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Qiaoyun Xu; Yan Xu; Chao Ma
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    High frequency keywords classified as cultural value perception (Top 20).

  18. Number of models included in the MCS, at the 90% confidence level, using the...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Dean Fantazzini (2023). Number of models included in the MCS, at the 90% confidence level, using the and statistics and the MSE loss function. [Dataset]. http://doi.org/10.1371/journal.pone.0111894.t011
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Dean Fantazzini
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Number of models included in the MCS, at the 90% confidence level, using the and statistics and the MSE loss function.

  19. The data that support the findings of a review paper "From urban data to...

    • zenodo.org
    • data.niaid.nih.gov
    • +2 more
    Updated Sep 2, 2024
    Cite
    Klavdiya Bochenina (2024). The data that support the findings of a review paper "From urban data to city-scale models: A review of traffic simulation case studies" [Dataset]. http://doi.org/10.5281/zenodo.13311538
    Explore at:
    Dataset updated
    Sep 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Klavdiya Bochenina
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    2024
    Description

    This dataset contains the data that were used in a review paper "From urban data to city-scale models: A review of traffic simulation case studies". It contains the following files:

    • keywords with counts.txt - list of keywords and their counts in the considered corpus of traffic simulation case studies. The data were used to produce Figure 2 and Figure 3 in the paper.
    • Papers analysis.xlsx - Excel file containing the data on the reviewed studies. The document has the following sheets:
      • Appendix A - contains a table with the short reference, location, simulation period, spatial scale, simulated units, and marked categories for each paper;
      • Geography - contains data on the geographical distribution of simulated areas across world regions and countries (used to produce Figure 4 in the paper);
      • Software tools - contains data on the simulation tools used in the studies;
      • Journals and conferences - contains data on where the reviewed papers were published.
  20. Starbucks 30-Year 10-K NLP Corpus

    • kaggle.com
    zip
    Updated Mar 15, 2026
    Cite
    shiratori seto (2026). Starbucks 30-Year 10-K NLP Corpus [Dataset]. https://www.kaggle.com/datasets/shiratoriseto/starbucks-30year-nlp-corpus
    Explore at:
    Available download formats: zip (241177 bytes)
    Dataset updated
    Mar 15, 2026
    Authors
    shiratori seto
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What's inside

    Seven analysis-ready CSV files derived from 30 years of Starbucks 10-K annual reports (FY1996–FY2025), covering store expansion data, keyword frequency analysis, LDA topic modeling results, and document-level text statistics.

    Files

    1. store_counts_timeseries.csv (30 rows × 16 columns) — Store counts by segment (Company-operated/Licensed, US/International), percentage breakdowns, and CEO labels for each fiscal year.
    2. item1_keyword_timeseries.csv (30 rows × 70 columns) — Raw counts and relative frequency (per 10K words) for 34 strategic keywords extracted from Item 1 (Business) section. Keywords include: coffee, experience, digital, mobile, china, partner, employee, sustainability, rewards, and more.
    3. item1_lda_topic_proportions.csv (30 rows × 8 columns) — Year-level topic proportions from a 7-topic LDA model trained on 847 text chunks (~150 words each). Topics: Store Operations, Supply Chain & Commodity, Leadership & Governance, Digital & Loyalty, International & IP, People/Culture/ESG, Product & Competition.
    4. item1_basic_stats.csv (30 rows × 6 columns) — Document-level statistics: character count, sentence count, word count, unique words, and lexical diversity (type-token ratio).
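
The document-level statistics in file 4 can be illustrated with a small sketch. This is not the author's code: the tokenizer, the sentence splitter, and the key names (`char_count`, `lexical_diversity`, and so on) are assumptions chosen for illustration, and the dataset's actual column names may differ. Lexical diversity is computed as the type-token ratio, that is, unique words divided by total words.

```python
import re

def basic_stats(text: str) -> dict:
    """Compute simple document-level statistics (hypothetical key names)."""
    # Naive tokenization: lowercase alphabetic runs, optional inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    unique_words = set(words)
    return {
        "char_count": len(text),
        "sentence_count": len(sentences),
        "word_count": len(words),
        "unique_words": len(unique_words),
        # Type-token ratio: distinct word types divided by total tokens.
        "lexical_diversity": len(unique_words) / len(words) if words else 0.0,
    }

sample = "We sell coffee. We sell tea and coffee."
stats = basic_stats(sample)
print(stats)
```

The type-token ratio tends to fall as documents get longer, so comparing it across years is most meaningful when the Item 1 sections have similar lengths.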

    Preprocessing pipeline

    These CSVs were generated from the original 10-K filings through the following steps:

    1. Download 30 annual 10-K filings from SEC EDGAR (CIK: 0000829224)
    2. Extract the Item 1 (Business) section from each filing
    3. Strip HTML/XBRL tags and normalize whitespace
    4. Tokenize, then compute keyword frequencies (raw and per 10K words)
    5. For LDA: chunk each document into ~150-word segments (847 chunks in total), train a 7-topic LDA model, and aggregate topic proportions by year
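
The tokenize, count, and chunk steps of this pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dataset author's implementation: the regex tokenizer, the function names, and the fixed-size chunking rule are all hypothetical, and the sample text is made up. The download, HTML stripping, and LDA training steps are omitted.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic runs (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_frequencies(tokens, keywords):
    """Return raw counts and per-10,000-word frequencies for each keyword."""
    counts = Counter(tokens)
    total = len(tokens)
    row = {"total_words": total}
    for kw in keywords:
        raw = counts[kw]
        row[kw] = raw
        # Relative frequency scaled to occurrences per 10,000 words.
        row[f"{kw}_per10k"] = raw / total * 10_000 if total else 0.0
    return row

def chunk_tokens(tokens, size=150):
    """Split a token list into consecutive segments of at most `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Made-up text: 4 tokens repeated 100 times gives 400 tokens.
text = "Coffee and digital rewards. " * 100
tokens = tokenize(text)
row = keyword_frequencies(tokens, ["coffee", "digital", "china"])
chunks = chunk_tokens(tokens)
print(row["coffee"], row["coffee_per10k"], len(chunks))
```

Each chunk would then be one input document for the topic model, with the per-chunk topic proportions averaged within a fiscal year to obtain year-level values.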

    The raw 10-K texts are not included (file size), but are freely available from SEC EDGAR as public domain documents.

    Use cases

    • Corporate language evolution analysis
    • NLP × financial metrics correlation studies
    • LDA topic modeling on SEC filings
    • Strategic keyword tracking across time
    • Reusable template for any public company's annual reports
    • Business strategy case studies

    Data sources & licenses

    Source | License
    SEC EDGAR 10-K filings | Public domain (US government)
    Store counts | Extracted from 10-K Item 1 text (public domain)

    Related notebooks

    Notebook | Theme
    Manhattan Cafe Wars | Theme 0: EDA & competitor mapping
    Starbucks 10-K NLP | Theme 1: keyword trends, LDA topics, NLP × store count
    Starbucks Spatial Clustering | Theme 2A: Moran's I, LISA, Ripley's K
    Starbucks Location Fitness | Theme 2B: demand-supply scoring & backtest
    Starbucks Data Pipeline | Pipeline: EDGAR & OSM to CSV, data quality report

    Related dataset: Manhattan Café Wars: Starbucks & Subway — spatial data for Theme 0, 2A, 2B

    Column descriptions

    store_counts_timeseries.csv

    Column | Type | Description
    fiscal_year | int | Fiscal year (1996–2025)
    co_us | int | Company-operated stores in the US
    co_international | int | Company-operated stores outside the US
    lic_us | int | Licensed stores in the US
    lic_international | int | Licensed stores outside the US
    total_worldwide | int | Total store count worldwide
    source_note | str | Data extraction note
    co_total | int | Total company-operated stores
    lic_total | int | Total licensed stores
    us_total | int | Total US stores
    intl_total | int | Total international stores
    pct_licensed | float | Percentage of stores that are licensed
    pct_international | float | Percentage of stores outside the US
    yoy_growth | float | Year-over-year growth rate (%)
    yoy_change | int | Year-over-year change in store count
    ceo | str | CEO name for the fiscal year
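
Several of these columns look derivable from the four base counts. The sketch below shows one plausible set of relationships; the formulas are assumptions inferred from the column descriptions, not confirmed by the dataset, and the input numbers are made up rather than taken from the file. Rows are assumed to be sorted by fiscal year.

```python
def derive_columns(rows):
    """Add assumed derived columns to rows sorted by fiscal_year."""
    out = []
    prev_total = None
    for r in rows:
        d = dict(r)
        # Assumed sums over the four base segments.
        d["co_total"] = r["co_us"] + r["co_international"]
        d["lic_total"] = r["lic_us"] + r["lic_international"]
        d["us_total"] = r["co_us"] + r["lic_us"]
        d["intl_total"] = r["co_international"] + r["lic_international"]
        total = r["total_worldwide"]
        # Assumed shares, expressed as percentages of the worldwide total.
        d["pct_licensed"] = d["lic_total"] / total * 100
        d["pct_international"] = d["intl_total"] / total * 100
        if prev_total is None:
            # The first year has no previous year to compare against.
            d["yoy_change"] = None
            d["yoy_growth"] = None
        else:
            d["yoy_change"] = total - prev_total
            d["yoy_growth"] = (total - prev_total) / prev_total * 100
        prev_total = total
        out.append(d)
    return out

# Made-up illustrative values, not real store counts.
rows = [
    {"fiscal_year": 1, "co_us": 60, "co_international": 20,
     "lic_us": 15, "lic_international": 5, "total_worldwide": 100},
    {"fiscal_year": 2, "co_us": 70, "co_international": 25,
     "lic_us": 20, "lic_international": 10, "total_worldwide": 125},
]
derived = derive_columns(rows)
print(derived[1]["pct_licensed"], derived[1]["yoy_growth"])
```

Recomputing the derived columns this way and comparing them with the published values is a quick consistency check before using the file.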

    item1_keyword_timeseries.csv

    Each keyword has two columns: {keyword} (raw count) and {keyword}_per10k (frequency per 10,000 words).

    Column | Description
    fiscal_year | Fiscal year (1996–2025)
    total_words | Total word count of the Item 1 section
    digital / digital_per10k | "digital" occurrences
    mobile / mobile_per10k | "mobile" occurrences
    experience / experience_per10k | "experience" occurrences
    china / china...
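
The {keyword} / {keyword}_per10k naming convention makes it straightforward to detect keyword columns from the header and reshape a wide row into one record per keyword. With 34 keywords, the two fixed columns plus 2 × 34 keyword columns give the 70 columns stated in the file list. The sketch below is illustrative only: the helper names are hypothetical and the sample row uses made-up values.

```python
def keyword_columns(header):
    """Return names that have a matching '<name>_per10k' column in the header."""
    cols = set(header)
    return [
        name for name in header
        if not name.endswith("_per10k") and f"{name}_per10k" in cols
    ]

def to_long(row, keywords):
    """Reshape one wide row into a list of per-keyword records."""
    return [
        {
            "fiscal_year": row["fiscal_year"],
            "keyword": kw,
            "count": row[kw],
            "per10k": row[f"{kw}_per10k"],
        }
        for kw in keywords
    ]

# Made-up header subset and values for illustration.
header = ["fiscal_year", "total_words",
          "digital", "digital_per10k", "mobile", "mobile_per10k"]
row = dict(zip(header, [2020, 8000, 16, 20.0, 4, 5.0]))

keywords = keyword_columns(header)
records = to_long(row, keywords)
print(keywords)
print(records)
```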
Cite
Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data

SEO-Data

SEO Titles & Keywords Data From High Ranking Websites In All Industries

Explore at:
Available download formats: zip (22686543 bytes)
Dataset updated
Mar 4, 2025
Authors
Gerome
License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

📊 SEO Search Results Dataset (SERP Data)

Filename: SEO_data.csv
Size: 56.63 MB
Rows: ~100,000+
Columns: 7
Language: Primarily English (may contain multilingual snippets)

🔍 Dataset Overview

This dataset contains structured data scraped from Google Search Engine Results Pages (SERPs), specifically curated for SEO and machine learning research. It includes search rankings and metadata for various keywords, capturing how websites rank and present their content on search engines.

🧾 Columns Description

Column Name | Description
words | The search keyword or query entered into Google
rank | The result's position on the search engine results page (1 = top)
title | The meta title of the page
h1 | The primary <h1> tag from the page (if available)
snippet | The search result snippet/description shown on Google
links | The URL of the ranked result
total_result | The total number of search results Google reports for the query

📌 Use Cases

  • Keyword ranking analysis
  • SERP feature extraction
  • SEO optimization insights
  • Natural language processing (NLP) tasks on snippets, titles, and headings
  • Predictive modeling for search rankings
  • Trend analysis on keyword frequency and ranking shifts

📁 Example Record

words | rank | title | h1 | snippet | links | total_result
Artificial intelligence | 1 | Beginning Your Journey to Implementing Artificial Intelligence | Beginning Your Journey... | Gérer les éditeurs grâce à des services... | https://www.softwareone.com/... | 776,000,000

📎 Notes

  • Multiple rows may exist for the same keyword due to multiple ranked results.
  • Some values (like H1 or snippets) may occasionally be missing or partial due to scraping limitations.
  • Useful for benchmarking search trends or training LLMs on SEO-related text features.
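
The notes above suggest two practical steps when loading the file: group rows by keyword, and tolerate missing fields. The sketch below is an illustration under assumptions, not a documented loader. It assumes a comma-delimited file with a header row matching the column names listed earlier, and that `total_result` is stored as text with thousands separators, as in the example record. The sample rows and the example.com URLs are made up.

```python
import csv
import io
from collections import defaultdict

def parse_total(value):
    """Convert a string such as '776,000,000' to int; return None if not numeric."""
    digits = value.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None

def group_by_keyword(csv_text):
    """Group results by search keyword and sort each group by rank."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["words"]].append({
            "rank": int(row["rank"]),
            "title": row["title"],
            "h1": row["h1"] or None,  # an empty h1 is treated as missing
            "links": row["links"],
            "total_result": parse_total(row["total_result"]),
        })
    for results in groups.values():
        results.sort(key=lambda r: r["rank"])
    return dict(groups)

# Made-up rows for illustration; not taken from SEO_data.csv.
sample = '''words,rank,title,h1,snippet,links,total_result
example query,2,Second page,,Some snippet,https://example.com/b,"1,234"
example query,1,First page,First heading,Another snippet,https://example.com/a,"1,234"
'''

grouped = group_by_keyword(sample)
print(grouped["example query"][0]["title"])
```

For the real file, the same function could be given the text read from SEO_data.csv, provided the delimiter and header assumptions hold.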

Enjoy
