100+ datasets found

SEO-Data

kaggle.com

zip

Updated Mar 4, 2025

Cite

Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data

Explore at:

Available download formats: zip (22686543 bytes)

Dataset updated

Mar 4, 2025

Authors

Gerome

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

📊 SEO Search Results Dataset (SERP Data)

Filename: SEO_data.csv
Size: 56.63 MB
Rows: ~100,000+
Columns: 7
Language: Primarily English (may contain multilingual snippets)

🔍 Dataset Overview

This dataset contains structured data scraped from Google Search Engine Results Pages (SERPs), specifically curated for SEO and machine learning research. It includes search rankings and metadata for various keywords, capturing how websites rank and present their content on search engines.

🧾 Columns Description

Column Name	Description
`words`	The search keyword or query entered into Google
`rank`	The result's position on the search engine results page (1 = top)
`title`	The meta title of the page
`h1`	The primary `<h1>` tag from the page (if available)
`snippet`	The search result snippet/description shown on Google
`links`	The URL of the ranked result
`total_result`	The total number of search results Google reports for the query

📌 Use Cases

Keyword ranking analysis
SERP feature extraction
SEO optimization insights
Natural language processing (NLP) tasks on snippets, titles, and headings
Predictive modeling for search rankings
Trend analysis on keyword frequency and ranking shifts

📁 Example Record

words	rank	title	h1	snippet	links	total_result
Artificial intelligence	1	Beginning Your Journey to Implementing Artificial Intelligence	Beginning Your Journey...	Gérer les éditeurs grâce à des services...	https://www.softwareone.com/...	776,000,000

📎 Notes

Multiple rows may exist for the same keyword due to multiple ranked results.
Some values (like H1 or snippets) may occasionally be missing or partial due to scraping limitations.
Useful for benchmarking search trends or training LLMs on SEO-related text features.
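
The column layout and notes above can be sketched as a small loader. This is only an illustration under stated assumptions: the two sample rows are hypothetical, the file is assumed to be comma-delimited, and `total_result` is assumed to carry thousands separators as in the example record. The helper names `parse_count` and `group_by_keyword` are invented for this sketch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row sample in the documented 7-column layout (not real data).
SAMPLE = (
    "words,rank,title,h1,snippet,links,total_result\n"
    'ai,2,Title B,,Snippet B,https://example.com/b,"776,000,000"\n'
    'ai,1,Title A,Heading A,Snippet A,https://example.com/a,"776,000,000"\n'
)


def parse_count(text):
    """Turn '776,000,000' into 776000000; return None when empty or non-numeric."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def group_by_keyword(handle):
    """Group SERP rows by the `words` column, sorted by rank within each keyword."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        row["rank"] = int(row["rank"])
        row["total_result"] = parse_count(row["total_result"])
        row["h1"] = row["h1"] or None  # a missing H1 becomes None
        groups[row["words"]].append(row)
    for rows in groups.values():
        rows.sort(key=lambda r: r["rank"])
    return dict(groups)


results = group_by_keyword(io.StringIO(SAMPLE))
```

To run this against the real file, pass an open file handle for SEO_data.csv instead of the in-memory sample; the delimiter and encoding may need adjusting.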

Enjoy

Global most searched keywords on Google in 2025, by monthly search volume
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Global most searched keywords on Google in 2025, by monthly search volume [Dataset]. https://www.statista.com/statistics/1366210/most-searched-google-keywords/
Explore at:
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2025 - Mar 2025
Area covered
Worldwide
Description
"*******" was the most frequently searched keyword on Google worldwide, with over ***** million monthly online searches during the analyzed period of January to March in 2025. Furthermore, the search resulted in more than ***** million website visits, or more than **** percent of all traffic. With *** million monthly searches, "***" was the second most popular keyword, and "***********" came in third place with about ****** million searches per month.
company keywords data
kaggle.com
zip
Updated Jun 12, 2025
Cite
Harshii_Sharma (2025). company keywords data [Dataset]. https://www.kaggle.com/datasets/harshiisharma/company-keywords-data
Explore at:
Available download formats: zip (47497 bytes)
Dataset updated
Jun 12, 2025
Authors
Harshii_Sharma
Description
Dataset

This dataset was created by Harshii_Sharma

Contents
Global search volume for "AI" keyword 2022-2023
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Global search volume for "AI" keyword 2022-2023 [Dataset]. https://www.statista.com/statistics/1398211/ai-keyword-traffic-volume/
Explore at:
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jun 2022 - Mar 2023
Area covered
Worldwide
Description
Between June 2022 and March 2023, the traffic volume for the keyword "AI" more than tripled, rising from around 7.9 million monthly searches to more than 30.4 million in the last month of the measured period. General interest in artificial intelligence (AI) exploded in markets like the United States by the end of 2022. Likewise, interest in the application programming interfaces (APIs) and plugins of artificial intelligence solutions, especially those of ChatGPT, has increased markedly since the release of the tool in November 2022.

The artificial intelligence market

Valued at around 142.3 billion U.S. dollars in 2022, the artificial intelligence market is one of the most promising tech segments for the rest of the decade, with more than five billion U.S. dollars invested in startups, the most notable being the Californian company OpenAI and its flagship application ChatGPT. Disruptive as it is, the adoption of AI has already alarmed several industries: it is likely to affect job markets and raises concerns about cybercrime and other online misdeeds.

The future of online search?

Of all industries, the online search market may feel the impact of the new tool developed by OpenAI most strongly, like a global earthquake. With chatbots providing search results in a dialogue format, the trend of AI-powered search engines unleashed by ChatGPT threw giant companies like Google and Microsoft into a race with startups and other competitors to present the best candidate for this disruptive (and experimental) online solution.
Amazon Electronics Keywords from Hellium10
kaggle.com
zip
Updated Feb 19, 2024
Cite
Burak Gurhan (2024). Amazon Electronics Keywords from Hellium10 [Dataset]. https://www.kaggle.com/datasets/burakgurhan/h10-keywords
Explore at:
Available download formats: zip (207596 bytes)
Dataset updated
Feb 19, 2024
Authors
Burak Gurhan
Description
The data was obtained from Helium 10, a popular and reputable tool among Amazon sellers that supplies extensive data for analyzing products, keywords, and markets.

Amazon does not share its data with third parties, so the data does not reflect the real values.

The data covers the most searched keywords in the Amazon Electronics category. It was created in January 2024 and therefore reflects data from before that time.

The data contains 21 columns and more than 4,000 phrases, with details such as Search Volume, Fulfillment Type, Size Tier, and Variation.

The data is suitable for educational purposes: you can practice exploratory data analysis, data visualization, and data manipulation.
Data from: Search strategies and keyword searches used in 78 residential...
osf.io
Updated Jun 10, 2021
Cite
Martin Vuillème (2021). Search strategies and keyword searches used in 78 residential care reviews: a cross-sectional analysis [Dataset]. http://doi.org/10.17605/OSF.IO/7DQKP
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/7DQKP
Dataset updated
Jun 10, 2021
Dataset provided by
Center For Open Science
Authors
Martin Vuillème
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abstract

Objectives: To investigate the search strategies and keyword searches used in 95 residential care reviews.

Study design: Methodological study (cross-sectional).

Methods: First, I attempted to download the full-text versions of all 95 residential care reviews identified in a recently published project. I then searched the full-texts obtained to identify the database search strategies used. I extracted all residential care keywords used in the first search strategy identified in an Excel file. Keywords related to kinship care were not extracted. All keywords extracted were also added to a personal list of residential care keywords, if not included already. The titles and abstracts of all residential care reviews selected were extracted in an Excel file and analyzed using Excel’s COUNTIF function to identify the most commonly occurring keywords/strings. The sensitivity of residential care keywords found within my personal list but not within search strategies was then assessed using Excel’s COUNTIF function.

Results: Among the 95 residential care reviews, 5 (5.26%) did not report a search strategy, a search strategy was mentioned but not found for 2 reviews (2.11%) and I could not access the search strategy for 5 reviews (5.26%). Keywords were not extracted from 4 reviews given extensive use of controlled vocabulary (MeSH) or advanced search functions (adj, near, ?, etc.). The only review that did not report searches conducted in English was excluded from analysis. This left 78 review search strategies for analysis. Review authors used from 0 to 53 residential care keywords/strings (mean = 9 keywords, median = 7.5 keywords). 288 unique keywords/strings were used by review authors. The 10 most commonly used keywords were: foster care (51.28%), residential care (47.44%), out of home care (29.49%), out-of-home care (24.36%), group home (20.51%), institutional care (19.23%), children’s home (17.95%), child welfare (16.67%), looked after (16.67%) and looked-after (14.10%). 198 keywords/strings were only found once. The keywords most commonly found within the titles and abstracts of residential care reviews were: foster, foster care, resident, residential, placement, in care, residential care, institution, out-of-home and out-of-home care. Four keywords/strings were found in more than 4% of the titles and abstracts of residential care reviews but could not be identified within residential care reviews search strategies: care setting, out of care, residential youth care and care placement.

Funding: No funding was received for this work.

Registration and study protocol: See https://osf.io/7dqkp

Data and materials: See https://osf.io/7dqkp. All other data should otherwise be included within this manuscript.

Keywords: Children’s homes, residential care, electronic searches, systematic review
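
The COUNTIF-based counting described in the abstract can be approximated outside Excel. The sketch below is not the author's workbook: it assumes a case-insensitive "contains" match (similar to COUNTIF with a `"*keyword*"` criterion), and the three titles and the function names are hypothetical.

```python
def countif_contains(texts, keyword):
    """Count texts that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return sum(1 for text in texts if needle in text.lower())


def keyword_shares(texts, keywords):
    """Return the percentage of texts containing each keyword."""
    return {
        kw: 100.0 * countif_contains(texts, kw) / len(texts)
        for kw in keywords
    }


# Hypothetical titles, for illustration only.
titles = [
    "Foster care outcomes",
    "Residential care and group homes",
    "Out-of-home care placement",
]
shares = keyword_shares(titles, ["care", "foster care", "out of home care"])
```

In this sample, "out of home care" matches nothing because the only relevant title uses the hyphenated form, which illustrates why the hyphenated and unhyphenated variants are counted separately in the results.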
Data from: keywords-data
kaggle.com
zip
Updated Dec 24, 2024
Cite
Quân Phạm Ngọc (2024). keywords-data [Dataset]. https://www.kaggle.com/datasets/ngocquanofficial/keywords-data/versions/2
Explore at:
Available download formats: zip (540319 bytes)
Dataset updated
Dec 24, 2024
Authors
Quân Phạm Ngọc
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset

This dataset was created by Quân Phạm Ngọc

Released under MIT

Contents
Key data - Organic SEO Keywords: How to Choose and Use Them (2026 Guide)
incremys.com
html
Updated Mar 15, 2026
Cite
Incremys (2026). Key data - Organic SEO Keywords: How to Choose and Use Them (2026 Guide) [Dataset]. https://www.incremys.com/en/resources/blog/organic-seo-keywords
Explore at:
Available download formats: html
Dataset updated
Mar 15, 2026
Dataset authored and provided by
Incremys
Variables measured
Metric: Scaling production whilst keeping quality under control, Metric: E-E-A-T, reliability and usefulness: what Google actually rewards, Metric: Authority: when content is not enough and how to build credibility, Metric: Business metrics: leads, conversion, pipeline contribution and ROI, Metric: How SERPs dictate formats: rich results, FAQs, videos and category pages, Metric: Write for people, structure for engines: evidence, examples and readability, Metric: Prioritise opportunities: volume, competition, seasonality, effort and expected ROI
Description
2026 guide to selecting, prioritising and using organic SEO keywords without over-optimising, using a method focused on ROI.
Keywords Studios plc Alternative Data Analytics
meyka.com
Updated Sep 20, 2025
Cite
Meyka (2025). Keywords Studios plc Alternative Data Analytics [Dataset]. https://meyka.com/stock/KYYWF/alt-data/
Explore at:
Dataset updated
Sep 20, 2025
Dataset provided by
Meyka
Description
Non-traditional data signals from social media and employment platforms for KYYWF stock analysis
USA Purchase Intent Data | KeyWord Search | 150M Daily Hashed Emails
datarade.ai
Updated Nov 23, 2025
Cite
BIGDBM (2025). USA Purchase Intent Data | KeyWord Search | 150M Daily Hashed Emails [Dataset]. https://datarade.ai/data-products/intent-data-usa-coverage-datamarket-bigdbm
Explore at:
Available download formats: .json, .csv, .parquet
Dataset updated
Nov 23, 2025
Dataset authored and provided by
BIGDBM
Area covered
United States
Description
BigDBM's Purchase Intent Data transforms how businesses understand and engage with their customers by providing a comprehensive, real-time view of buyer purchase intent across both US consumer and B2B markets. With over 20 years of expertise in building identity graphs, our platform processes more than 110 million distinct hashed email addresses daily, delivering actionable insights that drive measurable ROI.

Our proprietary methodology combines data from multiple live streams and maps website domains and emails to IAB classification codes, giving you structured insights into market interests and purchase intent. Through advanced natural language processing and a custom five-tiered taxonomy, we extract granular keywords while maintaining broad category classification for flexible targeting.

Our unique intent intensity scoring quantifies purchase likelihood based on frequency and consistency of interest, while timestamp tracking reveals behavioral shifts and trends over time. With robust privacy compliance and ethical sourcing practices, BigDBM enables organizations to make smarter, faster decisions across customer acquisition, retention, and engagement through applications in account-based marketing, audience expansion, data enrichment, and programmatic advertising.

We do not provide any phone details from Colorado residents.
Patent AT-E400852-T1: [Translated] SYSTEM, METHOD AND APPARATUS FOR...
catalog.data.gov
data.zh-tw.virginia.gov
+11more
Updated Sep 8, 2025
Cite
National Center for Biotechnology Information (NCBI) (2025). Patent AT-E400852-T1: [Translated] SYSTEM, METHOD AND APPARATUS FOR PERFORMING A KEYWORD SEARCH [Dataset]. https://catalog.data.gov/dataset/patent-at-e400852-t1-translated-system-method-and-apparatus-for-performing-a-keyword-searc
Explore at:
Dataset updated
Sep 8, 2025
Dataset provided by
National Center for Biotechnology Information (NCBI)
Description
A keyterm search is a method of searching a database for subsets of the database that are relevant to an input query. First, a number of relational models of subsets of a database are provided. A query is then input. The query can include one or more keyterms. Next, a gleaning model of the query is created. The gleaning model of the query is then compared to each one of the relational models of subsets of the database. The identifiers of the relevant subsets are then output.
Query Fan-Out Research Dataset
ekamoira.com
json
Updated Jan 27, 2026
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ekamoira Research Team (2026). Query Fan-Out Research Dataset [Dataset]. https://www.ekamoira.com/blog/query-fan-out-original-research-on-how-ai-search-multiplies-every-query-and-why-most-brands-are-invisible
Explore at:
Available download formats: json
Dataset updated
Jan 27, 2026
Dataset authored and provided by
Ekamoira Research Team
Area covered
Global
Variables measured
URLs Analyzed, Keywords Analyzed, Fan-out Citation Lift, Fan-Out Query Stability, Optimal Passage Length Range, AI-Cited Pages Outside Top 10, AI Mode Monthly Users (US + India), Brands Missing AI Citation Opportunities, Citation Multiplier at 0.88+ Cosine Similarity, Spearman Correlation (Fan-Out Coverage vs Citations)
Measurement technique
Correlation analysis (Spearman 0.77), Topical coverage analysis, Semantic similarity scoring (cosine distance), Citation probability modeling, Fan-out coverage ratio measurement
Description
Original quantitative research dataset analyzing AI search query fan-out behavior across 173,902 URLs and 10,000 keywords with citations across Google AI Mode, ChatGPT, and Perplexity platforms
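
The two measurement techniques named above, cosine similarity and Spearman correlation, can be sketched in plain Python. This is a generic illustration, not the Ekamoira team's code: the Spearman shortcut formula used here assumes no tied values, and the function names and sample inputs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

With real data containing ties, a tie-aware implementation (average ranks followed by Pearson correlation on the ranks) would be needed instead of the shortcut formula.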
Responses to "Semantic Web: Perspectives" Questionnaire
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Jul 22, 2024
Cite
Hogan, Aidan (2024). Responses to "Semantic Web: Perspectives" Questionnaire [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3229400
Explore at:
Dataset updated
Jul 22, 2024
Dataset provided by
IMFD; DCC, Universidad de Chile
Authors
Hogan, Aidan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The dataset provides material relating to a questionnaire entitled "Semantic Web: Perspectives". This questionnaire was addressed to the W3C Semantic Web mailing list (semantic-web@w3.org) and was open to responses from May 12th to May 25th, 2019. A total of 113 responses were collected in this time. The following files are provided:

public-comments.txt: provides the public comments of respondents in plain text;

questionnaire-form.pdf: illustrates the design of the questionnaire, including questions, types of responses permitted, etc.;

questionnaire-responses.tsv: lists the individual responses (without private comments) as a tab-separated values file;

success-keywords.xlsx: provides a spreadsheet mapping success story responses to a list of keywords, further providing statistics on these keywords;

wordcloud-bw.svg: provides a word-cloud of success-story keywords in black & white;

wordcloud-colour.svg: provides a word-cloud of success-story keywords in colour.

The word-clouds were produced using Jason Davies' online service, copying and pasting the keywords from the success-keywords.xlsx spreadsheet (e.g., Column A, Sheet Statistics) into the text field; the following settings were selected: Orientations from 0° to 0°, Spiral: Rectangular; Scale: n; Number of words: 400; One word per line: ticked; Font: Patua One (must be installed locally beforehand). The resulting SVG files were later modified in a text editor to add a link to the font used, to tighten the bounding box, and to produce a black & white version.

We thank the respondents for providing their input.
China City Statistical Indicators, 2008
aura.american.edu
Updated Feb 12, 2025
Cite
China Data Center (2025). China City Statistical Indicators, 2008 [Dataset]. http://doi.org/10.57912/23844804.v1
Explore at:
Unique identifier
https://doi.org/10.57912/23844804.v1
Dataset updated
Feb 12, 2025
Dataset authored and provided by
China Data Center
License
http://rightsstatements.org/vocab/InC/1.0/
Area covered
China
Description
2008 city statistical data integrated with administrative boundaries for prefecture and county level cities.
OnAudience - Keyword feed, filtered raw data (60B daily data signals), 3.5...
datarade.ai
.csv
Updated Sep 5, 2023
Cite
OnAudience (2023). OnAudience - Keyword feed, filtered raw data (60B daily data signals), 3.5 years historic coverage [Dataset]. https://datarade.ai/data-products/onaudience-keyword-feed-filtered-raw-data-60b-daily-data-onaudience
Explore at:
Available download formats: .csv
Dataset updated
Sep 5, 2023
Dataset authored and provided by
OnAudience
Area covered
Cambodia, Equatorial Guinea, Rwanda, San Marino, Angola, Tanzania, United Kingdom, Solomon Islands, Sint Maarten (Dutch part), Swaziland
Description
Keyword feed is created by filtering raw data through a specified keyword configuration and allows for tracking web traffic with respect to various topics, e.g.:
- public companies
- brands
- products
By analyzing the feed, it is possible to evaluate popularity and sentiment surrounding the chosen phrase over time.
MeSH2Wikidata: A set of tools for the interaction between MeSH keywords, OBO...
figshare.com
txt
Updated Oct 29, 2024
Cite
Houcemeddine Turki; Khalil Chebil; Bonaventure Dossou; Chris Emezue; Abraham Owodunni; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha (2024). MeSH2Wikidata: A set of tools for the interaction between MeSH keywords, OBO Foundry, and Wikidata for enriching biomedical knowledge [Dataset]. http://doi.org/10.6084/m9.figshare.24438184.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24438184.v1
Dataset updated
Oct 29, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Houcemeddine Turki; Khalil Chebil; Bonaventure Dossou; Chris Emezue; Abraham Owodunni; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
The work consists of tools for the interaction between Wikidata and OBO Foundry and source codes for the use of MeSH keywords of PubMed publications for the enrichment of biomedical knowledge in Wikidata. This work is funded by the Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning Project within the framework of the Wikimedia Foundation Research Fund.

To cite the work: Turki, H., Chebil, K., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., & Ben Aouicha, M. (2024). A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Heliyon, 10(19), e38488. doi:10.1016/j.heliyon.2024.e38448.

Wikidata-OBO

tool1.py: A tool for the verification of the semantic alignment between Wikidata and OBO ontologies.
frame.py: The layout of Tool 1.
tool2.py: A tool for extracting Wikidata relations between OBO ontology items.
frame2.py: The layout of Tool 2.
tool3.py: A tool for extracting multilingual language data for OBO ontology items from Wikidata.
frame4.py: The layout of Tool 3.

Wikidata-MeSH

correct_mesh2matrix_dataset.py: A source code for turning MeSH2Matrix into a smaller dataset for the biomedical relation classification based on the MeSH keywords of PubMed publications, named MiniMeSH2Matrix.
build_numpy_dataset.py: A source code for building the numpy files for MiniMeSH2Matrix (Relation type-based classification).
label_encoded.csv: A table for the conversion of Wikidata Property IDs into MeSH2Matrix Class IDs.
new_encoding.csv: A table for the conversion of Wikidata Property IDs into MiniMeSH2Matrix Class IDs.
super_classes_new_dataset_labels.npy: The NumPy File of the labels for the superclass-based classification.
new_dataset_labels.npy: The NumPy File of the labels for the relation type-based classification.
new_dataset_matrices.npy: The Numpy File of the MiniMeSH2Matrix matrices for biomedical relation classification.
first_level_new_data.json: The JSON File for the conversion of relation types to superclasses.
build_super_classes.py: A source code for building the numpy files for MiniMeSH2Matrix (Superclass-based classification).
FC_MeSH_Model_57_New_Data.ipynb: A Jupyter Notebook for training a Dense Model to perform the relation type-based classification.
FC_MeSH_Model_57_New_Data_SuperClasses.ipynb: A Jupyter Notebook for training a Dense Model to perform the superclass-based classification.
new_data_best_model_1: A stored edition of the best model for the relation type-based classification.
new_data_super_classes_best_model_1: A stored edition of the best model for the superclass-based classification.
MiniMeSH2Matrix_SuperClasses_Confusion_Matrix.ipynb: A Jupyter Notebook for generating the confusion matrix for the superclass-based supervised classification.
MiniMeSH2Matrix_Supervised_Classification_Agreement.ipynb: A Jupyter Notebook for generating the matrix of agreement between the accurate predictions for superclass-based classification and the ones for relation type-based classification.
Adding_References_to_Wikidata.ipynb: A Jupyter Notebook to identify the PubMed ID of relevant references to unsupported Wikidata statements between MeSH terms.
MeSH_Statistics.xlsx: Statistical data about MeSH-based items and relations in Wikidata.
ref_for_unsupported_statements.csv: Retrieved Relevant PubMed References for 1k unsupported Wikidata statements.
evaluate_pubmed_ref_assignment.ipynb: A Jupyter Notebook that generates statistics about reference assignment for a sample of 1k unsupported statements.
MeSH_Verification.xlsx: A list of inaccurate or duplicated MeSH IDs in Wikidata, as of August 8th, 2023.
WikiRelationsPMI.csv: A list of PMI values for the semantic relations between MeSH terms, as available in Wikidata.
WikiRelationsPMIDistribution.xlsx: Distribution of PMI values for all Wikidata relations and for specific Wikidata relation types.
WikiRelationsToVerify.xlsx: Wikidata relations needing attention because they involve Wikidata items with inaccurate MeSH IDs, they cannot be found in PubMed, or their PMI values are below the threshold of 2.
Mesh_part1.py: A Python code that verifies the accuracy of the MeSH IDs for the Wikidata items.
MeshWikiPart.py: A Python code that computes the pointwise mutual information values for Wikidata relations between MeSH keywords based on PubMed.
Demo.ipynb: A demo of the MeSH-based biomedical relation validation and classification in French.
Id_Term.json: A dict of Medical Subject Headings labels corresponding to MeSH Descriptor ID.
dict_mesh.json: Number of the occurrences of MeSH keywords in PubMed.
finalmatrix.xlsx: Matrix of PMI values between the 5k most common MeSH Keywords.
finalmatrixrev.pkl: Pickle File Edition of the PMI matrix.
pmi2.xlsx: List of significant PMI associations between the 5k most common MeSH Keywords reaching a threshold of 2.
Generate5kMatrix.py: A Python code that generates the PMI matrix.
clean_pmi2.py: A Python code to remove the relations already available in Wikidata from pmi.xlsx.
missing_rels.xlsx: The final list of the significant PMI associations that do not exist in Wikidata.
item_category.json: A dict for MeSH tree categories corresponding to MeSH items.
item_categorization.py: A Python code that generates a dict for MeSH tree categories corresponding to MeSH items.
classification.py: A Python code for classifying PMI-generated semantic relations between the most common MeSH Keywords.
results.xlsx: The output of the classification of the PMI-generated semantic relations between the most common MeSH Keywords.
ClassificationStats.ipynb: A Jupyter Notebook for generating statistical data about the classification.
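
Several of the files above rely on pointwise mutual information (PMI) with a threshold of 2. A minimal sketch of that calculation follows. It is not taken from MeshWikiPart.py: the base-2 logarithm, the function names, and the counts in the usage example are all assumptions made for illustration.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """PMI = log2(p(x, y) / (p(x) * p(y))), estimated from document counts."""
    if pair_count == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log2(pair_count * total / (count_x * count_y))


def significant_pairs(pair_counts, term_counts, total, threshold=2.0):
    """Keep the term pairs whose PMI reaches the threshold."""
    kept = {}
    for (x, y), n_xy in pair_counts.items():
        score = pmi(n_xy, term_counts[x], term_counts[y], total)
        if score >= threshold:
            kept[(x, y)] = score
    return kept


# Hypothetical counts over 100 publications, for illustration only.
term_counts = {"A": 10, "B": 10, "C": 50}
pair_counts = {("A", "B"): 8, ("A", "C"): 5}
kept = significant_pairs(pair_counts, term_counts, total=100)
```

Here the pair ("A", "B") co-occurs eight times as often as independence would predict, giving a PMI of 3, while ("A", "C") co-occurs exactly as often as chance predicts, giving a PMI of 0, so it is filtered out.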
High frequency keywords classified as cultural value perception (Top 20).
plos.figshare.com
xls
Updated Dec 19, 2024
Cite
Qiaoyun Xu; Yan Xu; Chao Ma (2024). High frequency keywords classified as cultural value perception (Top 20). [Dataset]. http://doi.org/10.1371/journal.pone.0315805.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315805.t002
Dataset updated
Dec 19, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Qiaoyun Xu; Yan Xu; Chao Ma
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
High frequency keywords classified as cultural value perception (Top 20).
Number of models included in the MCS, at the 90% confidence level, using the...
plos.figshare.com
xls
Updated May 30, 2023
Cite
Dean Fantazzini (2023). Number of models included in the MCS, at the 90% confidence level, using the and statistics and the MSE loss function. [Dataset]. http://doi.org/10.1371/journal.pone.0111894.t011
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0111894.t011
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Dean Fantazzini
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Number of models included in the MCS, at the 90% confidence level, using the and statistics and the MSE loss function.
The data that support the findings of a review paper "From urban data to...
zenodo.org
data.niaid.nih.gov
+2more
Updated Sep 2, 2024
Cite
Klavdiya Bochenina; Klavdiya Bochenina (2024). The data that support the findings of a review paper "From urban data to city-scale models: A review of traffic simulation case studies" [Dataset]. http://doi.org/10.5281/zenodo.13311538
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13311538
Dataset updated
Sep 2, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Klavdiya Bochenina; Klavdiya Bochenina
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Time period covered
2024
Description
This dataset contains the data that were used in a review paper "From urban data to city-scale models: A review of traffic simulation case studies". It contains the following files:

keywords with counts.txt - list of keywords and their counts in the considered corpus of traffic simulation case studies. The data were used to produce Figure 2 and Figure 3 in the paper.

Papers analysis.xlsx - Excel file containing the data on the reviewed studies. The document has the following sheets:

Appendix A - contains a table with the short reference, location, simulation period, spatial scale, simulated units and marked categories for each paper;

Geography - contains data on geographical distribution of simulated areas between world regions and countries, these data were used to produce Figure 4 in the paper;

Software tools - contains data on simulation tools used in the studies.

Journals and conferences - contains data on where the reviewed papers were published.

Starbucks 30-Year 10-K NLP Corpus

kaggle.com

zip

Updated Mar 15, 2026

Cite

shiratori seto (2026). Starbucks 30-Year 10-K NLP Corpus [Dataset]. https://www.kaggle.com/datasets/shiratoriseto/starbucks-30year-nlp-corpus

Explore at:

Available download formats: zip (241177 bytes)

Dataset updated

Mar 15, 2026

Authors

shiratori seto

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

What's inside

Seven analysis-ready CSV files derived from 30 years of Starbucks 10-K annual reports (FY1996–FY2025), covering store expansion data, keyword frequency analysis, LDA topic modeling results, and document-level text statistics.

Files

store_counts_timeseries.csv (30 rows × 16 columns) — Store counts by segment (Company-operated/Licensed, US/International), percentage breakdowns, and CEO labels for each fiscal year.
item1_keyword_timeseries.csv (30 rows × 70 columns) — Raw counts and relative frequency (per 10K words) for 34 strategic keywords extracted from Item 1 (Business) section. Keywords include: coffee, experience, digital, mobile, china, partner, employee, sustainability, rewards, and more.
item1_lda_topic_proportions.csv (30 rows × 8 columns) — Year-level topic proportions from a 7-topic LDA model trained on 847 text chunks (~150 words each). Topics: Store Operations, Supply Chain & Commodity, Leadership & Governance, Digital & Loyalty, International & IP, People/Culture/ESG, Product & Competition.
item1_basic_stats.csv (30 rows × 6 columns) — Document-level statistics: character count, sentence count, word count, unique words, and lexical diversity (type-token ratio).
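
The document-level statistics listed for item1_basic_stats.csv can be sketched as follows. This is an illustration, not the dataset author's pipeline: the regex-based tokenization and sentence splitting are simplifying assumptions, and the dictionary keys are hypothetical names rather than the file's actual column headers.

```python
import re


def basic_stats(text):
    """Compute simple document-level statistics, including type-token ratio."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    unique = set(words)
    return {
        "char_count": len(text),
        "sentence_count": len(sentences),
        "word_count": len(words),
        "unique_words": len(unique),
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(unique) / len(words) if words else 0.0,
    }


stats = basic_stats("We sell coffee. We love coffee!")
```

The type-token ratio tends to fall as documents get longer, so comparing it across filings of very different lengths calls for some caution.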

Preprocessing pipeline

These CSVs were generated from the original 10-K filings through the following steps:
1. Download 30 annual 10-K filings from SEC EDGAR (CIK: 0000829224)
2. Extract Item 1 (Business) section from each filing
3. Strip HTML/XBRL tags, normalize whitespace
4. Tokenize → compute keyword frequencies (raw + per 10K words)
5. For LDA: chunk each document into ~150-word segments → 847 chunks → train 7-topic LDA model → aggregate topic proportions by year

The raw 10-K texts are not included (file size), but are freely available from SEC EDGAR as public domain documents.
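
Steps 3 and 4 of the pipeline (tag stripping, tokenization, raw and per-10K keyword frequencies) can be sketched as below. This is a rough approximation under stated assumptions, not the author's code: the regex-based tag removal and lowercase alphabetic tokenizer are simplifications, and the function names and sample text are hypothetical.

```python
import re
from collections import Counter


def clean(raw):
    """Strip markup tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def keyword_row(text, keywords):
    """Build one row: total_words, plus {keyword} and {keyword}_per10k columns."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    row = {"total_words": total}
    for kw in keywords:
        row[kw] = counts[kw]
        row[f"{kw}_per10k"] = 10_000 * counts[kw] / total if total else 0.0
    return row


# Hypothetical snippet, for illustration only.
raw = "<p>Coffee and  digital <b>coffee</b> rewards</p>"
row = keyword_row(clean(raw), ["coffee", "digital", "mobile"])
```

Normalizing to occurrences per 10,000 words makes years comparable even when the Item 1 section grows or shrinks; a robust pipeline would likely use a proper HTML parser rather than a regex.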

Use cases

Corporate language evolution analysis
NLP × financial metrics correlation studies
LDA topic modeling on SEC filings
Strategic keyword tracking across time
Reusable template for any public company's annual reports
Business strategy case studies

Data sources & licenses

Source	License
SEC EDGAR 10-K filings	Public domain (US government)
Store counts	Extracted from 10-K Item 1 text (public domain)

Related notebooks

Notebook	Theme	Link
Manhattan Cafe Wars	Theme 0: EDA & competitor mapping	Open
Starbucks 10-K NLP	Theme 1: keyword trends, LDA topics, NLP × store count	Open
Starbucks Spatial Clustering	Theme 2A: Moran's I, LISA, Ripley's K	Open
Starbucks Location Fitness	Theme 2B: demand-supply scoring & backtest	Open
Starbucks Data Pipeline	Pipeline: EDGAR & OSM to CSV, data quality report	Open

Related dataset: Manhattan Café Wars: Starbucks & Subway — spatial data for Theme 0, 2A, 2B

Column descriptions

store_counts_timeseries.csv

Column	Type	Description
fiscal_year	int	Fiscal year (1996–2025)
co_us	int	Company-operated stores in the US
co_international	int	Company-operated stores outside the US
lic_us	int	Licensed stores in the US
lic_international	int	Licensed stores outside the US
total_worldwide	int	Total store count worldwide
source_note	str	Data extraction note
co_total	int	Total company-operated stores
lic_total	int	Total licensed stores
us_total	int	Total US stores
intl_total	int	Total international stores
pct_licensed	float	Percentage of stores that are licensed
pct_international	float	Percentage of stores outside the US
yoy_growth	float	Year-over-year growth rate (%)
yoy_change	int	Year-over-year change in store count
ceo	str	CEO name for the fiscal year
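
The derived columns in this table (totals, percentage shares, year-over-year figures) could plausibly be computed from the four base counts as sketched below. The formulas are assumptions inferred from the column descriptions, not confirmed by the dataset, and the two sample rows are hypothetical.

```python
def derive_store_metrics(rows):
    """Add totals, percentage shares, and year-over-year figures to base counts."""
    derived = []
    previous_total = None
    for row in sorted(rows, key=lambda r: r["fiscal_year"]):
        co_total = row["co_us"] + row["co_international"]
        lic_total = row["lic_us"] + row["lic_international"]
        intl_total = row["co_international"] + row["lic_international"]
        total = co_total + lic_total  # assumed to be greater than zero
        out = dict(row)
        out.update(
            co_total=co_total,
            lic_total=lic_total,
            us_total=row["co_us"] + row["lic_us"],
            intl_total=intl_total,
            total_worldwide=total,
            pct_licensed=100.0 * lic_total / total,
            pct_international=100.0 * intl_total / total,
        )
        if previous_total is None:
            # The first year has no prior year to compare against.
            out["yoy_change"] = None
            out["yoy_growth"] = None
        else:
            out["yoy_change"] = total - previous_total
            out["yoy_growth"] = 100.0 * (total - previous_total) / previous_total
        previous_total = total
        derived.append(out)
    return derived


# Hypothetical counts, for illustration only.
sample = [
    {"fiscal_year": 2001, "co_us": 150, "co_international": 50,
     "lic_us": 50, "lic_international": 50},
    {"fiscal_year": 2000, "co_us": 100, "co_international": 0,
     "lic_us": 50, "lic_international": 50},
]
metrics = derive_store_metrics(sample)
```

Recomputing these columns from the base counts is one way to sanity-check the published file, bearing in mind that the dataset's own rounding and definitions may differ.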

item1_keyword_timeseries.csv

Each keyword has two columns: {keyword} (raw count) and {keyword}_per10k (frequency per 10,000 words).

Column	Description
fiscal_year	Fiscal year (1996–2025)
total_words	Total word count of the Item 1 section
digital / digital_per10k	"digital" occurrences
mobile / mobile_per10k	"mobile" occurrences
experience / experience_per10k	"experience" occurrences
china / china...

Cite

Gerome (2025). SEO-Data [Dataset]. https://www.kaggle.com/datasets/deeprankai/seo-data

SEO-Data

SEO Titles & Keywords Data From High Ranking Websites In All Industries
