53 datasets found

Google Usage Statistics 2025: Key Trends and Data Insights
sqmagazine.co.uk
Updated Sep 30, 2025
Cite
SQ Magazine (2025). Google Usage Statistics 2025: Key Trends and Data Insights [Dataset]. https://sqmagazine.co.uk/google-usage-statistics/
Dataset updated
Sep 30, 2025
Dataset authored and provided by
SQ Magazine
License
https://sqmagazine.co.uk/privacy-policy/
Time period covered
Jan 1, 2024 - Dec 31, 2025
Area covered
Global
Description
It starts with a simple habit: you open your browser and type a question. A few keystrokes later, Google gives you answers, videos, maps, and suggestions before you even finish your thought. For billions of people around the world, this daily interaction is second nature. But behind that blinking cursor...
Market share of leading desktop search engines worldwide monthly 2015-2025
statista.com
freeagenlt.com
Updated Nov 28, 2025
Cite
Statista (2025). Market share of leading desktop search engines worldwide monthly 2015-2025 [Dataset]. https://www.statista.com/statistics/216573/worldwide-market-share-of-search-engines/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2015 - Oct 2025
Area covered
Worldwide
Description
As of October 2025, Google represented ***** percent of the global online search engine referrals on desktop devices. Although Google is far ahead of its competitors, this represents only a modest increase from the previous months. Meanwhile, its longtime competitor Bing accounted for ***** percent, while tools like Yahoo and Yandex held shares of over **** percent and **** percent respectively.

Google and the global search market

Ever since the introduction of Google Search in 1997, the company has dominated the search engine market, while the shares of all other tools have been rather lopsided. The majority of Google revenues are generated through advertising. Its parent corporation, Alphabet, was one of the biggest internet companies worldwide as of 2024, with a market capitalization of **** trillion U.S. dollars. The company has also expanded its services to mail, productivity tools, enterprise products, mobile devices, and other ventures. As a result, Google earned one of the highest tech company revenues in 2024 with roughly ****** billion U.S. dollars.

Search engine usage in different countries

Google is the most frequently used search engine worldwide. But in some countries, its alternatives are leading or competing with it to some extent. As of the last quarter of 2023, more than ** percent of internet users in Russia used Yandex, whereas Google users represented little over ** percent. Meanwhile, Baidu was the most used search engine in China, despite a strong decrease in the percentage of internet users in the country accessing it. In other countries, like Japan and Mexico, people tend to use Yahoo along with Google. By the end of 2024, nearly half of the respondents in Japan said that they had used Yahoo in the past four weeks. In the same year, over ** percent of users in Mexico said they used Yahoo.
Belarus Internet Usage: Search Engine Market Share: Desktop: StartPagina...
ceicdata.com
Updated Mar 9, 2025
Cite
CEICdata.com (2025). Belarus Internet Usage: Search Engine Market Share: Desktop: StartPagina (Google) [Dataset]. https://www.ceicdata.com/en/belarus/internet-usage-search-engine-market-share/internet-usage-search-engine-market-share-desktop-startpagina-google
Dataset updated
Mar 9, 2025
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 1, 2025 - Mar 9, 2025
Area covered
Belarus
Description
Belarus Internet Usage: Search Engine Market Share: Desktop: StartPagina (Google) data was reported at 0.000 % in 09 Mar 2025. This records a decrease from the previous number of 0.030 % for 08 Mar 2025. Belarus Internet Usage: Search Engine Market Share: Desktop: StartPagina (Google) data is updated daily, averaging 0.070 % from Mar 2025 (Median) to 09 Mar 2025, with 9 observations. The data reached an all-time high of 0.070 % in 05 Mar 2025 and a record low of 0.000 % in 09 Mar 2025. Belarus Internet Usage: Search Engine Market Share: Desktop: StartPagina (Google) data remains active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Belarus – Table BY.SC.IU: Internet Usage: Search Engine Market Share.
Google SERP Data, Web Search Data, Google Images Data | Real-Time API
datarade.ai
.json, .csv
Cite
OpenWeb Ninja, Google SERP Data, Web Search Data, Google Images Data | Real-Time API [Dataset]. https://datarade.ai/data-products/openweb-ninja-google-data-google-image-data-google-serp-d-openweb-ninja
Available download formats: .json, .csv
Dataset authored and provided by
OpenWeb Ninja
Area covered
Tokelau, Burundi, Ireland, Panama, Barbados, South Georgia and the South Sandwich Islands, Grenada, Uganda, Virgin Islands (U.S.), Uruguay
Description
OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.

The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.

OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:

Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.

AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.

Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.

Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.

Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.

OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:

100B+ Images: Access an extensive database of over 100 billion images.

Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.

Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.

Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
Singapore Internet Usage: Search Engine Market Share: Tablet: StartPagina...
ceicdata.com
Updated Oct 15, 2025
Cite
CEICdata.com (2025). Singapore Internet Usage: Search Engine Market Share: Tablet: StartPagina (Google) [Dataset]. https://www.ceicdata.com/en/singapore/internet-usage-search-engine-market-share/internet-usage-search-engine-market-share-tablet-startpagina-google
Dataset updated
Oct 15, 2025
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 3, 2024 - Dec 11, 2024
Area covered
Singapore
Description
Singapore Internet Usage: Search Engine Market Share: Tablet: StartPagina (Google) data was reported at 0.000 % in 11 Dec 2024. This stayed constant from the previous number of 0.000 % for 10 Dec 2024. Singapore Internet Usage: Search Engine Market Share: Tablet: StartPagina (Google) data is updated daily, averaging 0.060 % from Dec 2024 (Median) to 11 Dec 2024, with 9 observations. The data reached an all-time high of 0.060 % in 07 Dec 2024 and a record low of 0.000 % in 11 Dec 2024. Singapore Internet Usage: Search Engine Market Share: Tablet: StartPagina (Google) data remains active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Singapore – Table SG.SC.IU: Internet Usage: Search Engine Market Share.
ScrapeHero Data Cloud - Free and Easy to use
datarade.ai
.json, .csv
Updated Feb 8, 2022
Cite
Scrapehero (2022). ScrapeHero Data Cloud - Free and Easy to use [Dataset]. https://datarade.ai/data-products/scrapehero-data-cloud-free-and-easy-to-use-scrapehero
Available download formats: .json, .csv
Dataset updated
Feb 8, 2022
Dataset provided by
ScrapeHero
Authors
Scrapehero
Area covered
Portugal, Ghana, Slovakia, Anguilla, Bhutan, Bahamas, Dominica, Chad, Bahrain, Niue
Description
The Easiest Way to Collect Data from the Internet

Download anything you see on the internet into spreadsheets within a few clicks using our ready-made web crawlers, or with a few lines of code using our APIs.

We have made it as simple as possible to collect data from websites.

Easy to Use Crawlers

Amazon Product Details and Pricing Scraper: Get product information, pricing, FBA, best seller rank, and much more from Amazon.

Google Maps Search Results: Get details like place name, phone number, address, website, ratings, and open hours from Google Maps or Google Places search results.

Twitter Scraper: Get tweets, Twitter handle, content, number of replies, number of retweets, and more. All you need to provide is a URL to a profile, hashtag, or an advanced search URL from Twitter.

Amazon Product Reviews and Ratings: Get customer reviews for any product on Amazon and get details like product name, brand, reviews and ratings, and more from Amazon.

Google Reviews Scraper: Scrape Google reviews and get details like business or location name, address, review, ratings, and more for businesses and places.

Walmart Product Details & Pricing: Get the product name, pricing, number of ratings, reviews, product images, URL, and other product-related data from Walmart.

Amazon Search Results Scraper: Get product search rank, pricing, availability, best seller rank, and much more from Amazon.

Amazon Best Sellers: Get the bestseller rank, product name, pricing, number of ratings, rating, product images, and more from any Amazon Bestseller List.

Google Search Scraper: Scrape Google search results and get details like search rank, paid and organic results, knowledge graph, related search results, and more.

Walmart Product Reviews & Ratings: Get customer reviews for any product on Walmart.com and get details like product name, brand, reviews, and ratings.

Scrape Emails and Contact Details: Get emails, addresses, contact numbers, and social media links from any website.

Walmart Search Results Scraper: Get product details such as pricing, availability, reviews, ratings, and more from Walmart search results and categories.

Glassdoor Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Glassdoor.

Indeed Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Indeed.

LinkedIn Jobs Scraper (Premium): Scrape job listings on LinkedIn and extract job details such as job title, job description, location, company name, number of reviews, and more.

Redfin Scraper (Premium): Scrape real estate listings from Redfin. Extract property details such as address, price, mortgage, Redfin estimate, broker name and more.

Yelp Business Details Scraper: Scrape business details from Yelp such as phone number, address, website, and more from Yelp search and business details page.

Zillow Scraper (Premium): Scrape real estate listings from Zillow. Extract property details such as address, price, broker, broker name and more.

Amazon product offers and third party sellers: Get product pricing, delivery details, FBA, seller details, and much more from the Amazon offer listing page.

Realtor Scraper (Premium): Scrape real estate listings from Realtor.com. Extract property details such as address, price, area, broker and more.

Target Product Details & Pricing: Get product details from search results and category pages such as pricing, availability, rating, reviews, and 20+ data points from Target.

Trulia Scraper (Premium): Scrape real estate listings from Trulia. Extract property details such as address, price, area, mortgage and more.

Amazon Customer FAQs: Get FAQs for any product on Amazon and get details like the question, answer, answered user name, and more.

Yellow Pages Scraper: Get details like business name, phone number, address, website, ratings, and more from Yellow Pages search results.
Data for study "Direct Answers in Google Search Results"
data.niaid.nih.gov
zenodo.org
Updated Jun 9, 2020
Cite
Strzelecki, Artur; Rutecka, Paulina (2020). Data for study "Direct Answers in Google Search Results" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3541091
Dataset updated
Jun 9, 2020
Dataset provided by
University of Economics in Katowice
Authors
Strzelecki, Artur; Rutecka, Paulina
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The goal of this research is to examine direct answers in the Google web search engine. The dataset was collected using Senuto (https://www.senuto.com/). Senuto is an online tool that extracts data on website visibility from the Google search engine.

Dataset contains the following elements:

keyword,

number of monthly searches,

featured domain,

featured main domain,

featured position,

featured type,

featured url,

content,

content length.

The dataset with visibility structure contains 743 798 keywords that resulted in SERPs with a direct answer.
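
As a rough illustration of how a file with the elements listed above might be explored, the sketch below sums monthly searches per featured type. The column names (`monthly_searches`, `featured_type`, and so on) and the three sample rows are assumptions made for this example; the published files may name and encode the fields differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """keyword,monthly_searches,featured_domain,featured_type,content_length
what is seo,1000,example.com,paragraph,120
how to bake bread,500,example.org,list,300
capital of poland,2000,example.net,paragraph,40
"""

def searches_by_featured_type(csv_text):
    """Sum monthly searches for each featured (direct answer) type."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["featured_type"]] += int(row["monthly_searches"])
    return dict(totals)

totals = searches_by_featured_type(SAMPLE_CSV)
print(totals)
```

With a real export, the same function could be fed the file contents instead of `SAMPLE_CSV`, after adjusting the column names to match the actual header row.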
Frequently used Google search terms in Germany 2024
statista.com
Updated Nov 28, 2025
Cite
Statista (2025). Frequently used Google search terms in Germany 2024 [Dataset]. https://www.statista.com/statistics/445591/most-frequent-google-search-terms-germany/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Germany
Description
As of February 2025, several search terms were googled especially actively by German internet users. Among these, the leading three were ‘handball wm’ (handball championship), ‘australian open 2025’ and ‘handball wm 2025’. The terms reflect events, certain dates and ensuing media coverage taking place at the time, in this case about the handball championship and the Australian Open tennis tournament.

Always searching

Daily life seems unthinkable without using an online search engine, whether for longer research, for quickly checking something, or even just to avoid setting bookmarks and typing in URLs. Google has by far the highest share among online search engines used on desktop and mobile devices, with almost 90 percent of searches done on Google, followed by Bing and Ecosia. While DuckDuckGo was further down on the list, its market share has been rising in Germany. Google may still have a substantial head start compared to its competitors, but users are increasingly apprehensive about data privacy and protection in connection with how the online search giant uses and stores personal information, as well as tracks search queries.

Searching for Trees

Ecosia is an environmentally friendly search engine with a unique business model that sets it apart from other search engines. It uses the revenue from search ads to plant trees worldwide and support reforestation projects. Every time a user performs a search on Ecosia, they indirectly contribute to reforestation, as one tree is planted for every 45 searches. The search engine market share held by Ecosia has been growing in recent years, especially in Germany, where the company is based, and in other countries in Europe. Ecosia, similarly to other alternative search engines (e.g. DuckDuckGo), uses Bing to power its results.
Google Ad Costs
kaggle.com
zip
Updated Aug 27, 2020
Cite
Brenda N (2020). Google Ad Costs [Dataset]. https://www.kaggle.com/brendan45774/how-much-it-cost-to-get-an-ad-on-google
Available download formats: zip (1660 bytes)
Dataset updated
Aug 27, 2020
Authors
Brenda N
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

Google is ranked in the top 10 with over 73,407 backlinks and a domain score of 94. Google's search volume is 83,100,100 a month. So how much money does it take to run an ad for your website?

Content

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2681031%2Fb4d4d101d4c7a560aed2d962a1c1f6de%2Fvolume.PNG?generation=1598539665064163&alt=media

Acknowledgements

Thanks to Ubersuggest for helping me provide the information for the dataset.

Inspiration

A lot of people have websites that they want to showcase, but at what cost? This dataset also helps people practice their DataFrame and CSV file skills, and it helps people who want to run a Google ad.
Tutorial video on how to use Advanced Google Search - Library records OD...
data.opendevelopmentmekong.net
Updated Jan 10, 2024
Cite
(2024). Tutorial video on how to use Advanced Google Search - Library records OD Mekong Datahub [Dataset]. https://data.opendevelopmentmekong.net/dataset/tutorial-video-on-how-to-use-advanced-google-search
Dataset updated
Jan 10, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This video is a tutorial on how to use Advanced Google Search. It explains what Advanced Google Search is, how to use it and its functions, and how it compares with regular Google Search.
U.S. mobile and desktop local search volume 2014-2019
statista.com
Updated Nov 22, 2024
Cite
Statista Research Department (2024). U.S. mobile and desktop local search volume 2014-2019 [Dataset]. https://www.statista.com/topics/2479/mobile-search/
Dataset updated
Nov 22, 2024
Dataset provided by
Statista (http://statista.com/)
Authors
Statista Research Department
Description
This statistic shows a projection of the local search query volume in the United States from 2014 to 2019, sorted by platform. In 2016, mobile local search query volume is estimated to reach 94.7 billion searches.
Leading search engines in the UK 2015-2025, by market share
statista.com
freeagenlt.com
Updated Nov 28, 2025
Cite
Statista (2025). Leading search engines in the UK 2015-2025, by market share [Dataset]. https://www.statista.com/statistics/279548/market-share-held-by-search-engines-in-the-united-kingdom/
Dataset updated
Nov 28, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 2015 - Jan 2025
Area covered
United Kingdom
Description
In January 2025, Google remained by far the most popular search engine in the UK, holding a market share of ***** percent across all devices. That month, Bing had a market share of approximately **** percent in second place, followed by Yahoo! with approximately **** percent.

The EU vs Google

Despite Google’s dominance of the search engine market, maintaining its position at the top has not been a smooth ride. Google’s market share saw a decline in the summer of 2018, plummeting to an all-time low in July. The search engine experienced a similar dip in June and July 2017. These two low points coincided with the European Commission’s antitrust charges against the company, both of which were unprecedented in the now decade-long duel between the two parties. As skepticism towards search engine platforms grows in line with public concern regarding censorship and data privacy, alternative services like DuckDuckGo offer users both information protection and unfiltered results. Despite this, DuckDuckGo still held less than *** percent of the industry’s market share as of June 2021.

Perception of fake news in the UK

According to a questionnaire conducted in the United Kingdom in 2018, **** percent of respondents had come across inaccurate news on social media at least once before. Rising concern over fake news, or information which has been manipulated to influence the public, has been a hot topic in recent years. The younger generation, however, remains skeptical, with nearly **** of Generation Z claiming either to be unconcerned about fake news or to believe that it does not exist at all.
DataForSEO Labs API for keyword research and search analytics, real-time...
datarade.ai
.json
Updated Jun 4, 2021
Cite
DataForSEO (2021). DataForSEO Labs API for keyword research and search analytics, real-time data for all Google locations and languages [Dataset]. https://datarade.ai/data-products/dataforseo-labs-api-for-keyword-research-and-search-analytics-dataforseo
Available download formats: .json
Dataset updated
Jun 4, 2021
Dataset authored and provided by
DataForSEO
Area covered
Tokelau, Mauritania, Micronesia (Federated States of), Armenia, Morocco, Cocos (Keeling) Islands, Azerbaijan, Isle of Man, Kenya, Korea (Democratic People's Republic of)
Description
DataForSEO Labs API offers three powerful keyword research algorithms and historical keyword data:

• Related Keywords from the “searches related to” element of Google SERP.
• Keyword Suggestions that match the specified seed keyword with additional words before, after, or within the seed key phrase.
• Keyword Ideas that fall into the same category as specified seed keywords.
• Historical Search Volume with current cost-per-click and competition values.

Based on in-market categories of Google Ads, you can get keyword ideas from the relevant Categories For Domain and discover relevant Keywords For Categories. You can also obtain Top Google Searches with AdWords and Bing Ads metrics, product categories, and Google SERP data.

You will find well-rounded ways to scout the competitors:

• Domain Whois Overview with ranking and traffic info from organic and paid search.
• Ranked Keywords that any domain or URL has positions for in SERP.
• SERP Competitors and the rankings they hold for the keywords you specify.
• Competitors Domain with a full overview of its rankings and traffic from organic and paid search.
• Domain Intersection keywords for which both specified domains rank within the same SERPs.
• Subdomains for the target domain you specify along with the ranking distribution across organic and paid search.
• Relevant Pages of the specified domain with rankings and traffic data.
• Domain Rank Overview with ranking and traffic data from organic and paid search.
• Historical Rank Overview with historical data on rankings and traffic of the specified domain from organic and paid search.
• Page Intersection keywords for which the specified pages rank within the same SERP.

All DataForSEO Labs API endpoints function in the Live mode. This means you will be provided with the results in response right after sending the necessary parameters with a POST request.

The limit is 2000 API calls per minute; however, you can contact our support team if your project requires higher rates.
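
A client that wants to stay under the stated limit of 2000 calls per minute could throttle itself locally. The sketch below is a generic sliding-window limiter and is not part of any DataForSEO client library; the class name `MinuteRateLimiter` and its `try_acquire` method are hypothetical, and the clock is injectable so the logic can be checked without waiting a full minute.

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Allow at most `max_calls` within any rolling `window` seconds."""

    def __init__(self, max_calls=2000, window=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self._stamps = deque()  # timestamps of recently allowed calls

    def try_acquire(self):
        """Return True and record the call if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have left the rolling window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False
```

A caller would check `try_acquire()` before each request and pause briefly whenever it returns `False`.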

We offer well-rounded API documentation, GUI for API usage control, comprehensive client libraries for different programming languages, free sandbox API testing, ad hoc integration, and deployment support.

We have a pay-as-you-go pricing model. You simply add funds to your account and use them to get data. The account balance doesn't expire.

Wordle Answer Search Trends Dataset (2021–2025)

kaggle.com

zip

Updated Jun 26, 2025

Cite

Ankush Kamboj (2025). Wordle Answer Search Trends Dataset (2021–2025) [Dataset]. https://www.kaggle.com/datasets/kambojankush/wordle-answer-search-trends-dataset-20212025

Available download formats: zip (30419 bytes)

Dataset updated

Jun 26, 2025

Authors

Ankush Kamboj

License

https://www.gnu.org/licenses/gpl-3.0.html

Description

This dataset investigates the relationship between Wordle answers and Google search spikes, particularly for uncommon words. It spans from June 21, 2021 to June 24, 2025.

It includes daily data for each Wordle answer, its search trend on that day, and frequency-based commonality indicators.

🔍 Hypothesis

Each Wordle answer causes a spike in search volume on the day it appears — more so if the word is rare.

This dataset supports exploration of:

Wordle answers
Search trends for Wordle answers
Correlation between Wordle answer rarity and search interest

Columns

Column	Description
`date`	Date of the Wordle puzzle
`word`	Correct 5-letter Wordle answer
`game`	Wordle game number
`wordfreq_commonality`	Normalized frequency score using Python’s `wordfreq` library
`subtlex_commonality`	Normalized frequency score using SUBTLEX-US dataset
`trend_day_global`	Google search interest on the day (global, all categories)
`trend_avg_200_global`	200-day average search interest (global, all categories)
`trend_day_language`	Search interest on Wordle day (Language Resources category)
`trend_avg_200_language`	200-day average search interest (Language Resources category)

Notes: - All trend values are relative (0–100 scale, per Google Trends)
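
One way the hypothesis could be probed with the columns above is to compare each day's search interest with its 200-day average and correlate that ratio with word commonality. The sketch below is only an illustration: the three rows are invented placeholder values, not taken from the dataset, and the helper names `spike_ratio` and `pearson` are hypothetical.

```python
from math import sqrt

# Invented placeholder rows using the column names from the table above.
rows = [
    {"word": "crane", "wordfreq_commonality": 0.80,
     "trend_day_global": 20, "trend_avg_200_global": 10},
    {"word": "knoll", "wordfreq_commonality": 0.30,
     "trend_day_global": 60, "trend_avg_200_global": 5},
    {"word": "house", "wordfreq_commonality": 0.95,
     "trend_day_global": 50, "trend_avg_200_global": 40},
]

def spike_ratio(row):
    """Search interest on Wordle day relative to the 200-day average."""
    avg = row["trend_avg_200_global"]
    return row["trend_day_global"] / avg if avg else None

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Skip rows whose 200-day average is zero, since the ratio is undefined there.
usable = [r for r in rows if spike_ratio(r) is not None]
r_value = pearson([r["wordfreq_commonality"] for r in usable],
                  [spike_ratio(r) for r in usable])
print(round(r_value, 3))
```

A negative coefficient on real data would be consistent with rarer words producing larger spikes, though the relative 0–100 scaling of the trend values means the ratio should be read as indicative rather than exact.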

🧮 Methodology

Wordle answers were scraped from wordfinder.yourdictionary.com
Commonality scores were computed using:
- wordfreq Python library
- SUBTLEX-US dataset (subtitle frequency, approximating spoken English)
Trend data was fetched using Google Trends API via pytrends

📊 Analysis

An analysis based on this data can be found in the blog post.

Data from: Inventory of online public databases and repositories holding...
catalog.data.gov
s.cnmilf.com
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
Description
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.

Purpose

As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication.

The goals of this study were to:

• establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals
• compare how much data is in institutional vs. domain-specific vs. federal platforms
• determine which repositories are recommended by top journals that require or recommend the publication of supporting data
• ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

Approach

The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data.
To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

Search methods

We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied.
General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each the top 50 journals. 
Evaluation

We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results

A summary of the major findings from our data review:

Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.

There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.

Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

See included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:

Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Forecasting influenza in Hong Kong with Google search queries and...
plos.figshare.com
docx
Updated May 31, 2023
Qinneng Xu; Yulia R. Gel; L. Leticia Ramirez Ramirez; Kusha Nezafati; Qingpeng Zhang; Kwok-Leung Tsui (2023). Forecasting influenza in Hong Kong with Google search queries and statistical model fusion [Dataset]. http://doi.org/10.1371/journal.pone.0176690
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0176690
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Qinneng Xu; Yulia R. Gel; L. Leticia Ramirez Ramirez; Kusha Nezafati; Qingpeng Zhang; Kwok-Leung Tsui
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Hong Kong
Description
Background

The objective of this study is to investigate the predictive utility of online social media and web search queries, particularly Google search data, to forecast new cases of influenza-like-illness (ILI) in general outpatient clinics (GOPC) in Hong Kong. To mitigate the impact of sensitivity to self-excitement (i.e., fickle media interest) and other artifacts of online social media data, in our approach we fuse multiple offline and online data sources.

Methods

Four individual models are employed to forecast ILI-GOPC both one week and two weeks in advance: generalized linear model (GLM), least absolute shrinkage and selection operator (LASSO), autoregressive integrated moving average (ARIMA), and deep learning (DL) with Feedforward Neural Networks (FNN). The covariates include Google search queries, meteorological data, and previously recorded offline ILI. To our knowledge, this is the first study that introduces deep learning methodology into surveillance of infectious diseases and investigates its predictive utility. Furthermore, to exploit the strength of each individual forecasting model, we use statistical model fusion, using Bayesian model averaging (BMA), which allows a systematic integration of multiple forecast scenarios. For each model, an adaptive approach is used to capture the recent relationship between ILI and covariates.

Results

DL with FNN appears to deliver the most competitive predictive performance among the four considered individual models. Combining all four models in a comprehensive BMA framework further improves such predictive evaluation metrics as root mean squared error (RMSE) and mean absolute predictive error (MAPE). Nevertheless, DL with FNN remains the preferred method for predicting locations of influenza peaks.

Conclusions

The proposed approach can be viewed as a feasible alternative to forecast ILI in Hong Kong or other countries where ILI has no constant seasonal trend and influenza data resources are limited. The proposed methodology is easily tractable and computationally efficient.
Repository Analytics and Metrics Portal (RAMP) 2017 data
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Jul 27, 2021
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2017 data [Dataset]. http://doi.org/10.5061/dryad.r7sqv9scf
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.r7sqv9scf
Dataset updated
Jul 27, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates usage and performance data of institutional repositories (IR). The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.

Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
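The flagging step described above can be sketched as follows. This is only an illustration under stated assumptions: the description does not say how RAMP decides that a URL points to a content file, so the extension list and the function name `citable_content` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension list; the description only names "PDF, CSV, etc."
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xlsx", ".zip"}


def citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a content-file extension, else "No"."""
    # Look only at the path, so query strings such as ?sequence=1 are ignored.
    path = urlparse(url).path.lower()
    _, extension = posixpath.splitext(path)
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"


print(citable_content("https://ir.example.edu/bitstream/1/2/thesis.PDF"))
print(citable_content("https://ir.example.edu/handle/1/2"))
```

The return values mirror the "Yes" and "No" values documented for the citableContent field.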

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes."

Sum the value of the "clicks" field on these rows.
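The two steps above can be sketched in a few lines of Python. The sample rows are invented for illustration, and the sketch assumes that the date field is stored as an ISO `YYYY-MM-DD` string so that string comparison orders dates correctly; the description does not state the date format.

```python
import csv
import io


def citable_content_downloads(rows, start, end):
    """Sum clicks on rows flagged as citable content within [start, end]."""
    total = 0
    for row in rows:
        # Assumption: dates are ISO strings, so lexicographic order is date order.
        if start <= row["date"] <= end and row["citableContent"] == "Yes":
            total += int(row["clicks"])
    return total


# Invented sample data using the field names from the description.
SAMPLE = """url,clicks,citableContent,date
https://ir.example.edu/a.pdf,3,Yes,2017-01-02
https://ir.example.edu/handle/1,5,No,2017-01-02
https://ir.example.edu/b.csv,2,Yes,2017-01-15
https://ir.example.edu/a.pdf,4,Yes,2017-02-01
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(citable_content_downloads(rows, "2017-01-01", "2017-01-31"))  # 3 + 2 = 5
```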

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.
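As a small illustration of the naming pattern, the expected monthly file names for a year can be generated as below. The helper name `ramp_monthly_filenames` is hypothetical, and the sketch assumes one file per calendar month, which the description implies but does not state outright.

```python
def ramp_monthly_filenames(year: int) -> list[str]:
    """Build file names of the form YYYY-MM_RAMP_all.csv for each month of a year."""
    return [f"{year}-{month:02d}_RAMP_all.csv" for month in range(1, 13)]


print(ramp_monthly_filenames(2017)[0])  # 2017-01_RAMP_all.csv
```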

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Repository Analytics and Metrics Portal (RAMP) 2018 data
data.niaid.nih.gov
dataone.org
+1more
zip
Updated Jul 27, 2021
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2018 data [Dataset]. http://doi.org/10.5061/dryad.ffbg79cvp
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.ffbg79cvp
Dataset updated
Jul 27, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates usage and performance data of institutional repositories (IR). The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August 2018, and two sets of documentation are provided to describe data collection and processing before and after the change.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.

Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes."

Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.

Data Collection from August 19, 2018 Onward

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.

Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped by the country from which the user submitted the corresponding search and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data; the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository
Replication Data for: Using electronic health records and Internet search...
datasetcatalog.nlm.nih.gov
dataverse.harvard.edu
+1more
Updated May 3, 2017
Gray, Josh; Kou, S. C.; Brownstein, John S.; Richardson, Stewart; Santillana, Mauricio; Yang, Shihao (2017). Replication Data for: Using electronic health records and Internet search information for accurate influenza forecasting [Dataset]. http://doi.org/10.7910/DVN/ZJZM4F
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ZJZM4F
Dataset updated
May 3, 2017
Authors
Gray, Josh; Kou, S. C.; Brownstein, John S.; Richardson, Stewart; Santillana, Mauricio; Yang, Shihao
Description
The data for replication contain three parts: Centers for Disease Control and Prevention (CDC) data, Google Trends (GT) data, and Electronic Health Record (EHR) data. All data were obtained and frozen as of July 9, 2016.

CDC publishes the weekly unweighted Influenza-like Illness (ILI) activity level (gis.cdc.gov/grasp/fluview/fluportaldashboard.html). The initial report is subject to revision in later weeks as more data are gathered and processed from participating clinics around the country. We have consolidated the CDC's weekly unweighted ILI activity level data with later revisions into one single csv file.

Google Trends (www.google.com/trends) data are publicly available. The query terms that we used were identified from Google Correlate (www.google.com/trends/correlate), where we identified 129 flu-related Google search terms in total. The Google Trends data were then manually downloaded and consolidated into one single csv file.

The EHR data are from athenahealth, a provider of cloud-based services and mobile applications for medical groups and health systems (www.athenahealth.com). The data we used are historical values of three nationally aggregated weekly fractions of total patient visit counts: (a) flu visit counts, (b) ILI visit counts, and (c) unspecified viral or ILI visit counts. That is, the data reported are rounded fractions of each type of count relative to total patient visit counts. The EHR data are available in real time starting from July 2009.
Network Traffic Analysis: Data and Code
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jun 12, 2024
Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric (2024). Network Traffic Analysis: Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11479410
Explore at:
Dataset updated
Jun 12, 2024
Dataset provided by
Loyola University Chicago
Authors
Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Code:

Packet_Features_Generator.py & Features.py

To run this code:

pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j

-h, --help    show this help message and exit
-i TXTFILE    input text file
-x X          Add first X number of total packets as features.
-y Y          Add first Y number of negative packets as features.
-z Z          Add first Z number of positive packets as features.
-ml           Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
-s S          Generate samples using size s.
-j

Purpose:

Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.

Uses Features.py to calculate the features.

startMachineLearning.sh & machineLearning.py

To run this code:

bash startMachineLearning.sh

This code then runs machineLearning.py in a tmux session with the necessary file paths and flags.

Options (to be edited within this file):

--evaluate-only to test 5-fold cross validation accuracy

--test-scaling-normalization to test 6 different combinations of scalers and normalizers

Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use

--grid-search to test the best grid search hyperparameters - note: the possible hyperparameters must be added to train_model under 'if not evaluateOnly:' - once best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'

Purpose:

Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest Classifier on the provided data and provides results using cross validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.

Data

Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.

Data for this experiment was stored and analyzed in the form of a txt file for each experiment which contains:

The first number is a classification number denoting which website, query, or VR action is taking place.

The remaining numbers in each line denote:

The size of a packet,

and the direction it is traveling.

negative numbers denote incoming packets

positive numbers denote outgoing packets
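A line in this format could be parsed roughly as follows. This is a sketch rather than the authors' code: the description does not name the delimiter, so the example assumes whitespace or commas, and the function name `parse_capture_line` is hypothetical.

```python
import re


def parse_capture_line(line: str):
    """Split one capture line into (label, incoming_sizes, outgoing_sizes)."""
    # Assumption: values are separated by whitespace and/or commas.
    tokens = [token for token in re.split(r"[,\s]+", line.strip()) if token]
    label = int(tokens[0])  # classification number for the site, query, or action
    signed_sizes = [int(token) for token in tokens[1:]]
    incoming = [-size for size in signed_sizes if size < 0]  # negative = incoming
    outgoing = [size for size in signed_sizes if size > 0]   # positive = outgoing
    return label, incoming, outgoing


print(parse_capture_line("7 -1500 60 -1500 52"))  # (7, [1500, 1500], [60, 52])
```

Sizes are returned as positive magnitudes in both lists, with the direction carried by which list a value lands in.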

Figure 4 Data

This data uses specific lines from the Virtual Reality.txt file.

The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.

The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.

The .xlsx and .csv files are identical.

Each file includes (from right to left):

The original packet data,

each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,

and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
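The description does not say whether the Figure 4 CDF is an empirical CDF or a normal CDF fitted from the mean and standard deviation, so the sketch below computes both for one packet capture. The function name `cdf_table`, the sample sizes, and the use of the sample (rather than population) standard deviation are all assumptions.

```python
import bisect
import statistics


def cdf_table(sizes):
    """Return (size, empirical_cdf, normal_cdf) rows for one packet capture."""
    ordered = sorted(sizes)  # smallest to largest packet size
    mean = statistics.mean(ordered)
    std = statistics.stdev(ordered)  # sample standard deviation (assumption)
    # Needs at least two distinct values, otherwise std is zero or undefined.
    normal = statistics.NormalDist(mean, std)
    n = len(ordered)
    rows = []
    for size in ordered:
        # Fraction of packets with size <= this value (ties share one value).
        empirical = bisect.bisect_right(ordered, size) / n
        rows.append((size, empirical, normal.cdf(size)))
    return rows


for row in cdf_table([60, 1500, 52, 1500]):
    print(row)
```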
