DataForSEO Labs API offers three powerful keyword research algorithms and historical keyword data:
• Related Keywords from the “searches related to” element of Google SERP.
• Keyword Suggestions that match the specified seed keyword with additional words before, after, or within the seed key phrase.
• Keyword Ideas that fall into the same category as the specified seed keywords.
• Historical Search Volume with current cost-per-click and competition values.
Based on in-market categories of Google Ads, you can get keyword ideas from the relevant Categories For Domain and discover relevant Keywords For Categories. You can also obtain Top Google Searches with AdWords and Bing Ads metrics, product categories, and Google SERP data.
You will find well-rounded ways to scout the competitors:
• Domain Whois Overview with ranking and traffic info from organic and paid search.
• Ranked Keywords that any domain or URL has positions for in SERP.
• SERP Competitors and the rankings they hold for the keywords you specify.
• Competitors Domain with a full overview of its rankings and traffic from organic and paid search.
• Domain Intersection keywords for which both specified domains rank within the same SERPs.
• Subdomains for the target domain you specify along with the ranking distribution across organic and paid search.
• Relevant Pages of the specified domain with rankings and traffic data.
• Domain Rank Overview with ranking and traffic data from organic and paid search.
• Historical Rank Overview with historical data on rankings and traffic of the specified domain from organic and paid search.
• Page Intersection keywords for which the specified pages rank within the same SERP.
All DataForSEO Labs API endpoints function in the Live mode. This means you will be provided with the results in response right after sending the necessary parameters with a POST request.
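For illustration, here is a minimal Python sketch of a Live-mode call, assuming basic HTTP authentication and a Related Keywords endpoint path; the exact URL, credentials, and payload fields should be taken from the current DataForSEO documentation.

```python
import requests

# Minimal sketch of a Live-mode call; the endpoint path and payload fields are
# assumptions to verify against the DataForSEO docs for your plan.
API_URL = "https://api.dataforseo.com/v3/dataforseo_labs/google/related_keywords/live"

payload = [{
    "keyword": "running shoes",   # illustrative seed keyword
    "location_code": 2840,        # assumed code for the United States
    "language_code": "en",
    "limit": 100,
}]

response = requests.post(API_URL, auth=("login", "password"), json=payload, timeout=30)
response.raise_for_status()

# Live mode returns results in the same response, so no task polling is needed.
data = response.json()
print(data.get("status_message"))
for task in data.get("tasks", []):
    print(task.get("result"))
```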
The limit is 2000 API calls per minute; however, you can contact our support team if your project requires higher rates.
We offer well-rounded API documentation, GUI for API usage control, comprehensive client libraries for different programming languages, free sandbox API testing, ad hoc integration, and deployment support.
We have a pay-as-you-go pricing model. You simply add funds to your account and use them to get data. The account balance doesn't expire.
https://creativecommons.org/publicdomain/zero/1.0/
The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:
Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Fork this kernel to get started.
Banner Photo by Edho Pratama from Unsplash.
What is the total number of transactions generated per device browser in July 2017?
The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?
What was the average number of product pageviews for users who made a purchase in July 2017?
What was the average number of product pageviews for users who did not make a purchase in July 2017?
What was the average total transactions per user that made a purchase in July 2017?
What is the average amount of money spent per session in July 2017?
What is the sequence of pages viewed?
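As an illustration of how such questions can be answered, the sketch below runs the first one (transactions per device browser in July 2017) against the public bigquery-public-data.google_analytics_sample tables using the google-cloud-bigquery client; field names follow the standard GA360 BigQuery export schema, so adjust them if your copy of the data differs.

```python
from google.cloud import bigquery

# Total transactions per device browser for July 2017, following the GA360
# export schema (device.browser, totals.transactions, sharded ga_sessions_* tables).
client = bigquery.Client()

query = """
SELECT
  device.browser AS browser,
  SUM(totals.transactions) AS total_transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
GROUP BY browser
ORDER BY total_transactions DESC
"""

for row in client.query(query).result():
    print(row.browser, row.total_transactions)
```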
In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period.

The internet leaders: search, social, and e-commerce
Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world's most popular websites for user-generated content, solidifying Alphabet's and Meta's leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content?
Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet's engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
The Easiest Way to Collect Data from the Internet
Download anything you see on the internet into spreadsheets within a few clicks using our ready-made web crawlers or a few lines of code using our APIs.
We have made it as simple as possible to collect data from websites
Easy to Use Crawlers
Amazon Product Details and Pricing Scraper: Get product information, pricing, FBA, best seller rank, and much more from Amazon.
Google Maps Search Results: Get details like place name, phone number, address, website, ratings, and open hours from Google Maps or Google Places search results.
Twitter Scraper: Get tweets, Twitter handle, content, number of replies, number of retweets, and more. All you need to provide is a URL to a profile, hashtag, or an advanced search URL from Twitter.
Amazon Product Reviews and Ratings: Get customer reviews for any product on Amazon and get details like product name, brand, reviews and ratings, and more from Amazon.
Google Reviews Scraper: Scrape Google reviews and get details like business or location name, address, review, ratings, and more for businesses and places.
Walmart Product Details & Pricing: Get the product name, pricing, number of ratings, reviews, product images, URL, and other product-related data from Walmart.
Amazon Search Results Scraper: Get product search rank, pricing, availability, best seller rank, and much more from Amazon.
Amazon Best Sellers: Get the bestseller rank, product name, pricing, number of ratings, rating, product images, and more from any Amazon Bestseller List.
Google Search Scraper: Scrape Google search results and get details like search rank, paid and organic results, knowledge graph, related search results, and more.
Walmart Product Reviews & Ratings: Get customer reviews for any product on Walmart.com and get details like product name, brand, reviews, and ratings.
Scrape Emails and Contact Details: Get emails, addresses, contact numbers, and social media links from any website.
Walmart Search Results Scraper: Get product details such as pricing, availability, reviews, ratings, and more from Walmart search results and categories.
Glassdoor Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Glassdoor.
Indeed Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Indeed.
LinkedIn Jobs Scraper (Premium): Scrape job listings on LinkedIn and extract job details such as job title, job description, location, company name, number of reviews, and more.
Redfin Scraper (Premium): Scrape real estate listings from Redfin. Extract property details such as address, price, mortgage, Redfin estimate, broker name, and more.
Yelp Business Details Scraper: Scrape business details from Yelp such as phone number, address, website, and more from Yelp search and business details pages.
Zillow Scraper (Premium): Scrape real estate listings from Zillow. Extract property details such as address, price, broker name, and more.
Amazon Product Offers and Third-Party Sellers: Get product pricing, delivery details, FBA, seller details, and much more from the Amazon offer listing page.
Realtor Scraper (Premium): Scrape real estate listings from Realtor.com. Extract property details such as address, price, area, broker, and more.
Target Product Details & Pricing: Get product details from search results and category pages such as pricing, availability, rating, reviews, and 20+ data points from Target.
Trulia Scraper (Premium): Scrape real estate listings from Trulia. Extract property details such as address, price, area, mortgage, and more.
Amazon Customer FAQs: Get FAQs for any product on Amazon and get details like the question, answer, answering user name, and more.
Yellow Pages Scraper: Get details like business name, phone number, address, website, ratings, and more from Yellow Pages search results.
Information about pages on the City's website, including their age and their Google Analytics data (everything from "PageViews" to the right). If the Google Analytics fields are empty, the page has not been visited recently at all.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. Of the 12,330 sessions, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples ending with shopping. The dataset consists of 10 numerical and 8 categorical attributes, and the 'Revenue' attribute can be used as the class label.

The dataset contains 18 columns, each representing a specific attribute of online shopping behavior:
- Administrative and Administrative_Duration: number of administrative pages visited and time spent on them.
- Informational and Informational_Duration: number of informational pages visited and time spent on them.
- ProductRelated and ProductRelated_Duration: number of product-related pages visited and time spent on them.
- BounceRates and ExitRates: metrics describing user behavior during the session.
- PageValues: value of the page based on e-commerce metrics.
- SpecialDay: likelihood of shopping based on closeness to a special day.
- Month: month of the session.
- OperatingSystems, Browser, Region, TrafficType: technical and geographical attributes.
- VisitorType: categorizes users as returning, new, or other.
- Weekend: indicates whether the session occurred on a weekend.
- Revenue: target variable indicating whether a transaction was completed (True or False).

The original dataset comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset

Additional variable information: "Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when a user takes an action, e.g. moving from one page to another. The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by Google Analytics for each page on the e-commerce site. The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The "Exit Rate" of a specific web page is the percentage of all pageviews of that page that were the last in the session. The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction. The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value takes a nonzero value between February 2 and February 12, is zero before and after these dates unless it is close to another special day, and reaches its maximum value of 1 on February 8.
The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.
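A minimal Python sketch for loading the data and separating the 'Revenue' class label, assuming the UCI CSV has been downloaded locally as online_shoppers_intention.csv (the file name used by the UCI repository):

```python
import pandas as pd

# Load the Online Shoppers Purchasing Intention dataset; the local file name
# is an assumption, so adjust the path to wherever you saved the CSV.
df = pd.read_csv("online_shoppers_intention.csv")

print(df.shape)                      # expected: (12330, 18)
print(df["Revenue"].value_counts())  # class balance: ~10,422 False vs ~1,908 True

# Separate the class label from the 17 predictor columns.
X = df.drop(columns=["Revenue"])
y = df["Revenue"].astype(int)
```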
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please refer to the original data article for further data description: Jeřábek & Hynek et al., Collection of datasets with DNS over HTTPS traffic, Data in Brief, DOI: 10.1016/j.dib.2022.108310
Dataset of DNS over HTTPS traffic from Firefox (FFMuc, Google, Hostux, OpenDNS, Quad9, Switch)
The dataset contains DoH and HTTPS traffic that was captured in a virtualized environment (Docker) and generated automatically by the Firefox browser with DoH enabled towards 6 different DoH servers (FFMuc, Google, Hostux, OpenDNS, Quad9, Switch), along with web-page loads towards a sample of web pages taken from the Majestic Million dataset. The data are provided in the form of PCAP files. However, we also provide TLS-enriched flow data generated with the open-source ipfixprobe flow exporter. Information other than the TLS-related fields is not relevant, since the dataset comprises only encrypted TLS traffic. The TLS-enriched flow data are provided in the form of CSV files with the following columns:
| Column Name | Column Description |
|---|---|
| DST_IP | Destination IP address |
| SRC_IP | Source IP address |
| BYTES | The number of transmitted bytes from Source to Destination |
| BYTES_REV | The number of transmitted bytes from Destination to Source |
| TIME_FIRST | Timestamp of the first packet in the flow, in the format YYYY-MM-DDTHH-MM-SS |
| TIME_LAST | Timestamp of the last packet in the flow, in the format YYYY-MM-DDTHH-MM-SS |
| PACKETS | The number of packets transmitted from Source to Destination |
| PACKETS_REV | The number of packets transmitted from Destination to Source |
| DST_PORT | Destination port |
| SRC_PORT | Source port |
| PROTOCOL | The transport protocol number |
| TCP_FLAGS | Logical OR across all TCP flags in the packets transmitted from Source to Destination |
| TCP_FLAGS_REV | Logical OR across all TCP flags in the packets transmitted from Destination to Source |
| TLS_ALPN | The value of the Application-Layer Protocol Negotiation (ALPN) extension sent by the Server |
| TLS_JA3 | The JA3 fingerprint |
| TLS_SNI | The value of the Server Name Indication (SNI) extension sent by the Client |
The DoH resolvers in the dataset can be identified by IP addresses written in doh_resolver_ip.csv file.
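As a starting point, the sketch below labels flows as DoH or non-DoH with pandas, assuming one of the extracted CSV files and that doh_resolver_ip.csv lists resolver IP addresses in its first column; adjust the paths and column handling to the actual files.

```python
import pandas as pd

# Paths are illustrative; point them at one of the generated flow CSVs and at
# the resolver list shipped with the dataset.
flows = pd.read_csv("flows.csv")
resolvers = pd.read_csv("doh_resolver_ip.csv")
resolver_ips = set(resolvers.iloc[:, 0].astype(str))   # assumed: IPs in the first column

# A flow is treated as DoH if its destination is one of the listed resolvers.
flows["is_doh"] = flows["DST_IP"].astype(str).isin(resolver_ips)
print(flows["is_doh"].value_counts())

# Flow duration from the documented timestamp format YYYY-MM-DDTHH-MM-SS.
fmt = "%Y-%m-%dT%H-%M-%S"
duration = (pd.to_datetime(flows["TIME_LAST"], format=fmt)
            - pd.to_datetime(flows["TIME_FIRST"], format=fmt))
print(duration.describe())
```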
The main part of the dataset is located in DoH-Gen-F-FGHOQS.tar.gz and has the following structure:
.
└── data                 | - Main directory with data
    └── generated        | - Directory with generated captures
        ├── pcap         | - Generated PCAPs
        │   └── firefox
        └── tls-flow-csv | - Generated CSV flow data
            └── firefox
Total stats of generated data:

| Name | Value |
|---|---|
| Total Data Size | 46.7 GB |
| Total files | 12 |
| DoH extracted TLS flows | ~98 K |
| Non-DoH extracted TLS flows | ~353 K |
DoH Server information
Please cite the original article:
@article{Jerabek2022,
title = {Collection of datasets with DNS over HTTPS traffic},
journal = {Data in Brief},
volume = {42},
pages = {108310},
year = {2022},
issn = {2352-3409},
doi = {https://doi.org/10.1016/j.dib.2022.108310},
url = {https://www.sciencedirect.com/science/article/pii/S2352340922005121},
author = {Kamil Jeřábek and Karel Hynek and Tomáš Čejka and Ondřej Ryšavý}
}
The Digital Geomorphic-GIS Map of Cape Lookout National Seashore, North Carolina (1:24,000 scale 2008 mapping) is composed of GIS data layers and GIS tables, and is available in the following GRI-supported GIS data formats: 1.) an ESRI file geodatabase (calo_geomorphology.gdb), 2.) an Open Geospatial Consortium (OGC) geopackage, and 3.) a 2.2 KMZ/KML file for use in Google Earth; however, this format version of the map is limited in the data layers presented and in access to GRI ancillary table information. The file geodatabase format is supported with a 1.) ArcGIS Pro 3.X map file (.mapx) file (calo_geomorphology.mapx) and individual Pro 3.X layer (.lyrx) files (for each GIS data layer). The OGC geopackage is supported with a QGIS project (.qgz) file. Upon request, the GIS data is also available in ESRI shapefile format. Contact Stephanie O'Meara (see contact information below) to acquire the GIS data in these GIS data formats. In addition to the GIS data and supporting GIS files, three additional files comprise a GRI digital geologic-GIS dataset or map: 1.) a readme file (calo_geology_gis_readme.pdf), 2.) the GRI ancillary map information document (.pdf) file (calo_geomorphology.pdf), which contains geologic unit descriptions, as well as other ancillary map information and graphics from the source map(s) used by the GRI in the production of the GRI digital geologic-GIS data for the park, and 3.) a user-friendly FAQ PDF version of the metadata (calo_geomorphology_metadata_faq.pdf). Please read the calo_geology_gis_readme.pdf for information pertaining to the proper extraction of the GIS data and other map files. Google Earth software is available for free at: https://www.google.com/earth/versions/. QGIS software is available for free at: https://www.qgis.org/en/site/. Users are encouraged to only use the Google Earth data for basic visualization, and to use the GIS data for any type of data analysis or investigation. The data were completed as a component of the Geologic Resources Inventory (GRI) program, a National Park Service (NPS) Inventory and Monitoring (I&M) Division funded program that is administered by the NPS Geologic Resources Division (GRD). For a complete listing of GRI products visit the GRI publications webpage: https://www.nps.gov/subjects/geology/geologic-resources-inventory-products.htm. For more information about the Geologic Resources Inventory Program visit the GRI webpage: https://www.nps.gov/subjects/geology/gri.htm. At the bottom of that webpage is a "Contact Us" link if you need additional information. You may also directly contact the program coordinator, Jason Kenworthy (jason_kenworthy@nps.gov). Source geologic maps and data used to complete this GRI digital dataset were provided by the following: North Carolina Geological Survey. Detailed information concerning the sources used and their contribution to the GRI product are listed in the Source Citation section(s) of this metadata record (calo_geomorphology_metadata.txt or calo_geomorphology_metadata_faq.pdf). Users of this data are cautioned about the locational accuracy of features within this dataset. Based on the source map scale of 1:24,000 and United States National Map Accuracy Standards, features are within (horizontally) 12.2 meters or 40 feet of their actual location as presented by this dataset. Users of this data should thus not assume the location of features is exactly where they are portrayed in Google Earth, ArcGIS Pro, QGIS or other software used to display this dataset.
All GIS and ancillary tables were produced as per the NPS GRI Geology-GIS Geodatabase Data Model v. 2.3. (available at: https://www.nps.gov/articles/gri-geodatabase-model.htm).
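For users working with the OGC geopackage distribution, a minimal Python sketch using geopandas and fiona is shown below; the geopackage file name is an assumption (the description names only the .gdb explicitly), so substitute the name of the file you receive.

```python
import fiona
import geopandas as gpd

# Assumed geopackage file name; replace with the actual file provided by the GRI.
GPKG = "calo_geomorphology.gpkg"

# Enumerate the GIS data layers and print basic information for each.
for layer in fiona.listlayers(GPKG):
    gdf = gpd.read_file(GPKG, layer=layer)
    print(layer, len(gdf), gdf.crs)
```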
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in the German language followed, with 5.6 percent.

English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.
We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.
Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo.
Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA published in 2012 and 2016, combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories.
Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results.
We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Quantifying Iconicity - Zenodo
This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
The core dataset consists of .tsv-files with the URLs that refer to the webpages. Other metadata provided by the GCV API is also found in the file and manually generated metadata. This includes:
- the URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
- the language found by the langid Python module, along with the normalized score
- the labels associated with the image by Google
- the scrape date
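A minimal pandas sketch for loading one of the .tsv files is shown below; the file name and column names are assumptions based on the fields listed above, so adjust them to the actual headers in the files.

```python
import pandas as pd

# File name is illustrative; the column names used below ("page_url",
# "language") are assumptions derived from the metadata description above.
df = pd.read_csv("che_guevara.tsv", sep="\t")

print(df.columns.tolist())              # inspect the real header first
print(df["page_url"].nunique())         # assumed column: unique hosting webpages
print(df["language"].value_counts())    # assumed column: languages detected by langid
```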
Alongside the .tsv-files, there are several other elements in the following folder structure:
├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv
The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages, and for this reason not all training texts have associated metadata. The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters. The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.

Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags, such as <p>, <h1>, etc.
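A minimal OpenCV sketch of the keypoint threshold described above; note that the original pipeline may also have matched keypoints against the reference photograph, which would require an additional descriptor-matching step.

```python
import cv2

# Keep an image only if SIFT detects at least ten keypoints; images with fewer
# keypoints are treated as noise, as in the filtering step described above.
sift = cv2.SIFT_create()

def passes_keypoint_threshold(path, min_keypoints=10):
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        return False                      # unreadable file, discard
    keypoints = sift.detect(image, None)
    return len(keypoints) >= min_keypoints
```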
https://www.icpsr.umich.edu/web/ICPSR/studies/27681/terms
The BJS Census of State and Local Law Enforcement Agencies (CSLLEA) is conducted every 4 years to provide a complete enumeration of agencies and their employees. Employment data are reported by agencies for sworn and nonsworn (civilian) personnel and, within these categories, by full-time or part-time status. The pay period that included September 30, 2008, was the reference date for all personnel data. Agencies also complete a checklist of functions they regularly perform, or have primary responsibility for, within the following areas: patrol and response, criminal investigation, traffic and vehicle-related functions, detention-related functions, court-related functions, special public safety functions (e.g., animal control), task force participation, and specialized functions (e.g., search and rescue). The CSLLEA provides national data on the number of state and local law enforcement agencies and employees for local police departments, sheriffs' offices, state law enforcement agencies, and special jurisdiction agencies. It also serves as the sampling frame for BJS surveys of law enforcement agencies.
The Digital Geologic-GIS Map of Sagamore Hill National Historic Site and Vicinity, New York is composed of GIS data layers and GIS tables, and is available in the following GRI-supported GIS data formats: 1.) a 10.1 file geodatabase (sahi_geology.gdb), 2.) an Open Geospatial Consortium (OGC) geopackage, and 3.) a 2.2 KMZ/KML file for use in Google Earth; however, this format version of the map is limited in the data layers presented and in access to GRI ancillary table information. The file geodatabase format is supported with a 1.) ArcGIS Pro map file (.mapx) file (sahi_geology.mapx) and individual Pro layer (.lyrx) files (for each GIS data layer), as well as with a 2.) 10.1 ArcMap (.mxd) map document (sahi_geology.mxd) and individual 10.1 layer (.lyr) files (for each GIS data layer). The OGC geopackage is supported with a QGIS project (.qgz) file. Upon request, the GIS data is also available in ESRI 10.1 shapefile format. Contact Stephanie O'Meara (see contact information below) to acquire the GIS data in these GIS data formats. In addition to the GIS data and supporting GIS files, three additional files comprise a GRI digital geologic-GIS dataset or map: 1.) a GIS readme file (sahi_geology_gis_readme.pdf), 2.) the GRI ancillary map information document (.pdf) file (sahi_geology.pdf), which contains geologic unit descriptions, as well as other ancillary map information and graphics from the source map(s) used by the GRI in the production of the GRI digital geologic-GIS data for the park, and 3.) a user-friendly FAQ PDF version of the metadata (sahi_geology_metadata_faq.pdf). Please read the sahi_geology_gis_readme.pdf for information pertaining to the proper extraction of the GIS data and other map files. Google Earth software is available for free at: https://www.google.com/earth/versions/. QGIS software is available for free at: https://www.qgis.org/en/site/. Users are encouraged to only use the Google Earth data for basic visualization, and to use the GIS data for any type of data analysis or investigation. The data were completed as a component of the Geologic Resources Inventory (GRI) program, a National Park Service (NPS) Inventory and Monitoring (I&M) Division funded program that is administered by the NPS Geologic Resources Division (GRD). For a complete listing of GRI products visit the GRI publications webpage: https://www.nps.gov/subjects/geology/geologic-resources-inventory-products.htm. For more information about the Geologic Resources Inventory Program visit the GRI webpage: https://www.nps.gov/subjects/geology/gri.htm. At the bottom of that webpage is a "Contact Us" link if you need additional information. You may also directly contact the program coordinator, Jason Kenworthy (jason_kenworthy@nps.gov). Source geologic maps and data used to complete this GRI digital dataset were provided by the following: U.S. Geological Survey. Detailed information concerning the sources used and their contribution to the GRI product are listed in the Source Citation section(s) of this metadata record (sahi_geology_metadata.txt or sahi_geology_metadata_faq.pdf). Users of this data are cautioned about the locational accuracy of features within this dataset. Based on the source map scale of 1:62,500 and United States National Map Accuracy Standards, features are within (horizontally) 31.8 meters or 104.2 feet of their actual location as presented by this dataset.
Users of this data should thus not assume the location of features is exactly where they are portrayed in Google Earth, ArcGIS, QGIS or other software used to display this dataset. All GIS and ancillary tables were produced as per the NPS GRI Geology-GIS Geodatabase Data Model v. 2.3. (available at: https://www.nps.gov/articles/gri-geodatabase-model.htm).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of around 2,000 HTML pages: these web pages contain the search results obtained in response to queries for different products, searched for by a set of synthetic users surfing Google Shopping (US version) from different locations in July 2016.
Each file in the collection has a name that indicates the location from which the search was done, the userID, and the searched product: no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
The locations are Philippines (PHI), United States (US), India (IN). The userIDs: 26 to 30 for users searching from Philippines, 1 to 5 from US, 11 to 15 from India.
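A small Python sketch for recovering the location, userID, and product from a file name, assuming an underscore separates LOCATION and USERID in the pattern above; the example file name is illustrative only.

```python
import re

# Pattern derived from no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html;
# the underscore between LOCATION and USERID is an assumption.
PATTERN = re.compile(
    r"no_email_(?P<location>[A-Z]+)_(?P<user_id>\d+)\."
    r"(?P<product>.+?)\.shopping_testing\.(?P<num>\d+)\.html$"
)

name = "no_email_PHI_26.mp3 player.shopping_testing.1.html"  # illustrative example
match = PATTERN.match(name)
if match:
    print(match.group("location"), match.group("user_id"), match.group("product"))
```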
Products have been chosen following 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television, etc.).
In the following, we describe how the search results have been collected.
Each user has a fresh profile. Creating a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page.
To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links.
A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites could be designed to behave according to user agents, as witnessed by the differences between the mobile and desktop versions of the same website.
The prices are the retail ones displayed by Google Shopping in US dollars (thus, excluding shipping fees).
Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM. OpenWPM is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each of them with their own associated cookies.
The experiments run, on average, for 24 hours. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (e.g., to India) via tunneling through SOCKS proxies. This way, all commands are simultaneously distributed over all proxies. The experiments adopt the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. Also, for each query, we consider the first page of results, counting 40 products. Among them, the focus of the experiments is mostly on the top 10 and top 3 results.
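As an illustration of the proxy setup, the sketch below points a Selenium-driven Firefox instance at a SOCKS proxy; the host, port, and target URL are placeholders, and the original experiments used OpenWPM rather than bare Selenium.

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Route all Firefox traffic through a SOCKS proxy (placeholder host/port),
# so requests appear to originate from the proxy's vantage point.
options = Options()
options.set_preference("network.proxy.type", 1)            # manual proxy configuration
options.set_preference("network.proxy.socks", "127.0.0.1")
options.set_preference("network.proxy.socks_port", 9050)
options.set_preference("network.proxy.socks_remote_dns", True)

driver = webdriver.Firefox(options=options)
driver.get("https://www.google.com/shopping")              # illustrative target page
print(driver.title)
driver.quit()
```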
Due to connection errors, one of the Philippine profiles has no associated results. Also, for the Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for the US, no results were returned for totes and umbrellas.
The search results have been analyzed in order to check whether there was evidence of price steering based on users' location.
One term of usage applies:
In any research product whose findings are based on this dataset, please cite
@inproceedings{DBLP:conf/ircdl/CozzaHPN19,
  author = {Vittoria Cozza and Van Tien Hoang and Marinella Petrocchi and Rocco {De Nicola}},
  title = {Transparency in Keyword Faceted Search: An Investigation on Google Shopping},
  booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings},
  pages = {29--43},
  year = {2019},
  crossref = {DBLP:conf/ircdl/2019},
  url = {https://doi.org/10.1007/978-3-030-11226-4_3},
  doi = {10.1007/978-3-030-11226-4_3},
  timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
  biburl = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
The Digital Geologic-GIS Map of Joshua Tree National Park, California is composed of GIS data layers and GIS tables, and is available in the following GRI-supported GIS data formats: 1.) a 10.1 file geodatabase (jotr_geology.gdb), 2.) an Open Geospatial Consortium (OGC) geopackage, and 3.) a 2.2 KMZ/KML file for use in Google Earth; however, this format version of the map is limited in the data layers presented and in access to GRI ancillary table information. The file geodatabase format is supported with a 1.) ArcGIS Pro map file (.mapx) file (jotr_geology.mapx) and individual Pro layer (.lyrx) files (for each GIS data layer), as well as with a 2.) 10.1 ArcMap (.mxd) map document (jotr_geology.mxd) and individual 10.1 layer (.lyr) files (for each GIS data layer). The OGC geopackage is supported with a QGIS project (.qgz) file. Upon request, the GIS data is also available in ESRI 10.1 shapefile format. Contact Stephanie O'Meara (see contact information below) to acquire the GIS data in these GIS data formats. In addition to the GIS data and supporting GIS files, three additional files comprise a GRI digital geologic-GIS dataset or map: 1.) a readme file (jotr_geology_gis_readme.pdf), 2.) the GRI ancillary map information document (.pdf) file (jotr_geology.pdf), which contains geologic unit descriptions, as well as other ancillary map information and graphics from the source map(s) used by the GRI in the production of the GRI digital geologic-GIS data for the park, and 3.) a user-friendly FAQ PDF version of the metadata (jotr_geology_metadata_faq.pdf). Please read the jotr_geology_gis_readme.pdf for information pertaining to the proper extraction of the GIS data and other map files. Google Earth software is available for free at: https://www.google.com/earth/versions/. QGIS software is available for free at: https://www.qgis.org/en/site/. Users are encouraged to only use the Google Earth data for basic visualization, and to use the GIS data for any type of data analysis or investigation. The data were completed as a component of the Geologic Resources Inventory (GRI) program, a National Park Service (NPS) Inventory and Monitoring (I&M) Division funded program that is administered by the NPS Geologic Resources Division (GRD). For a complete listing of GRI products visit the GRI publications webpage: https://www.nps.gov/subjects/geology/geologic-resources-inventory-products.htm. For more information about the Geologic Resources Inventory Program visit the GRI webpage: https://www.nps.gov/subjects/geology/gri.htm. At the bottom of that webpage is a "Contact Us" link if you need additional information. You may also directly contact the program coordinator, Jason Kenworthy (jason_kenworthy@nps.gov). Source geologic maps and data used to complete this GRI digital dataset were provided by the following: U.S. Geological Survey and ESRI. Detailed information concerning the sources used and their contribution to the GRI product are listed in the Source Citation section(s) of this metadata record (jotr_geology_metadata.txt or jotr_geology_metadata_faq.pdf). Users of this data are cautioned about the locational accuracy of features within this dataset. Based on the source map scale of 1:100,000 and United States National Map Accuracy Standards, features are within (horizontally) 50.8 meters or 166.7 feet of their actual location as presented by this dataset.
Users of this data should thus not assume the location of features is exactly where they are portrayed in Google Earth, ArcGIS, QGIS or other software used to display this dataset. All GIS and ancillary tables were produced as per the NPS GRI Geology-GIS Geodatabase Data Model v. 2.3. (available at: https://www.nps.gov/articles/gri-geodatabase-model.htm).
OpenStreetMap (openstreetmap.org) is a global collaborative mapping project, which offers maps and map data released with an open license, encouraging free re-use and re-distribution. The data is created by a large community of volunteers who use a variety of simple on-the-ground surveying techniques, and wiki-style editing tools to collaborate as they create the maps, in a process which is open to everyone. The project originated in London, and an active community of mappers and developers are based here. Mapping work in London is ongoing (and you can help!) but the coverage is already good enough for many uses.
Browse the map of London on OpenStreetMap.org
The whole of England updated daily:
For more details of downloads available from OpenStreetMap, including downloading the whole planet, see 'planet.osm' on the wiki.
Download small areas of the map by bounding-box. For example this URL requests the data around Trafalgar Square:
http://api.openstreetmap.org/api/0.6/map?bbox=-0.13062,51.5065,-0.12557,51.50969
Data filtered by "tag". For example this URL returns all elements in London tagged shop=supermarket:
http://www.informationfreeway.org/api/0.6/*[shop=supermarket][bbox=-0.48,51.30,0.21,51.70]
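A small Python sketch for fetching and parsing the Trafalgar Square bounding-box example above via the main API; only the api.openstreetmap.org endpoint is shown here, since the tag-filtered URL is served by a separate third-party service.

```python
import requests
import xml.etree.ElementTree as ET

# Request the raw OSM XML for a small bounding box around Trafalgar Square.
url = "https://api.openstreetmap.org/api/0.6/map"
params = {"bbox": "-0.13062,51.5065,-0.12557,51.50969"}

response = requests.get(url, params=params, timeout=60)
response.raise_for_status()

root = ET.fromstring(response.content)
nodes = root.findall("node")
ways = root.findall("way")
print(len(nodes), "nodes,", len(ways), "ways")

# Each element carries name=value pairs ("tags") describing its properties.
for way in ways[:5]:
    tags = {tag.get("k"): tag.get("v") for tag in way.findall("tag")}
    print(way.get("id"), tags)
```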
The format of the data is a raw XML representation of all the elements making up the map. OpenStreetMap is composed of interconnected "nodes" and "ways" (and sometimes "relations"), each with a set of name=value pairs called "tags". These classify and describe properties of the elements, and ultimately influence how they get drawn on the map. To understand more about tags, and different ways of working with this data format, refer to the following pages on the OpenStreetMap wiki.
Rather than working with raw map data, you may prefer to embed maps from OpenStreetMap on your website with a simple bit of JavaScript. You can also present overlays of other data, in a manner very similar to working with Google Maps. In fact you can even use the Google Maps API to do this. See OSM on your own website for details and links to various JavaScript map libraries.
The OpenStreetMap project aims to attract large numbers of contributors who all chip in a little bit to help build the map. Although the map editing tools take a little while to learn, they are designed to be as simple as possible, so that everyone can get involved. This project offers an exciting means of allowing local London communities to take ownership of their part of the map.
Read about how to Get Involved and see the London page for details of OpenStreetMap community events.
The Digital Quaternary Surficial Geologic-GIS Map of Santa Barbara Island, California is composed of GIS data layers and GIS tables, and is available in the following GRI-supported GIS data formats: 1.) a 10.1 file geodatabase (saba_surficial_geology.gdb), 2.) an Open Geospatial Consortium (OGC) geopackage, and 3.) a 2.2 KMZ/KML file for use in Google Earth; however, this format version of the map is limited in the data layers presented and in access to GRI ancillary table information. The file geodatabase format is supported with a 1.) ArcGIS Pro map file (.mapx) file (saba_surficial_geology.mapx) and individual Pro layer (.lyrx) files (for each GIS data layer), as well as with a 2.) 10.1 ArcMap (.mxd) map document (saba_surficial_geology.mxd) and individual 10.1 layer (.lyr) files (for each GIS data layer). The OGC geopackage is supported with a QGIS project (.qgz) file. Upon request, the GIS data is also available in ESRI 10.1 shapefile format. Contact Stephanie O'Meara (see contact information below) to acquire the GIS data in these GIS data formats. In addition to the GIS data and supporting GIS files, three additional files comprise a GRI digital geologic-GIS dataset or map: 1.) a readme file (chis_geology_gis_readme.pdf), 2.) the GRI ancillary map information document (.pdf) file (chis_geology.pdf), which contains geologic unit descriptions, as well as other ancillary map information and graphics from the source map(s) used by the GRI in the production of the GRI digital geologic-GIS data for the park, and 3.) a user-friendly FAQ PDF version of the metadata (saba_surficial_geology_metadata_faq.pdf). Please read the chis_geology_gis_readme.pdf for information pertaining to the proper extraction of the GIS data and other map files. Google Earth software is available for free at: https://www.google.com/earth/versions/. QGIS software is available for free at: https://www.qgis.org/en/site/. Users are encouraged to only use the Google Earth data for basic visualization, and to use the GIS data for any type of data analysis or investigation. The data were completed as a component of the Geologic Resources Inventory (GRI) program, a National Park Service (NPS) Inventory and Monitoring (I&M) Division funded program that is administered by the NPS Geologic Resources Division (GRD). For a complete listing of GRI products visit the GRI publications webpage: https://www.nps.gov/subjects/geology/geologic-resources-inventory-products.htm. For more information about the Geologic Resources Inventory Program visit the GRI webpage: https://www.nps.gov/subjects/geology/gri.htm. At the bottom of that webpage is a "Contact Us" link if you need additional information. You may also directly contact the program coordinator, Jason Kenworthy (jason_kenworthy@nps.gov). Source geologic maps and data used to complete this GRI digital dataset were provided by the following: U.S. Geological Survey. Detailed information concerning the sources used and their contribution to the GRI product are listed in the Source Citation section(s) of this metadata record (saba_surficial_geology_metadata.txt or saba_surficial_geology_metadata_faq.pdf). Users of this data are cautioned about the locational accuracy of features within this dataset. Based on the source map scale of 1:12,000 and United States National Map Accuracy Standards, features are within (horizontally) 6.1 meters or 20 feet of their actual location as presented by this dataset.
Users of this data should thus not assume the location of features is exactly where they are portrayed in Google Earth, ArcGIS, QGIS or other software used to display this dataset. All GIS and ancillary tables were produced as per the NPS GRI Geology-GIS Geodatabase Data Model v. 2.3. (available at: https://www.nps.gov/articles/gri-geodatabase-model.htm).
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Below you'll find a month by month breakdown of traffic on the australia.gov.au website along the following lines: Pageviews, Visits, Pages per visit, Average time on page, and Devices. This data is generated using Google Analytics. Please note: this is an initial version of the data only. We're looking forward to hearing your feedback on what other metrics are of interest to you. Please let us know by sending an email to data@digital.gov.au.
https://choosealicense.com/licenses/other/
Dataset Card for Conceptual Captions
Dataset Summary
Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/conceptual_captions.
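For quick experimentation, the captions can be loaded with the Hugging Face datasets library; the config name and field names below ("unlabeled", "image_url", "caption") are assumptions to verify against the dataset page linked above.

```python
from datasets import load_dataset

# Minimal sketch; config and field names are assumptions taken from the
# dataset card and should be checked against the linked page.
dataset = load_dataset(
    "google-research-datasets/conceptual_captions",
    "unlabeled",
    split="train",
)

print(len(dataset))   # on the order of ~3.3M caption/URL pairs
print(dataset[0])     # e.g. {"image_url": "...", "caption": "..."}
```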