34 datasets found

Google Trends - International
console.cloud.google.com
Updated Jul 25, 2019
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program (2019). Google Trends - International [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/google-trends-intl
Dataset updated
Jul 25, 2019
Dataset provided by
Google Search (http://google.com/)
BigQuery (https://cloud.google.com/bigquery)
Google (http://google.com/)
Description
The International Google Trends dataset will provide critical signals that individual users and businesses alike can leverage to make better data-driven decisions. This dataset simplifies the manual interaction with the existing Google Trends UI by automating and exposing anonymized, aggregated, and indexed search data in BigQuery. This dataset includes the Top 25 stories and Top 25 Rising queries from Google Trends. It will be made available as two separate BigQuery tables, with a set of new top terms appended daily. Each set of Top 25 and Top 25 rising expires after 30 days, and will be accompanied by a rolling five-year window of historical data for each country and region across the globe, where data is available. This Google dataset is hosted in Google BigQuery as part of Google Cloud's Datasets solution and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
Global Top Chart Searches in 21st Century
kaggle.com
zip
Updated Apr 16, 2023
Cite
Sanjay (2023). Global Top Chart Searches in 21st Century [Dataset]. https://www.kaggle.com/datasets/sanjay277/global-top-chart-searches-in-21st-century
Available download formats: zip (2476 bytes)
Dataset updated
Apr 16, 2023
Authors
Sanjay
Description
This Kaggle dataset provides a list of the top global Google searches for the years 2001-2023. For each entry, the dataset includes the search term and the year in which it was trending. This information can be used for a variety of purposes, including trend analysis, market research, and data visualization. With this dataset, users can gain insights into popular search trends and topics and how they have evolved over time.
Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...
data.niaid.nih.gov
Updated Mar 1, 2023
Cite
Haak, Fabian; Schaer, Philipp (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
Dataset updated
Mar 1, 2023
Dataset provided by
Technische Hochschule Köln
Authors
Haak, Fabian; Schaer, Philipp
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; its location is saved in "location".

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
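The query-extension step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the description (one root query plus 26 letter extensions, ten suggestions each); the function name `build_query_inputs` and the constant are hypothetical and are not taken from the authors' scraper.

```python
import string

# Stated in the description: ten suggestions are retrieved per query input.
SUGGESTIONS_PER_INPUT = 10


def build_query_inputs(root_term: str) -> list[str]:
    """Return the root term followed by the root term extended with a..z."""
    extended = [f"{root_term} {letter}" for letter in string.ascii_lowercase]
    return [root_term] + extended


inputs = build_query_inputs("democrats")
# 27 query inputs (root + 26 letters) times 10 suggestions each.
max_suggestions = len(inputs) * SUGGESTIONS_PER_INPUT
print(inputs[:3], max_suggestions)
```

The upper bound computed here matches the "up to 270 query suggestions per topic and search engine" given in the description; the actual number can be lower when an engine returns fewer than ten suggestions.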

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
Transparency in Keyword Faceted Search: a dataset of Google Shopping html...
data.europa.eu
zenodo.org
unknown
Updated Nov 12, 2025
Cite
Zenodo (2025). Transparency in Keyword Faceted Search: a dataset of Google Shopping html pages [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-1491557/embed
Available download formats: unknown (1069031412)
Dataset updated
Nov 12, 2025
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a collection of around 2,000 HTML pages. These pages contain the search results returned for queries about different products, issued by a set of synthetic users browsing Google Shopping (US version) from different locations in July 2016. Each file name indicates the location from which the search was made, the userID, and the searched product:

no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html

The locations are Philippines (PHI), United States (US), and India (IN). The userIDs are 26 to 30 for users searching from Philippines, 1 to 5 from US, and 11 to 15 from India. Products were chosen from 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television, etc.).

In the following, we describe how the search results were collected. Each user has a fresh profile. Creating a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page. To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links. A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites could be designed to behave according to user agents, as witnessed by the differences between the mobile and desktop versions of the same website. The prices are the retail ones displayed by Google Shopping in US dollars (thus, excluding shipping fees).

Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM. OpenWPM is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each with its own associated cookies. The experiments run, on average, for 24 hours. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (i.e., to India) via tunneling through SOCKS proxies. This way, all commands are simultaneously distributed over all proxies. The experiments adopt the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. Also, for each query, we consider the first page of results, counting 40 products. Among them, the focus of the experiments is mostly on the top 10 and top 3 results.

Due to connection errors, one of the Philippine profiles has no associated results. Also, for Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for US, no results were returned for totes and umbrellas. The search results have been analyzed to check whether there was evidence of price steering based on users' location.

One term of usage applies: In any research product whose findings are based on this dataset, please cite @inproceedings{DBLP:conf/ircdl/CozzaHPN19, author = {Vittoria Cozza and Van Tien Hoang and Marinella Petrocchi and Rocco {De Nicola}}, title = {Transparency in Keyword Faceted Search: An Investigation on Google Shopping}, booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings}, pages = {29--43}, year = {2019}, crossref = {DBLP:conf/ircdl/2019}, url = {https://doi.org/10.1007/978-3-030-11226-4_3}, doi = {10.1007/978-3-030-11226-4_3}, timestamp = {Fri, 18 Jan 2019 23:22:50 +0100}, biburl = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19}, bibsource = {dblp computer science bibliography, https://dblp.org} }
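The file-naming scheme given in the description (no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html) lends itself to a small parser. The following Python sketch is an assumption-laden illustration: the regular expression, the `PageInfo` type, the helper names, and the example file name are all hypothetical, and the description does not say how multi-word products such as "MP3 player" are encoded in real file names.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of the pattern: LOCATION is letters, USERID and "#" are digits,
# PRODUCT is whatever sits between the user ID and ".shopping_testing".
FILENAME_RE = re.compile(
    r"^no_email_(?P<location>[A-Za-z]+)_(?P<user_id>\d+)\."
    r"(?P<product>.+)\.shopping_testing\.(?P<index>\d+)\.html$"
)

# User ID ranges per location, as listed in the description.
USER_ID_RANGES = {"PHI": range(26, 31), "US": range(1, 6), "IN": range(11, 16)}


class PageInfo(NamedTuple):
    location: str
    user_id: int
    product: str
    index: int


def parse_page_name(name: str) -> Optional[PageInfo]:
    """Split a file name into its parts, or return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return PageInfo(
        match["location"], int(match["user_id"]), match["product"], int(match["index"])
    )


def user_matches_location(info: PageInfo) -> bool:
    """Check that the user ID falls in the range documented for the location."""
    return info.user_id in USER_ID_RANGES.get(info.location, range(0))


# Hypothetical file name built from the documented pattern.
info = parse_page_name("no_email_US_3.Television.shopping_testing.1.html")
print(info, user_matches_location(info))
```

Checking the user ID against the documented range is a cheap sanity test when grouping pages by location, for example before comparing prices across the three countries.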
Data for: Managing Retractions and their Afterlife: A Tripartite Framework...
zenodo.org
data-staging.niaid.nih.gov
csv, txt
Updated Feb 5, 2026
Cite
Renata Gonçalves Curty (2026). Data for: Managing Retractions and their Afterlife: A Tripartite Framework for Research Datasets [Dataset]. http://doi.org/10.5281/zenodo.14783213
Available download formats: csv, txt
Unique identifier
https://doi.org/10.5281/zenodo.14783213
Dataset updated
Feb 5, 2026
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Renata Gonçalves Curty
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 31, 2025
Description
This deposit includes supplementary data files for the paper "Managing Retractions and Their Afterlife: A Tripartite Framework for Research Datasets," which has been accepted for presentation at the International Digital Curation Conference (IDCC) 2025 in The Hague, Netherlands (https://dcc.ac.uk/events/idcc25).

The dataset consists of retraction records crawled via Google Dataset Search queries, along with the selected sample and annotated data resulting from the assessment of these records. A README file is included to provide further details on the CSV files and their structure.

Abstract: Retractions serve as a critical, albeit last-resort, post-publication correction mechanism in scholarly publishing, playing an important role in upholding the integrity of the scientific record. By formally retracting flawed or misleading research, the scientific community mitigates the harm caused by errors or misconduct that may have escaped detection during peer review. While retractions of research articles have been extensively discussed across scientific disciplines and are well-integrated into most publishers' workflows, the retraction of research datasets remains underexplored and rarely implemented. This paper seeks to address this gap by reviewing recent developments in this area, analyzing a sample of publicly available retracted dataset records considering existing recommendations and guidelines, and putting forward a few points for discussions—particularly for cases where datasets have been published and correction is no longer feasible, or when all efforts to amend the dataset have been exhausted. These considerations are framed into three main categories: (1) preventive actions and timely response, (2) purposeful damage control, and (3) community engagement and shared standards. Although still preliminary, this framework aims to help entertain future debates and inform actionable strategies for addressing the unique challenges of managing retracted datasets where scientific rigor has been compromised. By contributing to the discussion on dataset retractions, this work seeks to better equip data curators, repository managers, and other stakeholders with tools to enhance accountability and transparency throughout the data preservation process while also helping to mitigate the error cascade effect in science.
DataForSEO Google Full (Keywords+SERP) database, historical data available
datarade.ai
.json, .csv
Updated Aug 17, 2023
Cite
DataForSEO (2023). DataForSEO Google Full (Keywords+SERP) database, historical data available [Dataset]. https://datarade.ai/data-products/dataforseo-google-full-keywords-serp-database-historical-d-dataforseo
Available download formats: .json, .csv
Dataset updated
Aug 17, 2023
Dataset authored and provided by
DataForSEO
Area covered
Sweden, Portugal, Costa Rica, Burkina Faso, Bolivia (Plurinational State of), South Africa, United Kingdom, Côte d'Ivoire, Cyprus, Paraguay
Description
You can check the fields description in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.

Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.

Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.

Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.

This database is available in JSON format only.

You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.
Data from: Inventory of online public databases and repositories holding...
agdatacommons.nal.usda.gov
data.wu.ac.at
txt
Updated Feb 8, 2024
Cite
Erin Antognoli; Jonathan Sears; Cynthia Parr (2024). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. http://doi.org/10.15482/USDA.ADC/1389839
Available download formats: txt
Unique identifier
https://doi.org/10.15482/USDA.ADC/1389839
Dataset updated
Feb 8, 2024
Dataset provided by
Ag Data Commons
Authors
Erin Antognoli; Jonathan Sears; Cynthia Parr
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.

Purpose

As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to

establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals

compare how much data is in institutional vs. domain-specific vs. federal platforms

determine which repositories are recommended by top journals that require or recommend the publication of supporting data

ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

Approach

The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

Search methods

We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.

We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.

Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo.

Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

Evaluation

We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results

A summary of the major findings from our data review:

Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.

There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.

Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
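The thresholds used in the evaluation (a search term making up at least 1% or 5% of a collection, and returning more than 100 or more than 500 results) can be expressed as a small helper. This Python sketch is only an illustration: the function name `evaluate_term`, the dictionary keys, and the example numbers are hypothetical and do not come from the dataset files.

```python
def evaluate_term(hits: int, total_records: int) -> dict:
    """Flag one search term against the thresholds named in the evaluation."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    share_pct = 100.0 * hits / total_records
    return {
        "share_pct": share_pct,
        "at_least_1_pct": share_pct >= 1.0,
        "at_least_5_pct": share_pct >= 5.0,
        "over_100_results": hits > 100,
        "over_500_results": hits > 500,
    }


# Hypothetical repository: 620 hits for a term in a collection of 10,000 records.
flags = evaluate_term(hits=620, total_records=10_000)
print(flags)
```

A repository of the kind highlighted in the findings would have both `over_500_results` and `at_least_5_pct` set for at least one ag search term.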

See included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:

Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Data from: Bibliographic dataset characterizing studies that use online...
zenodo.org
portalcientifico.unav.edu
bin, csv
Updated Jan 24, 2020
Cite
Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick (2020). Bibliographic dataset characterizing studies that use online biodiversity databases [Dataset]. http://doi.org/10.5281/zenodo.2589439
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.2589439
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes bibliographic information for 501 papers that were published from 2010-April 2017 (time of search) and use online biodiversity databases for research purposes. Our overarching goal in this study is to determine how research uses of biodiversity data developed during a time of unprecedented growth of online data resources. We also determine uses with the highest number of citations, how online occurrence data are linked to other data types, and if/how data quality is addressed. Specifically, we address the following questions:

1.) What primary biodiversity databases have been cited in published research, and which databases have been cited most often?

2.) Is the biodiversity research community citing databases appropriately, and are the cited databases currently accessible online?

3.) What are the most common uses, general taxa addressed, and data linkages, and how have they changed over time?

4.) What uses have the highest impact, as measured through the mean number of citations per year?

5.) Are certain uses applied more often for plants/invertebrates/vertebrates?

6.) Are links to specific data types associated more often with particular uses?

7.) How often are major data quality issues addressed?

8.) What data quality issues tend to be addressed for the top uses?

Relevant papers for this analysis include those that use online and openly accessible primary occurrence records, or those that add data to an online database. Google Scholar (GS) provides full-text indexing, which was important to identify data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS. All authors discussed and agreed upon representative search terms, which were relatively broad to capture a variety of databases hosting primary occurrence records. The terms included: “species occurrence” database (8,800 results), “natural history collection” database (634 results), herbarium database (16,500 results), “biodiversity database” (3,350 results), “primary biodiversity data” database (483 results), “museum collection” database (4,480 results), “digital accessible information” database (10 results), and “digital accessible knowledge” database (52 results)--note that quotations are used as part of the search terms where specific phrases are needed in whole. We downloaded all records returned by each search (or the first 500 if there were more) into a Zotero reference management database. About one third of the 2500 papers in the final dataset were relevant. Three of the authors with specialized knowledge of the field characterized relevant papers using a standardized tagging protocol based on a series of key topics of interest. We developed a list of potential tags and descriptions for each topic, including: database(s) used, database accessibility, scale of study, region of study, taxa addressed, research use of data, other data types linked to species occurrence data, data quality issues addressed, authors, institutions, and funding sources. Each tagged paper was thoroughly checked by a second tagger.
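The download rule stated above (all records per search, or the first 500 if there were more) gives an upper bound on the number of records collected before any overlap between searches is removed. The Python sketch below only reproduces that arithmetic from the result counts quoted in the paragraph; the variable names are hypothetical, and the description does not say how records appearing in more than one search were handled.

```python
# Google Scholar result counts quoted in the description, keyed by search term.
RESULT_COUNTS = {
    '"species occurrence" database': 8800,
    '"natural history collection" database': 634,
    "herbarium database": 16500,
    '"biodiversity database"': 3350,
    '"primary biodiversity data" database': 483,
    '"museum collection" database': 4480,
    '"digital accessible information" database': 10,
    '"digital accessible knowledge" database': 52,
}

DOWNLOAD_CAP = 500  # "the first 500 if there were more"

# Records downloaded per search under the stated cap.
downloaded = {term: min(count, DOWNLOAD_CAP) for term, count in RESULT_COUNTS.items()}

# Upper bound on records collected, before removing any overlap between searches.
upper_bound = sum(downloaded.values())
print(upper_bound)
```

Five of the eight searches hit the cap, so most of the collected records come from the broadest search terms.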

The final dataset of tagged papers allows us to quantify general areas of research made possible by the expansion of online species occurrence databases, and trends over time. Analyses of these data will be published in a separate quantitative review.
A dataset of 5 million city trees from 63 US cities: species, location,...
data-staging.niaid.nih.gov
data.niaid.nih.gov
zip
Updated Aug 31, 2022
Cite
Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz (2022). A dataset of 5 million city trees from 63 US cities: species, location, nativity status, health, and more. [Dataset]. http://doi.org/10.5061/dryad.2jm63xsrf
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2jm63xsrf
Dataset updated
Aug 31, 2022
Dataset provided by
Harvard University
Worcester Polytechnic Institute
Cornell University
Stanford University
The Biota of North America Program (BONAP)
Authors
Dakota McCoy; Benjamin Goulet-Scott; Weilin Meng; Bulent Atahan; Hana Kiros; Misako Nishino; John Kartesz
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
United States
Description
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data come from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems.

Methods See eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.

Data Acquisition We limited our search to the 150 largest cities in the USA (by census population).

To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If a same-named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.

Data Cleaning All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not).

First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including “excellent,” “good”, “fair”, “poor”, “dead”, and “dead/dying”. Some cities included only three points on this scale (e.g., “good”, “poor”, “dead/dying”) while others included five (e.g., “excellent,” “good”, “fair”, “poor”, “dead”).

Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called “location_type” to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.
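The unit harmonization described above can be illustrated with a minimal sketch. It uses plain Python rather than pandas to stay dependency-free, and the column names (`dbh`, `dbh_unit`, `height`, `height_unit`) and sample rows are hypothetical, not taken from the authors' code in Data S5 or Data S9.

```python
# Standard conversion factors to metric units.
INCH_TO_CM = 2.54
FOOT_TO_M = 0.3048


def dbh_to_cm(value, unit):
    """Convert a diameter at breast height (DBH) to centimetres."""
    if unit == "in":
        return value * INCH_TO_CM
    if unit == "cm":
        return value
    raise ValueError(f"unknown DBH unit: {unit!r}")


def height_to_m(value, unit):
    """Convert a tree height to metres."""
    if unit == "ft":
        return value * FOOT_TO_M
    if unit == "m":
        return value
    raise ValueError(f"unknown height unit: {unit!r}")


# Hypothetical inventory rows; real inventories use city-specific columns.
rows = [
    {"tree_id": 1, "dbh": 10.0, "dbh_unit": "in", "height": 30.0, "height_unit": "ft"},
    {"tree_id": 2, "dbh": 25.0, "dbh_unit": "cm", "height": 9.0, "height_unit": "m"},
]

# Build one standardized record per tree with metric units only.
metric_rows = [
    {
        "tree_id": r["tree_id"],
        "dbh_cm": dbh_to_cm(r["dbh"], r["dbh_unit"]),
        "height_m": height_to_m(r["height"], r["height_unit"]),
    }
    for r in rows
]
```

Raising an error on an unrecognized unit, rather than guessing, mirrors the need to decide units case by case when an inventory does not specify them.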

Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia-- which is not a species name of a known tree-- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9).
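As a rough sketch of this step, a lookup table can be built from every record that reports both names and then used to fill in records that report only a common name. The field names and sample records are hypothetical, and flagging ambiguous pairings for review is an assumption standing in for the manual check the authors describe.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing scientific name.
records = [
    {"common_name": "red maple", "scientific_name": "Acer rubrum"},
    {"common_name": "Red Maple", "scientific_name": "Acer rubrum"},
    {"common_name": "southern magnolia", "scientific_name": "Magnolia grandiflora"},
    {"common_name": "southern magnolia", "scientific_name": None},
    {"common_name": "red maple", "scientific_name": None},
]


def build_lookup(rows):
    """Summarize all common/scientific pairings seen in the inventories."""
    pairs = defaultdict(set)
    for row in rows:
        if row["common_name"] and row["scientific_name"]:
            key = row["common_name"].strip().lower()
            pairs[key].add(row["scientific_name"])
    # Unambiguous pairings go into the lookup; the rest need manual review.
    lookup = {c: next(iter(s)) for c, s in pairs.items() if len(s) == 1}
    ambiguous = {c: sorted(s) for c, s in pairs.items() if len(s) > 1}
    return lookup, ambiguous


def fill_scientific_names(rows, lookup):
    """Assign scientific names to rows that only report a common name."""
    filled = []
    for row in rows:
        new_row = dict(row)
        if not new_row["scientific_name"]:
            key = (new_row["common_name"] or "").strip().lower()
            new_row["scientific_name"] = lookup.get(key)
        filled.append(new_row)
    return filled


lookup, ambiguous = build_lookup(records)
filled = fill_scientific_names(records, lookup)
```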

Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known).

Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4.

After each stage of data cleaning, we performed manual spot checking to identify any issues.
Data from: A meta-analysis of butterfly structural colors: their color...
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 22, 2023
Cite
Rachel Thayer (2023). A meta-analysis of butterfly structural colors: their color range, distribution, and biological production [Dataset]. http://doi.org/10.5061/dryad.qnk98sfnx
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.qnk98sfnx
Dataset updated
Sep 22, 2023
Dataset provided by
University of California, Davis
Authors
Rachel Thayer
License
https://spdx.org/licenses/CC0-1.0.html
Description
Butterfly scales are among the richest natural sources of optical nanostructures, which produce structural color and iridescence. Several recurring nanostructure types have been described, such as ridge multilayers, gyroids, and lower lamina thin films. While the optical mechanisms of these nanostructure classes are known, their phylogenetic distributions and functional ranges have not been described in detail. In this Review, we examine a century of research on the biological production of structural colors, including their evolution, development, and genetic regulation. We also create a database of more than 300 optical nanostructures in butterflies and conduct a meta-analysis of the color range, abundance, and phylogenetic distribution of each nanostructure class. Butterfly structural colors are ubiquitous in short wavelengths but extremely rare in long wavelengths, especially red. In particular, blue wavelengths (around 450 nm) occur in more clades and are produced by more kinds of nanostructures than other hues. Nanostructure categories differ in prevalence, phylogenetic distribution, color range, and brightness. For example, lamina thin films are the least bright; perforated lumen multilayers occur most often but are almost entirely restricted to the family Lycaenidae; and 3D photonic crystals, including gyroids, have the narrowest wavelength range (from about 450 to 550 nm). We discuss the implications of these patterns in terms of nanostructure evolution, physical constraint, and relationships to pigmentary color. Finally, we highlight opportunities for future research, such as analyses of subadult and Hesperid structural colors and the identification of genes that directly build the nanostructures, with relevance for biomimetic engineering.

Methods Several intersecting approaches were used to search for articles, book chapters, and theses reporting nanostructures that produce structural colors in butterflies. First, Google Scholar searches were run with combinations of these keywords: structural color, butterfly, Lepidoptera, iridescence, scale, and Ghiradella. Database searches brought up tens of thousands of hits, many of which were not pertinent, so we used high-quality results to find cited and citing references and noted every species that was mentioned in connection with structural color, iridescence, or derived scale morphology. Finally, a database search was run on each species name that had been mentioned in any prior included reference. When a search on species name returned many results, it was searched again in combination with the keywords. If the search on species name returned no relevant results, we tried searches with only the genus name and checked for alternate nomenclature. We continued snowballing references and running database searches on species until we could no longer find any new taxa mentioned in connection with structural color. Criteria for inclusion were that the article must include either (1) reflectance measurements or (2) electron microscope images of non-transparent (i.e. colored) optical nanostructures in a butterfly. We also included studies which provided additional characterizations (e.g. absorption measurements, mathematical modeling, scatterometry) for structures that had been included on the basis of (1) or (2).

This literature review strategy yielded 187 included references which described 421 potential structures from 378 species, all of which are included in this dataset. Before further analysis, we secondarily excluded entries that were presented as non-photonic comparisons to structurally colored specimens. Only seven reported structures occurred outside adult wing scales, which were all multilayer broadband reflectors in the pupal cuticle (Neville, 1977; Steinbrecht, 1985; Steinbrecht et al., 1985). We therefore narrowed our focus to structures located in scales or bristles on the adult, which can be homologized and directly compared in subsequent analyses. After filtering, there was pairwise complete data on both color and morphology for 314 optical nanostructures from 287 species. Some species had multiple structural colors on different body parts (e.g. blue dorsal and green ventral wings in Cyanophrys remus and Albulina metallica; Biró et al., 2007).

To compare color between structures, we recorded the peak reflected wavelength (i.e. hue) and the percent reflectance at that wavelength (i.e. brightness) for each structural color. Due to iridescence, quantification of structural color is extremely sensitive to the measurement protocol, specifically illumination and detection angles, light source, reference sample, and spot size (Meadows et al., 2011). Spectroscopy methodologies were variable among the included studies making comparisons imperfect; nevertheless, the data are useful to show broad trends. When multiple spectra were available, we used the following rules for consistency. When reflectance was reported from more than one angle, the peak wavelength at the maximally reflective angle was used. If comparable reflectance data was reported from more than one study or from replicated specimens, we took their average. When reflectance data was found for both an isolated scale and the intact wing, both values were noted, but the intact wing reflectance was preferentially used in comparative analyses for consistency, because single-scale reflectance measures were uncommon. In cases where structures produced two reflectance peaks – as in Chrysozephyrus species with both a UV and a green peak (Imafuku, Hirose and Takeuchi, 2002) – the brighter peak was used in graphical summaries, but both were listed in the spreadsheet. Peak wavelengths were typically estimated by eye from graphs, which limited precision to a 5–10 nm window around the measured peak. This precision limit is similar to the magnitude of inter-individual variation (Imafuku, Gotoh and Takeuchi, 2002; Bálint et al., 2008). We also recorded percent reflectance at the maximally reflective wavelength (i.e. spectral intensity or ‘brightness’). When no reflectance spectra were available but a color image or a qualitative color descriptor (e.g. ‘blue’, ‘UV’) was given, the qualitative descriptor was recorded. 
Broadband reflectors have a similar reflectance intensity across many wavelengths, so the maximally reflecting wavelength is not a good summary of the reflector’s properties and may not be identifiable from a visual inspection of a graph. Therefore, for broadband reflectors, we only recorded a qualitative descriptor, such as ‘white’, ‘silver’, or ‘gold’. Additionally, some reddish lamina thin films that reflected in both violet and red, without a peak wavelength in either region, were handled as qualitatively ‘magenta’. Note that many reflectance spectra were likely influenced by co-occurring pigments as well as the nanostructures.
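Two of the consistency rules for multiple spectra (take the peak at the maximally reflective angle, then average across comparable studies or specimens) can be sketched as follows. The data structure and field names are hypothetical and the numbers are made up for illustration; this is not the author's R code.

```python
from statistics import mean


def peak_at_brightest_angle(spectra):
    """Among measurements at several angles, keep the most reflective one."""
    return max(spectra, key=lambda s: s["reflectance_pct"])


def summarise_structure(studies):
    """Average hue and brightness across comparable studies or specimens."""
    best = [peak_at_brightest_angle(spectra) for spectra in studies]
    return {
        "peak_nm": mean(b["peak_nm"] for b in best),
        "reflectance_pct": mean(b["reflectance_pct"] for b in best),
    }


# Illustrative values only: two studies of the same structural color.
studies = [
    [
        {"angle_deg": 0, "peak_nm": 450, "reflectance_pct": 30},
        {"angle_deg": 20, "peak_nm": 440, "reflectance_pct": 50},
    ],
    [
        {"angle_deg": 0, "peak_nm": 460, "reflectance_pct": 40},
    ],
]

summary = summarise_structure(studies)
```

Broadband reflectors would bypass this function entirely, since only a qualitative descriptor is recorded for them.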

To compare scale morphological modifications, we noted which kind of optical nanostructure was present. Generally, we followed the author’s conclusion as to which scale component caused the optical properties. If the author’s description was brief but a micrograph was provided, we assigned the structure to the same category as the well-studied examples that it most resembled. In a few cases when the proposed mechanism seemed questionable, we noted the explanation in the spreadsheet but dropped that structure from comparative analysis (for example, the proposed nanostructure was not present in the provided micrograph, or the mechanism was disputed across studies). Filled-in windows and crossrib air columns likely involve modifications to both the crossribs and microribs, and reflectance in these scales also requires the lower lamina; for simplicity, we have summarized them as crossrib bilayer structures.

Data was processed using R.
Coral restoration database – Dataset from Bostrom-Einarsson et al 2019 (NESP...
researchdata.edu.au
Updated Sep 2, 2020
Cite
McLeod, Ian; Smith, Adam; Hein, Margaux; Cook, Nathan; Bostrom-Einarsson, Lisa; Ceccarelli, Daniela (2020). Coral restoration database – Dataset from Bostrom-Einarsson et al 2019 (NESP TWQ 4.3, JCU) [Dataset]. https://researchdata.edu.au/coral-restoration-database-43-jcu/1425277
Explore at:
Dataset updated
Sep 2, 2020
Dataset provided by
eAtlas
Authors
McLeod, Ian; Smith, Adam; Hein, Margaux; Cook, Nathan; Bostrom-Einarsson, Lisa; Ceccarelli, Daniela
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Time period covered
Jan 1, 2017 - Jan 31, 2019
Area covered
Description
This dataset consists of a review of case studies and descriptions of coral restoration methods from four sources:
1) the primary literature (i.e. published peer-reviewed scientific literature),
2) grey literature (e.g. scientific reports and technical summaries from experts in the field),
3) online descriptions (e.g. blogs and online videos describing projects), and
4) an online survey targeting restoration practitioners (doi:10.5061/dryad.p6r3816).

Included are only those case studies which actively conducted coral restoration (i.e. at least one stage of scleractinian coral life-history was involved). This excludes indirect coral restoration projects, such as disturbance mitigation (e.g. predator removal, disease control etc.) and passive restoration interventions (e.g. enforcement of control against dynamite fishing or water quality improvement). It also excludes many artificial reefs, in particular if the aim was fisheries enhancement (i.e. fish aggregation devices), and if corals were not included in the method. To the best of our abilities, duplication of case studies was avoided across the four separate sources, so that each case in the review and database represents a separate project.

Methods:
More than 40 separate categories of data were recorded from each case study and entered into a database. These included data on
(1) the information source,
(2) the case study particulars (e.g. location, duration, spatial scale, objectives, etc.),
(3) specific details about the methods,
(4) coral details (e.g. genus, species, morphology),
(5) monitoring details, and
(6) the outcomes and conclusions.

Primary literature
Multiple search engines were used to achieve the most complete coverage of the scientific literature. First, the scientific literature was searched using Google Scholar with the keywords “coral* + restoration”. Because the field (and therefore search results) is dominated by transplantation studies, separate searches were then conducted for other common techniques using “coral* + restoration + [technique name]”. This search was further complemented by using the same keywords in ISI Web of Knowledge (search yield n=738). Studies that fulfilled our criteria for active coral restoration described above were then manually selected (final yield n=221). In those cases where a single paper describes several different projects or methods, these were split into separate case studies. Finally, prior reviews of coral restoration were consulted to obtain case studies from their reference lists.

Grey literature
While many reports appeared in the Google Scholar literature searches, The Nature Conservancy (TNC) database of reports for North American coastal restoration projects (http://projects.tnc.org/coastal/) was also searched. This was supplemented with reports listed in the reference lists of other papers, reports and reviews, and with reports found during the online searches (n=30).

Online records
Small-scale projects conducted without substantial input from researchers, academics, non-governmental organisations (NGO) or coral reef managers often do not result in formal written accounts of methods. To access this information, we conducted online searches of YouTube, Facebook and Google, using the search terms “Coral restoration”. The information provided in videos, blog posts and websites to describe further projects (n=48) was also used. Due to the unverified nature of such accounts, the data collected from these online-only records was limited compared to peer reviewed literature and surveys. At the minimum, the location, the methods used and reported outcomes or lessons learned were included in this review.

Online survey
To access information from projects not published elsewhere, an online survey targeting restoration practitioners was designed. The survey consisted of 25 questions querying restoration practitioners regarding projects they had undertaken under JCU human ethics H7218 (following the Australian National Statement on Ethical Conduct in Human Research, 2007). These data (n=63) are included in all calculations within this review, but are not publicly available to preserve the anonymity of participants. Although we encouraged participants to fill out a separate survey for each case study, it is possible that participants included multiple separate projects in a single survey, which may reduce the real number of case studies reported.

Data analysis
Percentages, counts and other quantifications from the database refer to the total number of case studies with data in that category. Case studies where data were lacking for the category in question, or lacked appropriate detail (e.g. reporting ‘mixed’ for coral genera), are not included in calculations. Many categories allowed multiple answers (e.g. coral species); these were split into separate records for calculations (e.g. coral species n). For this reason, absolute numbers may exceed the number of case studies in the database. However, percentages reflect the proportion of case studies in each category. We used the seven objectives outlined in [1] to classify the objective of each case study, with an additional two categories (‘scientific research’ and ‘ecological engineering’). We used Tableau to visualise and analyse the database (Desktop Professional Edition, version 10.5, Tableau Software). The data have been made available following the FAIR Guiding Principles for scientific data management and stewardship [2]. The data are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.p6r3816) and can be explored visually at https://public.tableau.com/views/CoralRestorationDatabase-Visualisation/Coralrestorationmethods?:embed=y&:display_count=yes&publish=yes&:showVizHome=no#1.
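The counting rules in this section can be illustrated with a short sketch: multi-answer fields are split into separate records, case studies with missing or ‘mixed’ entries are skipped, and percentages are taken over the remaining case studies. The field name, the semicolon separator, and the sample rows are assumptions made for illustration; the authors performed these calculations in Tableau.

```python
from collections import Counter

# Hypothetical case studies; "genera" may hold several answers.
case_studies = [
    {"id": 1, "genera": "Acropora; Pocillopora"},
    {"id": 2, "genera": "Acropora"},
    {"id": 3, "genera": "mixed"},
    {"id": 4, "genera": ""},
]


def summarise(cases, field, separator=";"):
    """Count answers per category and the share of usable case studies."""
    counts = Counter()
    usable = 0
    for case in cases:
        values = {v.strip() for v in case[field].split(separator) if v.strip()}
        # Skip cases with no data or without appropriate detail.
        if not values or values == {"mixed"}:
            continue
        usable += 1
        counts.update(values)
    percentages = {k: 100 * n / usable for k, n in counts.items()}
    return counts, percentages, usable


counts, percentages, usable = summarise(case_studies, "genera")
```

In this toy example the counts sum to three while only two case studies are usable, which is why absolute numbers can exceed the number of case studies even though each percentage remains a proportion of case studies.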

Limitations:
While our expanded search enabled us to avoid the bias from the more limited published literature, we acknowledge that using sources that have not undergone rigorous peer-review potentially introduces another bias. Many government reports undergo an informal peer-review; however, survey results and online descriptions may present a subjective account of restoration outcomes. To reduce subjective assessment of case studies, we opted not to interpret results or survey answers, instead only recording what was explicitly stated in each document [3, 4].

Defining restoration
In this review, active restoration methods are methods which reintroduce coral (e.g. coral fragment transplantation, or larval enhancement) or augment coral assemblages (e.g. substrate stabilisation, or algal removal), for the purposes of restoring the reef ecosystem. In the published literature and elsewhere, there are many terms that describe the same intervention. For clarity, we provide the terms we have used in the review, their definitions and alternative terms (see references). Passive restoration methods such as predator removal (e.g. crown-of-thorns starfish and Drupella control) have been excluded, unless they were conducted in conjunction with active restoration (e.g. macroalgal removal combined with transplantation).

Format:
The data is supplied as an Excel file with three separate tabs for 1) peer reviewed literature 2) grey literature, and 3) a description of the objectives from Hein et al. 2017. Survey responses have been excluded to preserve the anonymity of the respondents.

This dataset is a database that underpins a 2018 report and 2019 published review of coral restoration methods from around the world.
- Bostrom-Einarsson L, Ceccarelli D, Babcock R.C., Bayraktarov E, Cook N, Harrison P, Hein M, Shaver E, Smith A, Stewart-Sinclair P.J, Vardi T, McLeod I.M. 2018 - Coral restoration in a changing world - A global synthesis of methods and techniques, report to the National Environmental Science Program. Reef and Rainforest Research Centre Ltd, Cairns (63pp.).
- Review manuscript is currently under review.

Data Dictionary:
The Data Dictionary is embedded in the Excel spreadsheet. Comments are included in the column titles to aid interpretation, and/or refer to additional information tabs. For more information on each column, open the red triangle [located top right of cell].

References:
1. Hein MY, Willis BL, Beeden R, Birtles A. The need for broader ecological and socioeconomic tools to evaluate the effectiveness of coral restoration programs. Restoration Ecology. Wiley/Blackwell (10.1111); 2017;25: 873–883. doi:10.1111/rec.12580
2. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016 3. Nature Publishing Group; 2016;3: 160018. doi:10.1038/sdata.2016.18
3. Miller RL, Marsh H, Cottrell A, Hamann M. Protecting Migratory Species in the Australian Marine Environment: A Cross-Jurisdictional Analysis of Policy and Management Plans. Front Mar Sci. Frontiers; 2018;5: 211. doi:10.3389/fmars.2018.00229
4. Ortega-Argueta A, Baxter G, Hockings M. Compliance of Australian threatened species recovery plans with legislative requirements. Journal of Environmental Management. Elsevier; 2011;92: 2054–2060.

Data Location:

This dataset is filed in the eAtlas enduring data repository at: data\2018-2021-NESP-TWQ-4\4.3_Best-practice-coral-restoration
Repository Analytics and Metrics Portal (RAMP) 2017 data
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 27, 2021
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2017 data [Dataset]. http://doi.org/10.5061/dryad.r7sqv9scf
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.r7sqv9scf
Dataset updated
Jul 27, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates usage and performance data of institutional repositories (IR). The data are a subset of data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.

Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
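A minimal sketch of such a flagging step is shown below. It assumes the decision is made from the file extension in the URL path, and the extension list is a hypothetical example; the source does not specify how RAMP actually distinguishes content files from HTML pages.

```python
import posixpath
from urllib.parse import urlparse

# Assumed, illustrative list of non-HTML content file extensions.
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xlsx", ".zip"}


def citable_content(url):
    """Return "Yes" if the URL path ends in a content-file extension, else "No"."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"


# Hypothetical repository URLs.
flag_pdf = citable_content("https://repo.example.edu/bitstream/1/thesis.PDF?sequence=1")
flag_html = citable_content("https://repo.example.edu/handle/1/2")
```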

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

1. Filter data to only include rows where "citableContent" is set to "Yes."
2. Sum the value of the "clicks" field on these rows.
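The CCD calculation can be sketched in a few lines of Python. The field names follow the field list in this documentation, but the sample rows are made up and the ISO date format (YYYY-MM-DD) is an assumption about how dates are stored.

```python
from datetime import date

# Made-up page-level rows using the documented field names.
sample_rows = [
    {"url": "https://repo.example.edu/a.pdf", "clicks": 3, "citableContent": "Yes", "date": "2017-01-05"},
    {"url": "https://repo.example.edu/handle/1", "clicks": 7, "citableContent": "No", "date": "2017-01-05"},
    {"url": "https://repo.example.edu/b.csv", "clicks": 2, "citableContent": "Yes", "date": "2017-01-20"},
    {"url": "https://repo.example.edu/a.pdf", "clicks": 4, "citableContent": "Yes", "date": "2017-02-01"},
]


def citable_content_downloads(rows, start, end):
    """Sum clicks on citable-content URLs within an inclusive date range."""
    total = 0
    for row in rows:
        day = date.fromisoformat(row["date"])
        if start <= day <= end and row["citableContent"] == "Yes":
            total += int(row["clicks"])
    return total


january_ccd = citable_content_downloads(
    sample_rows, date(2017, 1, 1), date(2017, 1, 31)
)
```

Clicks on the HTML page and on the February row are excluded, so only the two January content-file rows contribute to the total.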

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Data from: R code and dataset to "Monetizing Spillover Effects in the...
zenodo.org
producciocientifica.uv.es
+2more
zip
Updated Jul 30, 2021
Cite
Juan D. Montoro-Pons; María Caballer-Tarazona; Manuel Cuadrado-García (2021). R code and dataset to "Monetizing Spillover Effects in the Creative Industries: the Impact of Live Music Performances on Youtube Searches" [Dataset]. http://doi.org/10.5281/zenodo.5091809
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5091809
Dataset updated
Jul 30, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juan D. Montoro-Pons; María Caballer-Tarazona; Manuel Cuadrado-García
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
Content:

The script main_script.R includes code to run a regression discontinuity (RD) design, along with validation and falsification of the estimated results.

The folder data contains two files:

bands_2016_2019.csv: a dataset of performers with additional information for each one.

festivals_2016_2019.csv: a dataset of video search activity (as retrieved from Google Trends) for performers in file bands_2016_2019.csv

The folder source contains two additional R scripts:

data_preparation.R: generates the long dataset used to estimate RD effects

status_simulation.R: randomly assigns treatment status to performers and estimates RD effects. Note that this may take a long time to run. Parallel code is used: the number of cores has been set to 4.

The folder simulation_results contains simulated data after running the script status_simulation.R.
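For readers unfamiliar with regression discontinuity, the idea behind the scripts listed above can be illustrated with a small sketch. This is not the authors' R code: it is a simplified sharp RD in Python that fits a separate straight line on each side of a cutoff and reports the jump between the two fitted lines at the cutoff. The function names, the cutoff at 0, and the optional bandwidth are all assumptions made for illustration.

```python
def _fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_estimate(running, outcome, cutoff=0.0, bandwidth=None):
    """Estimate the jump in `outcome` at `cutoff` of the running variable.

    Observations at or above the cutoff are treated as 'after'; each side
    needs at least two distinct running-variable values.
    """
    centered = [(x - cutoff, y) for x, y in zip(running, outcome)]
    if bandwidth is not None:
        centered = [(x, y) for x, y in centered if abs(x) <= bandwidth]
    left = [(x, y) for x, y in centered if x < 0]
    right = [(x, y) for x, y in centered if x >= 0]
    intercept_left, _ = _fit_line([x for x, _ in left], [y for _, y in left])
    intercept_right, _ = _fit_line([x for x, _ in right], [y for _, y in right])
    # With x centered on the cutoff, each intercept is the fitted value at the cutoff.
    return intercept_right - intercept_left
```

A falsification exercise in the spirit of status_simulation.R would repeat such an estimate many times under randomly reassigned treatment status and compare the real estimate with that placebo distribution.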
Data from: CanFlyet: Habitat zone and diet trait dataset for Diptera species...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Aug 1, 2025
Samantha Majoros (2025). CanFlyet: Habitat zone and diet trait dataset for Diptera species of Canada and Greenland [Dataset]. http://doi.org/10.5061/dryad.fqz612jwx
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.fqz612jwx
Dataset updated
Aug 1, 2025
Dataset provided by
Dryad Digital Repository
Authors
Samantha Majoros
Time period covered
May 28, 2024
Description
True flies (Diptera) are an ecologically important group that play a role in agriculture, public health, and ecosystem functioning. As researchers continue to investigate this order, it is beneficial to link the growing occurrence data to biological traits. However, large-scale ecological trait data are not readily available for fly species. While some databases and datasets include fly data, many ecologically relevant traits for taxa of interest are not included. In this dataset we provide ecological traits (habitat and diet) for fly species of Canada and Greenland having occurrence records on the Barcode of Life Data Systems (BOLD). Trait data were compiled based on literature searches conducted from April 2021 - April 2024 and assigned at the lowest taxonomic level possible. This dataset contains trait information for 990 taxa: 981 records identified to the species level, and 9 taxa only identified to the genus level. The species in the dataset are found across 380 genera, 34 subfami...

The fly species were chosen for inclusion in this dataset by first downloading data for Diptera from Canada and Greenland from BOLD directly into R using BOLD's application programming interface (API) on June 24th, 2021. The records were filtered based on the requirements outlined in Majoros et al. (2023), and the remaining species were chosen for analysis and inclusion in this dataset. Additional species from Greenland were chosen based on occurrence records from GBIF (GBIF.org (June 24th, 2021)) GBIF Occurrence Download (https://doi.org/10.15468/dl.mk52hp) and included in the dataset. The biological traits for each species were determined and assigned through literature searches conducted from April 2021 - April 2024. Through the Omni Academic search tool available through the University of Guelph and Google Scholar, traits were found using the following search terms: trait AND “Taxonomic name”, habitat AND “Taxonomic name”, diet AND “Taxonomic name”, “Feeding mode” AND “Taxonomic nam...

The dataset is provided in three formats. Only the xlsx file requires Microsoft Excel to open. All files contain the same dataset and information.

# CanFlyet: Habitat Zone and Diet Trait Dataset for Diptera Species of Canada and Greenland

Version history

Update June 10, 2024

Made several formatting changes and corrected errors in the dataset. A taxonRank column was added and the referenceNumber column was renamed to “Number_of_references_consulted”.

Update April 29, 2024

Based on feedback from peer reviewers, the trait assignments and categories have been updated. Additional search terms were added, and an additional extensive literature search was conducted resulting in over 400 new references. Trait categories were redefined. “Phytophagous” and "Omnivore" were removed as categories, and “Leaf/Root/Stem Feeding”, “Nectar/Pollen/Honeydew Feeding”, “Detritus and Algae Feeding”, “Kleptoparasitic” and “Ployphagous” were added. “Fungivore” has been renamed to “Mycophagous”. The “Unclear” category has been added for taxa where there is not enough information to make a trait assignment. Taxa may now also have two trait...
Raw data of Articles that were produced from PubMed search using the...
figshare.com
datasetcatalog.nlm.nih.gov
txt
Updated Nov 9, 2023
Jason Webb (2023). Raw data of Articles that were produced from PubMed search using the included search string within the Journals listed in the Google metrics. [Dataset]. http://doi.org/10.6084/m9.figshare.24539458.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24539458.v1
Dataset updated
Nov 9, 2023
Dataset provided by
figshare
Authors
Jason Webb
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Stigmatizing language, or non-person-centered language (PCL), has been shown to impact patients negatively, especially in the case of obesity. This has led many associations, such as the American Medical Association (AMA) and the International Committee of Medical Journal Editors (ICMJE), to enact guidelines prohibiting the use of stigmatizing language in medical research. In 2018, the AMA adopted PCL guidelines, including a specific obesity amendment that all researchers should adhere to. Our primary objective was to determine whether PCL guidelines specific to obesity have been followed in the most-interacted-with sports medicine journals. We searched PubMed for obesity-related articles published between 2019 and 2022 in the top ten most-interacted-with sports medicine journals based on Google Metrics data. Each article was searched for a predetermined list of stigmatizing and non-PCL terms/language.
CUP- Convolutional Neural Network Training dataset
data.mendeley.com
Updated Apr 23, 2024
+ more versions
taraneh saniei (2024). CUP- Convolutional Neural Network Training dataset [Dataset]. http://doi.org/10.17632/7md9bgd4tg.3
Explore at:
Unique identifier
https://doi.org/10.17632/7md9bgd4tg.3
Dataset updated
Apr 23, 2024
Authors
taraneh saniei
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset has been provided for training a convolutional neural network to measure some selected visual design principles. These visual principles are related to the preference matrix adopted from Kaplan and Kaplan (1989). The dataset consists of images that illustrate the considered variables as clearly as possible, and a CNN trained on them can be used for analysis in the fields of art and architecture. CUP is the short form of contrast, unity, and proportion. It covers 3 types of complementary color contrasts, 2 types of warm/cold contrasts, light/dark contrast, similarity in color and variety in form, proportion and similarity in form and variety in color. The dataset is a collection of images found through searches on Google and Pinterest, and some of them may be subject to copyright. For such images, the copyright belongs to the image owners.
Steering_wheel_detection Dataset
universe.roboflow.com
zip
Updated Aug 3, 2023
ManipulatorDataset (2023). Steering_wheel_detection Dataset [Dataset]. https://universe.roboflow.com/manipulatordataset/steering_wheel_detection
Explore at:
Available download formats: zip
Dataset updated
Aug 3, 2023
Dataset authored and provided by
ManipulatorDataset
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Steeringwheel Bounding Boxes
Description
Steering-Wheel-Detection

Steering-Wheel-Detection was created by CAIR, IIT Mandi, with the goal of building a model to detect steering wheels for the Manipulator Task. It contains 960 images.

Data collection

We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.

Fair use

This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes (adapted from Christopher Thomas).
Occurrence Records of Tropical Asian Butterflies: 1970 - 2024
figshare.com
csv
Updated Jun 3, 2025
Emily Jones; Yu Hin Yau; Eugene Yau; Timothy Carlton Bonebrake (2025). Occurrence Records of Tropical Asian Butterflies: 1970 - 2024 [Dataset]. http://doi.org/10.6084/m9.figshare.25037645.v8
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.25037645.v8
Dataset updated
Jun 3, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Emily Jones; Yu Hin Yau; Eugene Yau; Timothy Carlton Bonebrake
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Asia
Description
Occurrence records of tropical Asia's butterflies collated from online databases and published literature. This dataset consists of 730,190 occurrence records for 3,752 species.

Methods:

We collected occurrence records for tropical Asian Papilionoidea (Lepidoptera: Nymphalidae, Papilionidae, Lycaenidae, Pieridae, Hesperiidae, Riodinidae) (-11.426 – 35.64 N, 67.588 – 174.990 E) for the years 1970-present. Records were extracted from the Global Biodiversity Information Facility on 15 April 2024. We included only presence records derived from human observation, preserved specimens, material samples, or literature, provided they had associated coordinates. We omitted all records with >100,000 m coordinate uncertainty, so-called “fuzzy” taxon matches, and records for which the scientific name was missing or incomplete, unless nomenclature could be extracted using a BOLD identifier (boldsystems.org/). This resulted in a final number of GBIF records = 651,285. Note to commercial users: 432,340 GBIF records included within this dataset have a CC BY-NC 4.0 license. See column Z for data license types and/or visit the GBIF-derived dataset (https://doi.org/10.15468/dl.9wyfb6) to rerun the query and filter data according to their licenses.

We also extracted data from the B2D2 Database of Butterflies for Borneo provided by JKH/the Darwin Initiative (n = 19,417) and a dataset for Bangladesh provided by SC (Chowdhury et al. 2021) (n = 18,278), and unpublished datasets from coauthors AN, DJL, LVV, TK, and YB (n = 13,993). To fill geographic gaps when all of these records were plotted, we conducted targeted searches of the published literature on Google Scholar (details below), resulting in an additional 27,217 records. Pre-1970 data were omitted where possible from all sources. Final binomial synonym harmonization, validation, and authority assignment were conducted by DJL using a taxonomic reference prepared by Gerardo Lamas (Lamas, 2015. Catalogue of the butterflies (Papilionidae). Available from the author.).

For geographic regions with relatively few GBIF records (e.g., China, Myanmar, Thailand) and for species with < 10 records, we conducted targeted literature searches using Google Scholar in English, simplified Chinese, and traditional Chinese (genus OR genus + species + country name). For all species records in published sources, we extracted coordinates, locality name, locality type (e.g., exact coordinates, city, national park, island, or province), country, and year of record (where available). If exact coordinates were not provided by the source, we used Google Earth Pro (v7.3.6.9345) to estimate the locality centroid for any record provided at the province level or below (e.g., national park or city). For records from islands ≤ 100 km in length or diameter (e.g., localities within the Philippines and Indonesia), we estimated the island or archipelago centroid. If a range of coordinates was provided (e.g., records from The Butterflies of Vietnam), coordinates within the range were chosen haphazardly by selecting a point within the range provided on Google Earth Pro. Data sources for all records are provided in the reference column (C) in Occurrence Records of Tropical Asian Butterflies: 1970-2024 and alphabetically in Data Sources for: Occurrence Records of Tropical Asian Butterflies: 1970-2024.
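The numeric filters in the methods above (bounding box, coordinate uncertainty, and start year) can be sketched as a simple record filter, and the per-source record counts can be checked against the stated total. This is an illustrative Python sketch, not the authors' pipeline: the field names follow Darwin Core terms used by GBIF (decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, year) and are an assumption about the column layout, as is the choice to keep records whose uncertainty or year is missing.

```python
# Study extent and thresholds quoted in the description.
LAT_MIN, LAT_MAX = -11.426, 35.64
LON_MIN, LON_MAX = 67.588, 174.990
MAX_UNCERTAINTY_M = 100_000
FIRST_YEAR = 1970

# Record counts per source, as stated in the description.
SOURCE_COUNTS = {
    "GBIF": 651_285,
    "B2D2 Borneo": 19_417,
    "Bangladesh (Chowdhury et al. 2021)": 18_278,
    "Unpublished coauthor datasets": 13_993,
    "Targeted literature searches": 27_217,
}


def keep_record(rec):
    """Return True if a record passes the numeric filters described above."""
    lat = rec.get("decimalLatitude")
    lon = rec.get("decimalLongitude")
    if lat is None or lon is None:
        return False  # records must have associated coordinates
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return False
    uncertainty = rec.get("coordinateUncertaintyInMeters")
    if uncertainty is not None and uncertainty > MAX_UNCERTAINTY_M:
        return False
    year = rec.get("year")
    if year is not None and year < FIRST_YEAR:
        return False  # pre-1970 data omitted where a year is available
    return True


# The five sources add up to the 730,190 records reported for the dataset.
TOTAL_RECORDS = sum(SOURCE_COUNTS.values())
```

The taxonomic filters (fuzzy matches, missing or incomplete scientific names) are not covered here because they depend on GBIF match metadata rather than simple thresholds.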
Human Variant Annotation Datasets
console.cloud.google.com
Updated Jul 21, 2023
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=de (2023). Human Variant Annotation Datasets [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/human-variant-annotation-public?hl=de
Explore at:
Dataset updated
Jul 21, 2023
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Google (http://google.com/)
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
These datasets are important to genomics researchers because they characterize several aspects of what the scientific community has learned to date about human sequence variants. Making this human annotation data freely available in GCP will enable researchers to focus less on data movement and management tasks associated with procuring this data and instead make immediate use of the data to better understand the clinical relevance of particular variants, such as disease-causing or protective variants (ClinVar), search a catalog of SNPs that have been identified in the human genome (dbSNP), and discover how frequently a particular variant occurs across the human population (1000Genomes, ESP, ExAC, gnomAD). This human annotation dataset contains a mirror of the original Variant Call Format (VCF) files from NCBI, NHLBI Exome Sequencing Project (ESP) and Ensembl as Google Cloud Storage (GCS) objects. In addition, these human sequence variants have also been translated into a particular variant table format and made available in Google BigQuery, giving researchers the ability to use cloud technology and code repositories such as the Verily Life Sciences Annotation Toolkit to perform analyses in parallel. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery. This public dataset is also hosted in Google Cloud Storage and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage.
PubMed Central
console.cloud.google.com
Updated Mar 12, 2026
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=it (2026). PubMed Central [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/pmc?hl=it
Explore at:
Dataset updated
Mar 12, 2026
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Google (http://google.com/)
Description
PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH NLM). This dataset contains open access articles available under Creative Commons. The collection includes article metadata, full text content, author information, publication dates, journal details, and licensing information. The data has been indexed with vector embeddings to support semantic search capabilities, enabling more sophisticated analysis and discovery of related research. For detailed information about PubMed Central and its open access collection, see the PMC documentation. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery.

Google Trends - International

459 scholarly articles cite this dataset (View in Google Scholar)
