Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of "10 blue links" has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive. The Internet Archive has been keeping snapshots, and the respective HTML versions, of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver for browser automation to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot.
The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a folder of supporting files. For file naming, we concatenate the initial of the search engine (e.g., G for Google) with the capture's timestamp. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007; the first is identified by "G20070330145203".
Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. An accompanying graphic represents the diversity of captures by year and search engine (Google and Bing).
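The capture-naming scheme above is easy to parse programmatically. The following is a minimal Python sketch (not part of the dataset) that splits a capture identifier into its engine initial, timestamp, and optional repeat counter; mapping "B" to Bing is our assumption based on the two engines covered.

import re

# Hypothetical helper: parse capture IDs such as "G20070330145203-1".
# "B" for Bing is an assumption; the description only shows "G" for Google.
CAPTURE_RE = re.compile(r"^(?P<engine>[GB])(?P<ts>\d{14})(?:-(?P<seq>\d+))?$")

def parse_capture_id(capture_id):
    m = CAPTURE_RE.match(capture_id)
    if m is None:
        raise ValueError("not a capture ID: %r" % capture_id)
    return {
        "engine": {"G": "Google", "B": "Bing"}[m.group("engine")],
        "timestamp": m.group("ts"),            # YYYYMMDDHHMMSS
        "sequence": int(m.group("seq") or 0),  # 0 = first capture at this timestamp
    }

print(parse_capture_id("G20070330145203-1"))
# {'engine': 'Google', 'timestamp': '20070330145203', 'sequence': 1}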
You can check the field descriptions in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.
Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.
Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.
Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.
This database is available in JSON format only.
You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL of the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times (approximately half of the total number of topics). The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; the location is saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
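As an illustration of this collection procedure, the sketch below queries an autocomplete endpoint for a root term and its a-to-z extensions. It is a minimal sketch, not the authors' collection code: the unofficial Google suggest endpoint and its response shape ([query, [suggestions, ...]]) are assumptions, and Bing would require its own endpoint.

import string
import requests

SUGGEST_URL = "https://suggestqueries.google.com/complete/search"  # unofficial endpoint

def suggestions(query):
    # Return the suggestion list for one query input.
    resp = requests.get(SUGGEST_URL, params={"client": "firefox", "q": query})
    resp.raise_for_status()
    return resp.json()[1]

root = "democrats"
inputs = [root] + ["%s %s" % (root, c) for c in string.ascii_lowercase]
for query_input in inputs:                        # up to 27 inputs per root term
    for rank, s in enumerate(suggestions(query_input), start=1):
        print(root, query_input, s, rank)         # up to ~270 suggestions per engine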
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which collects all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
https://dataintelo.com/privacy-and-policy
The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies is transforming the way search engines operate, offering more personalized and efficient search results and thereby further propelling market growth.
One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.
Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.
The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.
The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.
Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.
Arabic news data was collected using web scraping techniques from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news surfaced by the Google search engine, and from various other sources.
UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.
UltimateArabicPrePros: A file containing the same data as the first file but after pre-processing, reduced to about 188,000 text documents; stop words, non-Arabic words, symbols, and numbers have been removed so that this file is ready for direct use in various Arabic natural language processing tasks such as text classification.
1- Sample: This folder contains samples of the results of web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:
Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.
2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification techniques. The first version (Original) contains the raw data without any pre-processing, so its number of tokens is very high. In the second version (Original_without_Stop) the data was cleaned by removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of tokens is greatly reduced. In the third version (Original_with_Stem) the data was cleaned and a text stemming technique was applied to remove affixes that might affect the accuracy of the results and to obtain the word roots. In the fourth version (Original_Without_Stop_Stem) all pre-processing techniques (data cleaning, stop-word removal, and text stemming) were applied, so the number of tokens in this version is the lowest among all releases.
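For readers who want to replicate the pre-processing steps described above, the following is a minimal sketch, assuming NLTK's Arabic stop-word list and ISRI stemmer as stand-ins; the dataset authors' exact tools are not stated.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)
AR_STOPS = set(stopwords.words("arabic"))
STEMMER = ISRIStemmer()

def preprocess(text):
    # Keep only Arabic-script characters and whitespace
    # (drops Latin characters, Western digits, and most symbols).
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    tokens = [t for t in text.split() if t not in AR_STOPS]  # stop-word removal
    return " ".join(STEMMER.stem(t) for t in tokens)         # reduce words to roots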
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
- compare how much data is in institutional vs. domain-specific vs. federal platforms;
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain-specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data”, and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories.
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset.
Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package for our article The State of Serverless Applications: Collection, Characterization, and Community Consensus provides everything required to reproduce all results for the following three studies:
Serverless Application Collection
We collect descriptions of serverless applications from open-source projects, academic literature, industrial literature, and scientific computing.
Open-source Applications
As a starting point, we used an existing data set on open-source serverless projects from this study. We removed small and inactive projects based on the number of files, commits, contributors, and watchers. Next, we manually filtered the resulting data set to include only projects that implement serverless applications. We provide a table containing all projects that remained after the filtering alongside the notes from the manual filtering.
Academic Literature Applications
We based our search on an existing community-curated dataset on literature for serverless computing consisting of over 180 peer-reviewed articles. First, we filtered the articles based on title and abstract. In a second iteration, we filtered out any articles that implement only a single function for evaluation purposes or do not include sufficient detail to enable a review. As the authors were familiar with some additional publications describing serverless applications, we contributed them to the community-curated dataset and included them in this study. We provide a table with our notes from the manual filtering.
Scientific Computing Applications
Most of these scientific computing serverless applications are still at an early stage, and therefore little public data is available. One of the authors was employed at the German Aerospace Center (DLR) at the time of writing, which allowed us to collect information about several projects at DLR that are either currently moving to serverless solutions or are planning to do so. Additionally, an application from the German Electron Synchrotron (DESY) could be included. For each of these scientific computing applications, we provide a document containing a description of the project and the names of the contacts who provided information for the characterization of these applications.
Collection of serverless applications
Based on the previously described methodology, we collected a diverse dataset of 89 serverless applications from open-source projects, academic literature, industrial literature, and scientific computing. This dataset can be found in Dataset.xlsx.
Serverless Application Characterization
As previously described, we collected 89 serverless applications from four different sources. Subsequently, two randomly assigned reviewers out of seven available reviewers characterized each application along 22 characteristics in a structured collaborative review sheet. The characteristics and potential values were defined a priori by the authors and iteratively refined, extended, and generalized during the review process. The initial moderate inter-rater agreement was followed by a discussion and consolidation phase, where all differences between the two reviewers were discussed and resolved. The six scientific applications were not publicly available and therefore characterized by a single domain expert, who is either involved in the development of the applications or in direct contact with the development team.
Initial Ratings & Interrater Agreement Calculation
The initial reviews are available as a table, where every application is characterized along the 22 characteristics. A single value indicates that both reviewers assigned the same value, whereas a value of the form "[Reviewer 2] A | [Reviewer 4] B" indicates that, for this characteristic, reviewer two assigned the value A, whereas reviewer four assigned the value B.
Our script for the calculation of the Fleiss' kappa score based on this data is also publicly available. It requires the Python packages pandas and statsmodels. It does not require any input and assumes that the file Initial Characterizations.csv is located in the same folder. It can be executed as follows:
python3 CalculateKappa.py
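For reference, a minimal sketch of a Fleiss' kappa computation with statsmodels is shown below; the toy ratings matrix stands in for the real per-characteristic reviewer assignments that CalculateKappa.py reads from Initial Characterizations.csv.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: 4 applications rated by 2 reviewers on one characteristic.
ratings = np.array([
    ["HTTP", "HTTP"],    # agreement
    ["Queue", "HTTP"],   # conflict (resolved later during consolidation)
    ["Timer", "Timer"],
    ["HTTP", "HTTP"],
])
counts, _ = aggregate_raters(ratings)   # -> (subjects x categories) count table
print(round(fleiss_kappa(counts), 3))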
Results Including Unknown Data
In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find.
For six characteristics, many applications were assigned the ''Unknown'' value, i.e., the reviewers were not able to determine the value of this characteristic. Therefore, we excluded these characteristics from this study. For the remaining characteristics, the percentage of ''Unknowns'' ranges from 0–19% with two outliers at 25% and 30%. These ''Unknowns'' were excluded from the percentage values presented in the article. As part of our replication package, we provide the raw results for each characteristic including the ''Unknown'' percentages in the form of bar charts.
The script for the generation of these bar charts is also part of this replication package. It uses the Python packages pandas, numpy, and matplotlib. It does not require any input and assumes that the file Dataset.csv is located in the same folder. It can be executed as follows:
python3 GenerateResultsIncludingUnknown.py
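A minimal sketch of one such bar chart is given below; the column name "Trigger Type" is illustrative, not necessarily one of the 22 characteristics.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Dataset.csv")
# Share of each value for one characteristic, ''Unknown'' included.
shares = df["Trigger Type"].value_counts(normalize=True) * 100
ax = shares.plot(kind="bar")
ax.set_ylabel("% of applications")
plt.tight_layout()
plt.savefig("trigger_type.png")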
Final Dataset & Figure Generation
Following the discussion and consolidation phase described above, we were able to resolve all conflicts, resulting in a collection of 89 applications described by 18 characteristics. This dataset is available here: link
The script to generate all figures shown in the chapter "Serverless Application Characterization" can be found here. It does not require any input but assumes that the file Dataset.csv is located in the same folder. It uses the Python packages pandas, numpy, and matplotlib. It can be executed as follows:
python3 GenerateFigures.py
Comparison Study
To identify existing surveys and datasets that also investigate one of our characteristics, we conducted a literature search using Google as our search engine, as we were mostly looking for grey literature. We used the following search term:
("serverless" OR "faas") AND ("dataset" OR "survey" OR "report") after: 2018-01-01
This search term looks for any combination of either serverless or faas alongside any of the terms dataset, survey, or report. We further limited the search to articles published after 2017, as serverless is a fast-moving field and any older studies are therefore likely outdated already. This search term resulted in a total of 173 search results. In order to validate whether using only a single search engine is sufficient, and whether the search term is broad enough, we
AG is a collection of more than 1 million news articles. The news articles have been gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000, and the total number of testing samples is 7,600.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ag_news_subset', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
https://creativecommons.org/publicdomain/zero/1.0/
By dbpedia_14 (From Huggingface) [source]
The DBpedia Ontology Classification Dataset, known as dbpedia_14, is a comprehensive and meticulously constructed dataset containing a vast collection of text samples. These samples have been expertly classified into 14 distinct and non-overlapping classes. The dataset draws its information from the highly reliable and up-to-date DBpedia 2014 knowledge base, ensuring the accuracy and relevance of the data.
Each text sample in this extensive dataset consists of various components that provide valuable insights into its content. These components include a title, which succinctly summarizes the main topic or subject matter of the text sample, and content that comprehensively covers all relevant information related to a specific topic.
To facilitate effective training of machine learning models for text classification tasks, each text sample is further associated with a corresponding label. This categorical label serves as an essential element for supervised learning algorithms to classify new instances accurately.
Furthermore, this exceptional dataset is part of the larger DBpedia Ontology Classification Dataset with 14 Classes (dbpedia_14). It offers numerous possibilities for researchers, practitioners, and enthusiasts alike to conduct in-depth analyses ranging from sentiment analysis to topic modeling.
Aspiring data scientists will find great value in utilizing this well-organized dataset for training their machine learning models. Although specific details about train.csv and test.csv files are not provided here due to their dynamic nature, they play pivotal roles during model training and testing processes by respectively providing labeled training samples and unseen test samples.
Lastly, it's worth mentioning that users can refer to the included classes.txt file within this dataset for an exhaustive list of all 14 classes used in classifying these diverse text samples accurately.
Overall, with its wealth of carefully curated textual data across multiple domains and precise class labels assigned based on well-defined categories derived from the DBpedia 2014 knowledge base, the DBpedia Ontology Classification Dataset (dbpedia_14) proves instrumental in advancing research efforts related to natural language processing (NLP), text classification, and other related fields.
- Text classification: The DBpedia Ontology Classification Dataset can be used to train machine learning models for text classification tasks. With 14 different classes, the dataset is suitable for various classification tasks such as sentiment analysis, topic classification, or intent detection.
- Ontology development: The dataset can also be used to improve or expand existing ontologies. By analyzing the text samples and their assigned labels, researchers can identify missing or incorrect relationships between concepts in the ontology and make improvements accordingly.
- Semantic search engine: The DBpedia knowledge base is widely used in semantic search engines that aim to provide more accurate and relevant search results by understanding the meaning of user queries and matching them with structured data. This dataset can help in training models for improving the performance of these semantic search engines by enhancing their ability to classify and categorize information accurately based on user queries.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: train.csv
| Column name | Description |
|:--------------|:---------------------------------------------------------------------------------------------------------|
| label | The class label assigned to each text sample. (Categorical) |
| title | The heading or name given to each text sample, providing some context or overview of its content. (Text) |
File: test.csv
| Column name | Description |
|:--------------|:-----------------------...
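A minimal loading sketch, assuming the conventional header-less CSV layout with 1-based labels (the description above does not pin this down):

import pandas as pd

# classes.txt lists the 14 class names, one per line (per the description).
classes = [line.strip() for line in open("classes.txt", encoding="utf-8")]
train = pd.read_csv("train.csv", names=["label", "title", "content"])
train["class_name"] = train["label"].map(lambda i: classes[i - 1])
print(train[["class_name", "title"]].head())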
The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by Yandex LLC, Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled. Pivot words are words with special characteristics that are used to estimate the number of words in corpora. Pivot words used for the estimation of the number of words indexed by GSE should meet the following criteria:
1) frequency of occurrence between 10 and 100;
2) do not coincide with regular words in another language;
3) longer than 6 letters;
4) not of international origin;
5) not foreign loanwords;
6) not proper names of any kind;
7) not headword forms;
8) contain only basic Latin letters;
9) not specific to a particular domain or time period;
10) do not coincide with variants of other words when diacritics are removed;
11) not words that, when commonly misspelled, coincide with words in other languages.
The low frequency of pivot words is crucial for treating the count of document matches reported by a GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published in https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf.
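The estimation logic behind the pivot-word method can be summarized in a few lines. This is a sketch of the scaling idea only, with illustrative numbers; the reference corpus size and per-word counts below are assumptions, not values from the resource.

from statistics import median

W = 1_000_000_000   # words in a reference corpus of known size (assumption)
pivots = [          # (occurrences in reference corpus, document matches from GSE)
    (12, 3_400), (47, 15_200), (88, 24_100),
]
# Each pivot scales the reference corpus by matches/frequency; the median
# over all pivots damps outliers caused by noisy match counts.
estimates = [W * m / f for f, m in pivots]
print("estimated words indexed: %.0f" % median(estimates))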
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper titled "The Language of Sound Search: Examining User Queries in Audio Search Engines." The study investigates user-generated textual queries within the context of sound search engines, which are commonly used for applications such as foley, sound effects, and general audio retrieval.
The paper addresses the gap in current research regarding the real-world needs and behaviors of users when designing text-based audio retrieval systems. By analyzing search queries collected from two sources — a custom survey and Freesound query logs — the study provides insights into user behavior in sound search contexts. Our findings reveal that users tend to formulate longer and more detailed queries when not constrained by existing systems, and that both survey and Freesound queries are predominantly keyword-based.
This dataset contains the raw data collected from the survey and annotations of Freesound query logs.
The dataset includes the following files:
participants.csv
Contains data from the survey participants. Columns:
- id: A unique identifier for each participant.
- fluency: Self-reported English language proficiency.
- experience: Whether the participant has used online sound libraries before.
- passed_instructions: Boolean value indicating whether the participant advanced past the instructions page in the survey.
annotations.csv
Contains annotations of the survey responses, detailing the participants' interaction with the sound search tasks. Columns:
- id: A unique identifier for each annotation.
- participant_id: Links to the participant's ID in participants.csv.
- stimulus_id: Identifier for the stimulus presented to the participant (audio, image, or text description).
- stimulus_type: The type of stimulus (audio, image, text).
- audio_result_id: Identifier for the hypothetical audio result presented during the search task.
- query1: Initial search query submitted based on the stimulus.
- query2: Refined search query after seeing the hypothetical search result.
- aspects1: Aspects considered important when formulating the initial query.
- aspects2: Aspects considered important when refining the query.
- result_relevance: Participant's rating of the hypothetical search result's relevance.
- time: Time taken to complete the search task.
freesound_queries_annotated.csv
Contains annotated Freesound search queries. Columns:
- query: Text of the search query submitted to Freesound.
- count: The number of times the specific query was submitted.
- topic: Annotated topic of the query, based on an ontology derived from AudioSet, with an additional category, Other, which includes non-English queries and NSFW-related content.
survey_stimuli_data.zip
This ZIP file contains three CSV files corresponding to the three stimulus types used in the survey.
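The files can be combined via the documented key columns; a minimal pandas sketch, assuming the CSV layouts described above:

import pandas as pd

participants = pd.read_csv("participants.csv")
annotations = pd.read_csv("annotations.csv")
merged = annotations.merge(
    participants, left_on="participant_id", right_on="id",
    suffixes=("_annotation", "_participant"),
)
# Example: average length (in words) of the initial query per stimulus type.
merged["q1_len"] = merged["query1"].str.split().str.len()
print(merged.groupby("stimulus_type")["q1_len"].mean())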
More details on the stimuli and the survey methodology can be found in the accompanying paper.
If you use this dataset in your research, please cite the corresponding paper:
B. Weck and F. Font, ‘The Language of Sound Search: Examining User Queries in Audio Search Engines’, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), Tokyo, Japan, Oct. 2024, pp. 181–185.
@inproceedings{Weck2024,
author = "Weck, Benno and Font, Frederic",
title = "The Language of Sound Search: Examining User Queries in Audio Search Engines",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
address = "Tokyo, Japan",
month = "October",
year = "2024",
pages = "181--185"
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Fashion Magazine Library Management: Operators of a large fashion magazine library can use the VOGUE_PK model to catalog their extensive collection. It can help to classify different editions by issue date, identify styles from specific stylists or designers, and even recognize featured models. This would simplify the process of finding specific issues or fashion styles.
Style Tracking and Analysis: Fashion researchers, analysts, and enthusiasts could use this model to track and analyze the evolution of styles by a particular designer or stylist over time. By identifying the designer or stylist in multiple issues, users can study trends, predict future fashion movements, or create comprehensive style portfolios.
Education and Training: Fashion design students or professionals could use this model as a learning tool to study and analyze the distinct characteristics of various famous designers and stylists' work in different issue dates.
Image-Based Fashion Search Engines: The "VOGUE_PK" model can be instrumental in constructing a powerful image-based search engine. Users could upload an image and receive similar styles, designers, models, and the specific stylist involved in those similar styles.
Content Creation: Fashion content creators, such as bloggers and journalists, can use the model to easily identify the key details about images they're using in articles, posts, or other content. The model can help to ensure that designer, model, stylist, and issue date are correctly attributed.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
911 Public Safety Answering Point (PSAP) service area boundaries in the United States. According to the National Emergency Number Association (NENA), a Public Safety Answering Point (PSAP) is a facility equipped and staffed to receive 9-1-1 calls. The service area is the geographic area within which a 911 call placed using a landline is answered at the associated PSAP. This dataset only includes primary PSAPs. Secondary PSAPs, backup PSAPs, and wireless PSAPs have been excluded from this dataset. Primary PSAPs receive calls directly, whereas secondary PSAPs receive calls that have been transferred by a primary PSAP. Backup PSAPs provide service in cases where another PSAP is inoperable.
Most military bases have their own emergency telephone systems. To connect to such a system from within a military base, it may be necessary to dial a number other than 9-1-1. Due to the sensitive nature of military installations, TGS did not actively research these systems. If civilian authorities in surrounding areas volunteered information about these systems, or if adding a military PSAP was necessary to fill a hole in civilian-provided data, TGS included it in this dataset. Otherwise, military installations are depicted as being covered by one or more adjoining civilian emergency telephone systems.
In some cases, areas are covered by more than one PSAP boundary. In these cases, any of the applicable PSAPs may take a 911 call. Where a specific call is routed may depend on how busy the applicable PSAPs are (i.e., load balancing), operational status (i.e., redundancy), or time of day / day of week. If an area does not have 911 service, TGS included that area in the dataset along with the address and phone number of its dispatch center. These are areas where someone must dial a 7- or 10-digit number to reach emergency services. These records can be identified by a "Y" in the [NON911EMNO] field. This indicates that dialing 911 inside one of these areas does not connect one with emergency services.
This dataset was constructed by gathering information about PSAPs from state-level officials. In some cases, this was geospatial information; in other cases, it was tabular. This information was supplemented with a list of PSAPs from the Federal Communications Commission (FCC). Each PSAP was researched to verify its tabular information. In cases where the source data was not geospatial, each PSAP was researched to determine its service area in terms of existing boundaries (e.g., city and county boundaries). In some cases, existing boundaries had to be modified to reflect coverage areas (e.g., "entire county north of Country Road 30"). However, there may be cases where minor deviations from existing boundaries are not reflected in this dataset, such as the case where a particular PSAP's coverage area includes an entire county plus the homes and businesses along a road which is partly in another county.
At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.
Homeland Security Use Cases: Use cases describe how the data may be used and help to define and clarify requirements. 1) A disaster has struck, or is predicted for, a locality. The PSAP that may be affected must be identified and verified to be operational.
2) In the event that the local PSAP is inoperable, adjacent PSAP locations could be identified and utilized.
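The field normalization described above (upper-casing and diacritic replacement) can be reproduced with standard Unicode decomposition; a minimal sketch, assuming NFKD folding approximates the processing TGS applied:

import unicodedata

def normalize_field(value):
    # Decompose characters, drop combining marks, upper-case the rest.
    folded = unicodedata.normalize("NFKD", value)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return ascii_only.upper().strip()

print(normalize_field("  Müñoz Peña  "))  # -> "MUNOZ PENA"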
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bridges-Rail in the United States. According to The National Bridge Inspection Standards published in the Code of Federal Regulations (23 CFR 650.3), a bridge is: A structure including supports erected over a depression or an obstruction, such as water, highway, or railway, and having a track or passageway for carrying traffic or other moving loads.
Each bridge was captured as a point which was placed in the center of the "main span" (highest and longest span). For bridges that cross navigable waterways, this was typically the part of the bridge over the navigation channel. If no "main span" was discernable using the imagery sources available, or if multiple non-contiguous main spans were discernable, the point was placed in the center of the overall structure. Bridges sourced from the National Bridge Inventory (NBI) that cross state boundaries are an exception: such bridges are represented in the NBI by two records, and the points for the two records have been located so as to be within the state indicated by the NBI's [STATE_CODE] attribute. In some cases, following these rules did not place the point at the location at which the bridge crosses what the user may judge as the most important feature intersected. For example, a given bridge may be many miles long, crossing nothing more than low-lying ground for most of its length but crossing a major interstate at its far end.
Due to the fact that bridges are often high, narrow structures crossing depressions that may or may not be too narrow to be represented in the DEM used to orthorectify a given source of imagery, alignment with ortho imagery is highly variable. In particular, apparent bridge location in ortho imagery is highly dependent on collection angle. During verification, TechniGraphics used imagery from the following sources: NGA HSIP 133 City, State or Local; NAIP; DOQQ imagery. In cases where "bridge sway" or "tall structure lean" was evident, TGS attempted to compensate for these factors when capturing the bridge location. For instances in which the bridge was not visible in imagery, it was captured using topographic maps at the intersection of the water and rail line.
TGS previously processed 784 entities with the HSIP Bridges-Roads (STRAHNET Option - HSIP 133 Cities and Gulf Coast). These entities were added into this dataset after processing. No entities were included in this dataset for American Samoa, Guam, Hawaii, the Commonwealth of the Northern Mariana Islands, or the Virgin Islands because there are no main line railways in these areas.
At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, leading and trailing spaces were trimmed from all text fields. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.
The currentness of this dataset is given by the publication date, which is 09/02/2009. A more precise measure of currentness cannot be provided since this is dependent on the NBI and the source of imagery used during processing.
Most bottom-up proteomics experiments share two features: the use of trypsin to digest proteins for mass spectrometry, and the statistics-driven matching of the measured peptide fragment spectra against in silico generated spectra derived from a protein database. While this extremely powerful approach, in combination with latest-generation mass spectrometers, facilitates very deep proteome coverage, the assumptions made have to be met to generate meaningful results. One of these assumptions is that the measured spectra indeed have a match in the search space, since the search engine will always report the best match. However, one of the most abundant proteins in the sample, the protease, is often not represented in the employed database. It is therefore widely accepted in the community to include the protease and other common contaminants in the database to avoid false positive matches. Although this approach accounts for unmodified trypsin peptides, the most widely employed trypsin preparations are chemically modified to prevent autolysis and premature activity loss of the protease. In this study we observed numerous spectra of modified trypsin-derived peptides in samples from our laboratory as well as in datasets downloaded from public repositories. In many cases the spectra were assigned to other proteins, often with good statistical significance. We therefore designed a new database search strategy employing an artificial amino acid, which accounts for these peptides with a minimal increase in search space and minimal concomitant loss of statistical significance. Moreover, this approach can be easily implemented into existing workflows for many widely used search engines.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they better capture):
Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)
Citation Count: The total number of citations of the product, the most well-known influence indicator.
PageRank score: An influence indicator based on the PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)
RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)
Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and the window depends on the publication date of the product, i.e., only citations received within the first 3 years after each product's publication are counted.
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).
From version 5.1 onward, the impact indicators are calculated at two levels: for each PID and for each deduplicated product (OpenAIRE id; see below). Previous versions of the dataset only provided the scores at the PID level.
From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).
Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
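The class boundaries above translate directly into percentile bins; the following is a minimal sketch (not the BIP! DB implementation) assigning C1 to C5 from a score column:

import pandas as pd

def impact_class(scores):
    # Fraction of products ranked at or above this score (top score -> smallest value).
    pct = scores.rank(pct=True, ascending=False)
    bins = [0, 0.0001, 0.001, 0.01, 0.10, 1.0]   # top 0.01%, 0.1%, 1%, 10%, rest
    return pd.cut(pct, bins=bins, labels=["C1", "C2", "C3", "C4", "C5"],
                  include_lowest=True)

scores = pd.Series(range(1, 100_001))            # illustrative scores
print(impact_class(scores).value_counts().sort_index())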
Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (by error) to multiple deduplicated records in the OpenAIRE Graph.
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier
From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. We provide finally the "topic_based_impact_classes.txt" file where each line follows the format "identifier
The data used to produce the citation network on which we calculated the provided measures were gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI datasets, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v10 apply a set of filtering rules to remove PIDs with problematic metadata from the dataset.
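For example, one of the per-measure score files could be loaded as sketched below. This assumes tab-separated identifier/score lines and uses a hypothetical file name; both should be adjusted to the actual dump.

```python
import pandas as pd

# Hypothetical file name; the separator and column names are assumptions
scores = pd.read_csv(
    "pagerank_scores.csv.gz",
    sep="\t",
    names=["identifier", "score"],
    compression="gzip",
)
print(scores.nlargest(10, "score"))  # the ten highest-scoring products
```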
References:
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
[2] R. Ghosh, T.-T. Kuo, C.-N. Hsu, S.-D. Lin, and K. Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380.
[3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2020. Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951.
[4] P. Manghi, C. Atzori, M. De Bonis, and A. Bardi. 2020. Entity deduplication in big data graphs for scholarly communication. Data Technologies and Applications.
[5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2019. Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE (early access).
[6] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. (Paul) Hsu, and K. Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243–246. DOI: http://dx.doi.org/10.1145/2740908.2742839
[7] K. Wang et al. 2019. A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data. DOI: 10.3389/fdata.2019.00045
Our academic search engine, BIP! Finder, is built on top of these data. Note also that we provide all calculated scores through BIP! Finder's API.
Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
More details about BIP! DB can be found in our relevant peer-reviewed publication:
Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460
We kindly request that any published research that makes use of BIP! DB cite the above article.
Average estimated yields and associated CV values for current (2018) model runs, based on work done by Harsimran Kaur et al. in 2017. The following is from her thesis: Agro-ecological classes (AECs) of dryland cropping systems in the inland Pacific Northwest have been predicted to become more dynamic with greater use of annual fallow under projected climate change. At the same time, initiatives are being taken by growers either to intensify or diversify their cropping systems using oilseed and grain legume crops. The main objective of this study was to use a mechanistic model (CropSyst) to provide yield and soil water forecasts at regional scales which could compare fallow versus spring crop choices (flex/opportunity crop). Model simulations were based on historic weather data (1981–2010), as well as historic data combined with actual-year weather data, for simulations at pre-planting dates starting in Dec. for representative years. Yield forecasts of spring pea, canola and wheat were compared, via linear regression analysis, to yield simulations using only the weather of the representative year, to assess pre-plant forecasts. Crop yield projections made on the pre-plant forecast date of Feb. 1st had a higher R² against yields simulated using the actual year's weather data, and lower CVs across the region, compared to forecasts based on historic weather data and other pre-season forecast dates (Dec. 1st and Jan. 1st). Therefore, Feb. 1st was considered the most reliable time to predict yield and other relevant outputs, such as available water forecasts, on a regional scale. Regional forecast maps of predicted spring crop yields and CVs showed ranges of 1 to 4367 kg/ha and 11 to 293% for spring canola, 72 to 2646 kg/ha and 11 to 143% for spring pea, and 39 to 5330 kg/ha and 11 to 158% for spring wheat across the study region for a representative year. These data, combined with predicted available water after fallow and following spring crop yield, as well as estimates of winter wheat yield reduction, would collectively serve as information contributing to decisions related to crop intensification and diversification. Resources in this dataset: Resource Title: GeoData catalog record. File Name: Web Page, URL: https://geodata.nal.usda.gov/geonetwork/srv/eng/catalog.search#/metadata/459d2dba-a346-4e54-9750-ef3178c18f38
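The forecast-evaluation step described above (regressing pre-plant forecasts against actual-year simulations and comparing R² and CV) can be sketched as follows; the yield values here are illustrative only, not taken from the dataset:

```python
import numpy as np
from scipy.stats import linregress

# Illustrative yields (kg/ha) for a handful of locations -- not dataset values
actual_weather_yield = np.array([1200.0, 2100.0, 950.0, 3050.0, 1800.0])
feb1_forecast = np.array([1150.0, 2230.0, 900.0, 2900.0, 1750.0])

fit = linregress(feb1_forecast, actual_weather_yield)
r_squared = fit.rvalue ** 2                                    # goodness of the forecast
cv = 100.0 * feb1_forecast.std(ddof=1) / feb1_forecast.mean()  # coefficient of variation, %
print(f"R² = {r_squared:.3f}, CV = {cv:.1f}%")
```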
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive or negative light? What are the most commonly used words and phrases to describe each country? In this dataset I present tweets in which a certain country is mentioned in the hashtags (e.g. #HongKong, #NewZealand). It covers around 150 countries. I've added an additional field called polarity, which holds the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
- The date at which the tweet was created is in the created_at field.
- The search query used to query the Twitter Search Engine is in the search_query field.
- The tweet's full text is in the text field.
- The sentiment is in the polarity field (I've used the VADER model from NLTK to compute this).
There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.
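For reference, here is a minimal sketch of how the polarity field could be computed and duplicate tweet IDs dropped. The per-country file name and the "id" column name are assumptions for the example:

```python
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

tweets = pd.read_csv("HongKong.csv")          # hypothetical per-country file
tweets = tweets.drop_duplicates(subset="id")  # "id" column name is an assumption
tweets["polarity"] = tweets["text"].map(lambda t: sia.polarity_scores(t)["compound"])
print(tweets[["created_at", "text", "polarity"]].head())
```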
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the data lake via AWS, or for other data shenanigans.
Here's an App I built using a live version of this data.
The Small Business Administration maintains the Dynamic Small Business Search (DSBS) database. When a small business registers in the System for Award Management, it has the opportunity to fill out its small business profile; the information provided populates DSBS. DSBS is another tool contracting officers use to identify potential small business contractors for upcoming contracting opportunities. Small businesses can also use DSBS to identify other small businesses for teaming and joint venturing.
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset consists of a review of case studies and descriptions of coral restoration methods from four sources: 1) the primary literature (i.e. published peer-reviewed scientific literature), 2) grey literature (e.g. scientific reports and technical summaries from experts in the field), 3) online descriptions (e.g. blogs and online videos describing projects), and 4) an online survey targeting restoration practitioners (doi:10.5061/dryad.p6r3816).
Only those case studies which actively conducted coral restoration (i.e. at least one stage of scleractinian coral life-history was involved) are included. This excludes indirect coral restoration projects, such as disturbance mitigation (e.g. predator removal, disease control, etc.) and passive restoration interventions (e.g. enforcement of controls against dynamite fishing, or water quality improvement). It also excludes many artificial reefs, in particular where the aim was fisheries enhancement (i.e. fish aggregation devices) or corals were not included in the method. To the best of our abilities, duplication of case studies was avoided across the four separate sources, so that each case in the review and database represents a separate project.
This dataset is currently under embargo until the review manuscript is published.
Methods: More than 40 separate categories of data were recorded from each case study and entered into a database. These included data on (1) the information source, (2) the case study particulars (e.g. location, duration, spatial scale, objectives, etc.), (3) specific details about the methods, (4) coral details (e.g. genus, species, morphology), (5) monitoring details, and (6) the outcomes and conclusions.
Primary literature: Multiple search engines were used to achieve the most complete coverage of the scientific literature. First, the scientific literature was searched using Google Scholar with the keywords “coral* + restoration”. Because the field (and therefore the search results) is dominated by transplantation studies, separate searches were then conducted for other common techniques using “coral* + restoration + [technique name]”. This search was further complemented by using the same keywords in ISI Web of Knowledge (search yield n=738). Studies that fulfilled our criteria for active coral restoration described above were then manually selected (final yield n=221). In cases where a single paper described several different projects or methods, these were split into separate case studies. Finally, prior reviews of coral restoration were consulted to obtain case studies from their reference lists.
Grey literature: While many reports appeared in the Google Scholar literature searches, we also searched The Nature Conservancy (TNC) database of reports for North American coastal restoration projects (http://projects.tnc.org/coastal/). This was supplemented with reports listed in the reference lists of other papers, reports and reviews, or found during the online searches (n=30).
Online records: Small-scale projects conducted without substantial input from researchers, academics, non-governmental organisations (NGOs) or coral reef managers often do not result in formal written accounts of methods. To access this information, we conducted online searches of YouTube, Facebook and Google using the search term “Coral restoration”. The information provided in videos, blog posts and websites describing further projects (n=48) was also used. Due to the unverified nature of such accounts, the data collected from these online-only records were limited compared to the peer-reviewed literature and surveys. At a minimum, the location, the methods used, and the reported outcomes or lessons learned were included in this review.
Online survey: To access information from projects not published elsewhere, we designed an online survey targeting restoration practitioners. The survey consisted of 25 questions querying restoration practitioners about projects they had undertaken, and was conducted under JCU human ethics approval H7218 (following the Australian National Statement on Ethical Conduct in Human Research, 2007). These data (n=63) are included in all calculations within this review, but are not publicly available, to preserve the anonymity of participants. Although we encouraged participants to fill out a separate survey for each case study, it is possible that participants included multiple separate projects in a single survey, which may reduce the real number of case studies reported.
Data analysis: Percentages, counts and other quantifications from the database refer to the total number of case studies with data in that category. Case studies lacking data for the category in question, or lacking appropriate detail (e.g. reporting ‘mixed’ for coral genera), are not included in calculations. Many categories allowed multiple answers (e.g. coral species); these were split into separate records for calculations, as sketched below. For this reason, absolute numbers may exceed the number of case studies in the database; percentages, however, reflect the proportion of case studies in each category. We used the seven objectives outlined in [1] to classify the objective of each case study, with two additional categories (‘scientific research’ and ‘ecological engineering’). We used Tableau to visualise and analyse the database (Desktop Professional Edition, version 10.5, Tableau Software). The data have been made available following the FAIR Guiding Principles for scientific data management and stewardship [2]. The data are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.p6r3816) and can be explored visually at: https://public.tableau.com/views/CoralRestorationDatabase-Visualisation/Coralrestorationmethods?:embed=y&:display_count=yes&publish=yes&:showVizHome=no#1.
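A minimal sketch of this counting convention, with hypothetical column names: multi-answer categories are split ("exploded") into separate records, while percentages stay relative to the case studies that have data for the category:

```python
import pandas as pd

# Hypothetical excerpt of the database; column names are assumptions
db = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "coral_genus": ["Acropora; Pocillopora", "Acropora", None, "Porites"],
})

known = db.dropna(subset=["coral_genus"]).copy()  # drop cases lacking data for the category
known["coral_genus"] = known["coral_genus"].str.split("; ")
records = known.explode("coral_genus")            # one record per answer

# Percentages reflect the proportion of case studies with data in the category
pct = 100 * records.groupby("coral_genus")["case_id"].nunique() / known["case_id"].nunique()
print(pct)
```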
Limitations: While our expanded search enabled us to avoid the bias from the more limited published literature, we acknowledge that using sources that have not undergone rigorous peer-review potentially introduces another bias. Many government reports undergo an informal peer-review; however, survey results and online descriptions may present a subjective account of restoration outcomes. To reduce subjective assessment of case studies, we opted not to interpret results or survey answers, instead only recording what was explicitly stated in each document [3, 4].
Defining restoration: In this review, active restoration methods are methods which reintroduce coral (e.g. coral fragment transplantation, or larval enhancement) or augment coral assemblages (e.g. substrate stabilisation, or algal removal), for the purposes of restoring the reef ecosystem. In the published literature and elsewhere, there are many terms that describe the same intervention. For clarity, we provide the terms we have used in the review, their definitions, and alternative terms (see references). Passive restoration methods such as predator removal (e.g. crown-of-thorns starfish and Drupella control) have been excluded, unless they were conducted in conjunction with active restoration (e.g. macroalgal removal combined with transplantation).
Format: The data are supplied as an Excel file with three separate tabs: 1) peer-reviewed literature, 2) grey literature, and 3) a description of the objectives from Hein et al. 2017. Survey responses have been excluded to preserve the anonymity of the respondents.
This dataset is a database that underpins a 2018 report and a 2019 published review of coral restoration methods from around the world. - Bostrom-Einarsson L, Ceccarelli D, Babcock R.C., Bayraktarov E, Cook N, Harrison P, Hein M, Shaver E, Smith A, Stewart-Sinclair P.J, Vardi T, McLeod I.M. 2018 - Coral restoration in a changing world - A global synthesis of methods and techniques, report to the National Environmental Science Program. Reef and Rainforest Research Centre Ltd, Cairns (63pp.). - The review manuscript is currently under review.
Data Dictionary: The Data Dictionary is embedded in the Excel spreadsheet. Comments are included in the column titles to aid interpretation, and/or refer to additional information tabs. For more information on each column, open the comment indicated by the red triangle (located at the top right of the cell).
References: 1. Hein MY, Willis BL, Beeden R, Birtles A. The need for broader ecological and socioeconomic tools to evaluate the effectiveness of coral restoration programs. Restoration Ecology. 2017;25:873–883. doi:10.1111/rec.12580 2. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3:160018. doi:10.1038/sdata.2016.18 3. Miller RL, Marsh H, Cottrell A, Hamann M. Protecting Migratory Species in the Australian Marine Environment: A Cross-Jurisdictional Analysis of Policy and Management Plans. Front Mar Sci. 2018;5:211. doi:10.3389/fmars.2018.00229 4. Ortega-Argueta A, Baxter G, Hockings M. Compliance of Australian threatened species recovery plans with legislative requirements. Journal of Environmental Management. 2011;92:2054–2060.
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\2018-2021-NESP-TWQ-4\4.3_Best-practice-coral-restoration