Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of "10 blue links" has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive. The Internet Archive has been keeping snapshots, and the respective HTML versions, of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver for browser automation to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot.
The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a folder of supporting files. For file naming, we concatenate the initial of the search engine (e.g., G for Google) with the capture's timestamp. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007; the first is identified by "G20070330145203".
Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. An accompanying graphic represents the diversity of captures by year and search engine (Google and Bing).
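The capture-naming scheme above is easy to parse programmatically. The following is a minimal Python sketch (not part of the dataset) that splits a capture identifier into its engine initial, timestamp, and optional repeat counter; mapping "B" to Bing is our assumption based on the two engines covered.

import re

# Hypothetical helper: parse capture IDs such as "G20070330145203-1".
# "B" for Bing is an assumption; the description only shows "G" for Google.
CAPTURE_RE = re.compile(r"^(?P<engine>[GB])(?P<ts>\d{14})(?:-(?P<seq>\d+))?$")

def parse_capture_id(capture_id):
    m = CAPTURE_RE.match(capture_id)
    if m is None:
        raise ValueError("not a capture ID: %r" % capture_id)
    return {
        "engine": {"G": "Google", "B": "Bing"}[m.group("engine")],
        "timestamp": m.group("ts"),            # YYYYMMDDHHMMSS
        "sequence": int(m.group("seq") or 0),  # 0 = first capture at this timestamp
    }

print(parse_capture_id("G20070330145203-1"))
# {'engine': 'Google', 'timestamp': '20070330145203', 'sequence': 1}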
You can check the field descriptions in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.
Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.
Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.
Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.
This database is available in JSON format only.
You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL of the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times (approximately half of the total number of topics). The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; the location is saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
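As an illustration of this collection procedure, the sketch below queries an autocomplete endpoint for a root term and its a-to-z extensions. It is a minimal sketch, not the authors' collection code: the unofficial Google suggest endpoint and its response shape ([query, [suggestions, ...]]) are assumptions, and Bing would require its own endpoint.

import string
import requests

SUGGEST_URL = "https://suggestqueries.google.com/complete/search"  # unofficial endpoint

def suggestions(query):
    # Return the suggestion list for one query input.
    resp = requests.get(SUGGEST_URL, params={"client": "firefox", "q": query})
    resp.raise_for_status()
    return resp.json()[1]

root = "democrats"
inputs = [root] + ["%s %s" % (root, c) for c in string.ascii_lowercase]
for query_input in inputs:                        # up to 27 inputs per root term
    for rank, s in enumerate(suggestions(query_input), start=1):
        print(root, query_input, s, rank)         # up to ~270 suggestions per engine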
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which collects all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
https://dataintelo.com/privacy-and-policy
The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies is transforming the way search engines operate, offering more personalized and efficient search results and thereby further propelling market growth.
One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.
Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.
The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.
The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.
Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.
Arabic news data was collected using web scraping techniques from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news surfaced by the Google search engine, and from various other sources.
UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.
UltimateArabicPrePros: A file containing the same data as the first file but after pre-processing, reduced to about 188,000 text documents; stop words, non-Arabic words, symbols, and numbers have been removed so that this file is ready for direct use in various Arabic natural language processing tasks such as text classification.
1- Sample: This folder contains samples of the results of web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:
Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.
2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification techniques. The first version (Original) contains the raw data without any pre-processing, so its number of tokens is very high. In the second version (Original_without_Stop) the data was cleaned by removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of tokens is greatly reduced. In the third version (Original_with_Stem) the data was cleaned and a text stemming technique was applied to remove affixes that might affect the accuracy of the results and to obtain the word roots. In the fourth version (Original_Without_Stop_Stem) all pre-processing techniques (data cleaning, stop-word removal, and text stemming) were applied, so the number of tokens in this version is the lowest among all releases.
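For readers who want to replicate the pre-processing steps described above, the following is a minimal sketch, assuming NLTK's Arabic stop-word list and ISRI stemmer as stand-ins; the dataset authors' exact tools are not stated.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)
AR_STOPS = set(stopwords.words("arabic"))
STEMMER = ISRIStemmer()

def preprocess(text):
    # Keep only Arabic-script characters and whitespace
    # (drops Latin characters, Western digits, and most symbols).
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    tokens = [t for t in text.split() if t not in AR_STOPS]  # stop-word removal
    return " ".join(STEMMER.stem(t) for t in tokens)         # reduce words to roots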
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
- compare how much data is in institutional vs. domain-specific vs. federal platforms;
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain-specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data”, and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories.
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset.
Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package for our article The State of Serverless Applications: Collection, Characterization, and Community Consensus provides everything required to reproduce all results for the following three studies:
Serverless Application Collection
We collect descriptions of serverless applications from open-source projects, academic literature, industrial literature, and scientific computing.
Open-source Applications
As a starting point, we used an existing data set on open-source serverless projects from this study. We removed small and inactive projects based on the number of files, commits, contributors, and watchers. Next, we manually filtered the resulting data set to include only projects that implement serverless applications. We provide a table containing all projects that remained after the filtering alongside the notes from the manual filtering.
Academic Literature Applications
We based our search on an existing community-curated dataset on literature for serverless computing consisting of over 180 peer-reviewed articles. First, we filtered the articles based on title and abstract. In a second iteration, we filtered out any articles that implement only a single function for evaluation purposes or do not include sufficient detail to enable a review. As the authors were familiar with some additional publications describing serverless applications, we contributed them to the community-curated dataset and included them in this study. We provide a table with our notes from the manual filtering.
Scientific Computing Applications
Most of these scientific computing serverless applications are still at an early stage, and therefore little public data is available. One of the authors was employed at the German Aerospace Center (DLR) at the time of writing, which allowed us to collect information about several projects at DLR that are either currently moving to serverless solutions or are planning to do so. Additionally, an application from the German Electron Synchrotron (DESY) could be included. For each of these scientific computing applications, we provide a document containing a description of the project and the names of the contacts who provided information for the characterization of these applications.
Collection of serverless applications
Based on the previously described methodology, we collected a diverse dataset of 89 serverless applications from open-source projects, academic literature, industrial literature, and scientific computing. This dataset can be found in Dataset.xlsx.
Serverless Application Characterization
As previously described, we collected 89 serverless applications from four different sources. Subsequently, two randomly assigned reviewers out of seven available reviewers characterized each application along 22 characteristics in a structured collaborative review sheet. The characteristics and potential values were defined a priori by the authors and iteratively refined, extended, and generalized during the review process. The initial moderate inter-rater agreement was followed by a discussion and consolidation phase, where all differences between the two reviewers were discussed and resolved. The six scientific applications were not publicly available and therefore characterized by a single domain expert, who is either involved in the development of the applications or in direct contact with the development team.
Initial Ratings & Interrater Agreement Calculation
The initial reviews are available as a table, where every application is characterized along the 22 characteristics. A single value indicates that both reviewers assigned the same value, whereas a value of the form "[Reviewer 2] A | [Reviewer 4] B" indicates that, for this characteristic, reviewer two assigned the value A, whereas reviewer four assigned the value B.
Our script for the calculation of the Fleiss' kappa score based on this data is also publicly available. It requires the Python packages pandas and statsmodels. It does not require any input and assumes that the file Initial Characterizations.csv is located in the same folder. It can be executed as follows:
python3 CalculateKappa.py
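For reference, a minimal sketch of a Fleiss' kappa computation with statsmodels is shown below; the toy ratings matrix stands in for the real per-characteristic reviewer assignments that CalculateKappa.py reads from Initial Characterizations.csv.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: 4 applications rated by 2 reviewers on one characteristic.
ratings = np.array([
    ["HTTP", "HTTP"],    # agreement
    ["Queue", "HTTP"],   # conflict (resolved later during consolidation)
    ["Timer", "Timer"],
    ["HTTP", "HTTP"],
])
counts, _ = aggregate_raters(ratings)   # -> (subjects x categories) count table
print(round(fleiss_kappa(counts), 3))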
Results Including Unknown Data
In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find.
For six characteristics, many applications were assigned the ''Unknown'' value, i.e., the reviewers were not able to determine the value of this characteristic. Therefore, we excluded these characteristics from this study. For the remaining characteristics, the percentage of ''Unknowns'' ranges from 0–19% with two outliers at 25% and 30%. These ''Unknowns'' were excluded from the percentage values presented in the article. As part of our replication package, we provide the raw results for each characteristic including the ''Unknown'' percentages in the form of bar charts.
The script for the generation of these bar charts is also part of this replication package. It uses the Python packages pandas, numpy, and matplotlib. It does not require any input and assumes that the file Dataset.csv is located in the same folder. It can be executed as follows:
python3 GenerateResultsIncludingUnknown.py
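A minimal sketch of one such bar chart is given below; the column name "Trigger Type" is illustrative, not necessarily one of the 22 characteristics.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Dataset.csv")
# Share of each value for one characteristic, ''Unknown'' included.
shares = df["Trigger Type"].value_counts(normalize=True) * 100
ax = shares.plot(kind="bar")
ax.set_ylabel("% of applications")
plt.tight_layout()
plt.savefig("trigger_type.png")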
Final Dataset & Figure Generation
Following the discussion and consolidation phase described above, we were able to resolve all conflicts, resulting in a collection of 89 applications described by 18 characteristics. This dataset is available here: link
The script to generate all figures shown in the chapter "Serverless Application Characterization" can be found here. It does not require any input but assumes that the file Dataset.csv is located in the same folder. It uses the Python packages pandas, numpy, and matplotlib. It can be executed as follows:
python3 GenerateFigures.py
Comparison Study
To identify existing surveys and datasets that also investigate one of our characteristics, we conducted a literature search using Google as our search engine, as we were mostly looking for grey literature. We used the following search term:
("serverless" OR "faas") AND ("dataset" OR "survey" OR "report") after: 2018-01-01
This search term looks for any combination of either serverless or faas alongside any of the terms dataset, survey, or report. We further limited the search to articles published after 2017, as serverless is a fast-moving field and any older studies are therefore likely outdated already. This search term resulted in a total of 173 search results. In order to validate whether using only a single search engine is sufficient, and whether the search term is broad enough, we
AG is a collection of more than 1 million news articles. The news articles have been gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000, and the total number of testing samples is 7,600.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ag_news_subset', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
https://creativecommons.org/publicdomain/zero/1.0/
By dbpedia_14 (From Huggingface) [source]
The DBpedia Ontology Classification Dataset, known as dbpedia_14, is a comprehensive and meticulously constructed dataset containing a vast collection of text samples. These samples have been expertly classified into 14 distinct and non-overlapping classes. The dataset draws its information from the highly reliable and up-to-date DBpedia 2014 knowledge base, ensuring the accuracy and relevance of the data.
Each text sample in this extensive dataset consists of various components that provide valuable insights into its content. These components include a title, which succinctly summarizes the main topic or subject matter of the text sample, and content that comprehensively covers all relevant information related to a specific topic.
To facilitate effective training of machine learning models for text classification tasks, each text sample is further associated with a corresponding label. This categorical label serves as an essential element for supervised learning algorithms to classify new instances accurately.
Furthermore, this exceptional dataset is part of the larger DBpedia Ontology Classification Dataset with 14 Classes (dbpedia_14). It offers numerous possibilities for researchers, practitioners, and enthusiasts alike to conduct in-depth analyses ranging from sentiment analysis to topic modeling.
Aspiring data scientists will find great value in utilizing this well-organized dataset for training their machine learning models. Although specific details about train.csv and test.csv files are not provided here due to their dynamic nature, they play pivotal roles during model training and testing processes by respectively providing labeled training samples and unseen test samples.
Lastly, it's worth mentioning that users can refer to the included classes.txt file within this dataset for an exhaustive list of all 14 classes used in classifying these diverse text samples accurately.
Overall, with its wealth of carefully curated textual data across multiple domains and precise class labels assigned based on well-defined categories derived from the DBpedia 2014 knowledge base, the DBpedia Ontology Classification Dataset (dbpedia_14) proves instrumental in advancing research efforts related to natural language processing (NLP), text classification, and other related fields.
- Text classification: The DBpedia Ontology Classification Dataset can be used to train machine learning models for text classification tasks. With 14 different classes, the dataset is suitable for various classification tasks such as sentiment analysis, topic classification, or intent detection.
- Ontology development: The dataset can also be used to improve or expand existing ontologies. By analyzing the text samples and their assigned labels, researchers can identify missing or incorrect relationships between concepts in the ontology and make improvements accordingly.
- Semantic search engine: The DBpedia knowledge base is widely used in semantic search engines that aim to provide more accurate and relevant search results by understanding the meaning of user queries and matching them with structured data. This dataset can help in training models for improving the performance of these semantic search engines by enhancing their ability to classify and categorize information accurately based on user queries.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: train.csv
| Column name | Description |
|:--------------|:---------------------------------------------------------------------------------------------------------|
| label | The class label assigned to each text sample. (Categorical) |
| title | The heading or name given to each text sample, providing some context or overview of its content. (Text) |
File: test.csv
| Column name | Description |
|:--------------|:-----------------------...
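A minimal loading sketch, assuming the conventional header-less CSV layout with 1-based labels (the description above does not pin this down):

import pandas as pd

# classes.txt lists the 14 class names, one per line (per the description).
classes = [line.strip() for line in open("classes.txt", encoding="utf-8")]
train = pd.read_csv("train.csv", names=["label", "title", "content"])
train["class_name"] = train["label"].map(lambda i: classes[i - 1])
print(train[["class_name", "title"]].head())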
The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by Yandex LLC, Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled. Pivot words are words with special characteristics that are used to estimate the number of words in corpora. Pivot words used for the estimation of the number of words indexed by GSE should meet the following criteria:
1) frequency of occurrence between 10 and 100;
2) do not coincide with regular words in another language;
3) longer than 6 letters;
4) not of international origin;
5) not foreign loanwords;
6) not proper names of any kind;
7) not headword forms;
8) contain only basic Latin letters;
9) not specific to a particular domain or time period;
10) do not coincide with variants of other words when diacritics are removed;
11) not words that, when commonly misspelled, coincide with words in other languages.
The low frequency of pivot words is crucial for treating the count of document matches reported by a GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published in https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf.
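The estimation logic behind the pivot-word method can be summarized in a few lines. This is a sketch of the scaling idea only, with illustrative numbers; the reference corpus size and per-word counts below are assumptions, not values from the resource.

from statistics import median

W = 1_000_000_000   # words in a reference corpus of known size (assumption)
pivots = [          # (occurrences in reference corpus, document matches from GSE)
    (12, 3_400), (47, 15_200), (88, 24_100),
]
# Each pivot scales the reference corpus by matches/frequency; the median
# over all pivots damps outliers caused by noisy match counts.
estimates = [W * m / f for f, m in pivots]
print("estimated words indexed: %.0f" % median(estimates))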
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper titled "The Language of Sound Search: Examining User Queries in Audio Search Engines." The study investigates user-generated textual queries within the context of sound search engines, which are commonly used for applications such as foley, sound effects, and general audio retrieval.
The paper addresses the gap in current research regarding the real-world needs and behaviors of users when designing text-based audio retrieval systems. By analyzing search queries collected from two sources — a custom survey and Freesound query logs — the study provides insights into user behavior in sound search contexts. Our findings reveal that users tend to formulate longer and more detailed queries when not constrained by existing systems, and that both survey and Freesound queries are predominantly keyword-based.
This dataset contains the raw data collected from the survey and annotations of Freesound query logs.
The dataset includes the following files:
participants.csv
Contains data from the survey participants. Columns:
- id: A unique identifier for each participant.
- fluency: Self-reported English language proficiency.
- experience: Whether the participant has used online sound libraries before.
- passed_instructions: Boolean value indicating whether the participant advanced past the instructions page in the survey.
annotations.csv
Contains annotations of the survey responses, detailing the participants' interaction with the sound search tasks. Columns:
- id: A unique identifier for each annotation.
- participant_id: Links to the participant's ID in participants.csv.
- stimulus_id: Identifier for the stimulus presented to the participant (audio, image, or text description).
- stimulus_type: The type of stimulus (audio, image, text).
- audio_result_id: Identifier for the hypothetical audio result presented during the search task.
- query1: Initial search query submitted based on the stimulus.
- query2: Refined search query after seeing the hypothetical search result.
- aspects1: Aspects considered important when formulating the initial query.
- aspects2: Aspects considered important when refining the query.
- result_relevance: Participant's rating of the hypothetical search result's relevance.
- time: Time taken to complete the search task.
freesound_queries_annotated.csv
Contains annotated Freesound search queries. Columns:
- query: Text of the search query submitted to Freesound.
- count: The number of times the specific query was submitted.
- topic: Annotated topic of the query, based on an ontology derived from AudioSet, with an additional category, Other, which includes non-English queries and NSFW-related content.
survey_stimuli_data.zip
This ZIP file contains three CSV files corresponding to the three stimulus types used in the survey.
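The files can be combined via the documented key columns; a minimal pandas sketch, assuming the CSV layouts described above:

import pandas as pd

participants = pd.read_csv("participants.csv")
annotations = pd.read_csv("annotations.csv")
merged = annotations.merge(
    participants, left_on="participant_id", right_on="id",
    suffixes=("_annotation", "_participant"),
)
# Example: average length (in words) of the initial query per stimulus type.
merged["q1_len"] = merged["query1"].str.split().str.len()
print(merged.groupby("stimulus_type")["q1_len"].mean())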
More details on the stimuli and the survey methodology can be found in the accompanying paper.
If you use this dataset in your research, please cite the corresponding paper:
B. Weck and F. Font, ‘The Language of Sound Search: Examining User Queries in Audio Search Engines’, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), Tokyo, Japan, Oct. 2024, pp. 181–185.
@inproceedings{Weck2024,
author = "Weck, Benno and Font, Frederic",
title = "The Language of Sound Search: Examining User Queries in Audio Search Engines",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
address = "Tokyo, Japan",
month = "October",
year = "2024",
pages = "181--185"
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Fashion Magazine Library Management: Operators of a large fashion magazine library can use the VOGUE_PK model to catalog their extensive collection. It can help to classify different editions by issue date, identify styles from specific stylists or designers, and even recognize featured models. This would simplify the process of finding specific issues or fashion styles.
Style Tracking and Analysis: Fashion researchers, analysts, and enthusiasts could use this model to track and analyze the evolution of styles by a particular designer or stylist over time. By identifying the designer or stylist in multiple issues, users can study trends, predict future fashion movements, or create comprehensive style portfolios.
Education and Training: Fashion design students or professionals could use this model as a learning tool to study and analyze the distinct characteristics of various famous designers and stylists' work in different issue dates.
Image-Based Fashion Search Engines: The "VOGUE_PK" model can be instrumental in constructing a powerful image-based search engine. Users could upload an image and receive similar styles, designers, models, and the specific stylist involved in those similar styles.
Content Creation: Fashion content creators, such as bloggers and journalists, can use the model to easily identify the key details about images they're using in articles, posts, or other content. The model can help to ensure that designer, model, stylist, and issue date are correctly attributed.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
911 Public Safety Answering Point (PSAP) service area boundaries in the United States. According to the National Emergency Number Association (NENA), a Public Safety Answering Point (PSAP) is a facility equipped and staffed to receive 9-1-1 calls. The service area is the geographic area within which a 911 call placed using a landline is answered at the associated PSAP. This dataset only includes primary PSAPs. Secondary PSAPs, backup PSAPs, and wireless PSAPs have been excluded from this dataset. Primary PSAPs receive calls directly, whereas secondary PSAPs receive calls that have been transferred by a primary PSAP. Backup PSAPs provide service in cases where another PSAP is inoperable.
Most military bases have their own emergency telephone systems. To connect to such a system from within a military base, it may be necessary to dial a number other than 9-1-1. Due to the sensitive nature of military installations, TGS did not actively research these systems. If civilian authorities in surrounding areas volunteered information about these systems, or if adding a military PSAP was necessary to fill a hole in civilian-provided data, TGS included it in this dataset. Otherwise, military installations are depicted as being covered by one or more adjoining civilian emergency telephone systems.
In some cases, areas are covered by more than one PSAP boundary. In these cases, any of the applicable PSAPs may take a 911 call. Where a specific call is routed may depend on how busy the applicable PSAPs are (i.e., load balancing), operational status (i.e., redundancy), or time of day / day of week. If an area does not have 911 service, TGS included that area in the dataset along with the address and phone number of its dispatch center. These are areas where someone must dial a 7- or 10-digit number to reach emergency services. These records can be identified by a "Y" in the [NON911EMNO] field. This indicates that dialing 911 inside one of these areas does not connect one with emergency services.
This dataset was constructed by gathering information about PSAPs from state-level officials. In some cases, this was geospatial information; in other cases, it was tabular. This information was supplemented with a list of PSAPs from the Federal Communications Commission (FCC). Each PSAP was researched to verify its tabular information. In cases where the source data was not geospatial, each PSAP was researched to determine its service area in terms of existing boundaries (e.g., city and county boundaries). In some cases, existing boundaries had to be modified to reflect coverage areas (e.g., "entire county north of Country Road 30"). However, there may be cases where minor deviations from existing boundaries are not reflected in this dataset, such as the case where a particular PSAP's coverage area includes an entire county plus the homes and businesses along a road which is partly in another county.
At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.
Homeland Security Use Cases: Use cases describe how the data may be used and help to define and clarify requirements. 1) A disaster has struck, or is predicted for, a locality. The PSAP that may be affected must be identified and verified to be operational.
2) In the event that the local PSAP is inoperable, adjacent PSAP locations could be identified and utilized.
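The field normalization described above (upper-casing and diacritic replacement) can be reproduced with standard Unicode decomposition; a minimal sketch, assuming NFKD folding approximates the processing TGS applied:

import unicodedata

def normalize_field(value):
    # Decompose characters, drop combining marks, upper-case the rest.
    folded = unicodedata.normalize("NFKD", value)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return ascii_only.upper().strip()

print(normalize_field("  Müñoz Peña  "))  # -> "MUNOZ PENA"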
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bridges-Rail in the United States. According to The National Bridge Inspection Standards published in the Code of Federal Regulations (23 CFR 650.3), a bridge is: A structure including supports erected over a depression or an obstruction, such as water, highway, or railway, and having a track or passageway for carrying traffic or other moving loads.
Each bridge was captured as a point which was placed in the center of the "main span" (highest and longest span). For bridges that cross navigable waterways, this was typically the part of the bridge over the navigation channel. If no "main span" was discernable using the imagery sources available, or if multiple non-contiguous main spans were discernable, the point was placed in the center of the overall structure. Bridges sourced from the National Bridge Inventory (NBI) that cross state boundaries are an exception: such bridges are represented in the NBI by two records, and the points for the two records have been located so as to be within the state indicated by the NBI's [STATE_CODE] attribute. In some cases, following these rules did not place the point at the location at which the bridge crosses what the user may judge as the most important feature intersected. For example, a given bridge may be many miles long, crossing nothing more than low-lying ground for most of its length but crossing a major interstate at its far end.
Due to the fact that bridges are often high, narrow structures crossing depressions that may or may not be too narrow to be represented in the DEM used to orthorectify a given source of imagery, alignment with ortho imagery is highly variable. In particular, apparent bridge location in ortho imagery is highly dependent on collection angle. During verification, TechniGraphics used imagery from the following sources: NGA HSIP 133 City, State or Local; NAIP; DOQQ imagery. In cases where "bridge sway" or "tall structure lean" was evident, TGS attempted to compensate for these factors when capturing the bridge location. For instances in which the bridge was not visible in imagery, it was captured using topographic maps at the intersection of the water and rail line.
TGS previously processed 784 entities with the HSIP Bridges-Roads (STRAHNET Option - HSIP 133 Cities and Gulf Coast). These entities were added into this dataset after processing. No entities were included in this dataset for American Samoa, Guam, Hawaii, the Commonwealth of the Northern Mariana Islands, or the Virgin Islands because there are no main line railways in these areas.
At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, leading and trailing spaces were trimmed from all text fields. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.
The currentness of this dataset is given by the publication date, which is 09/02/2009. A more precise measure of currentness cannot be provided since this is dependent on the NBI and the source of imagery used during processing.
Most bottom-up proteomics experiments share two features: the use of trypsin to digest proteins for mass spectrometry, and the statistics-driven matching of the measured peptide fragment spectra against in silico generated spectra derived from a protein database. While this extremely powerful approach, in combination with latest-generation mass spectrometers, facilitates very deep proteome coverage, the assumptions made have to be met to generate meaningful results. One of these assumptions is that the measured spectra indeed have a match in the search space, since the search engine will always report the best match. However, one of the most abundant proteins in the sample, the protease, is often not represented in the employed database. It is therefore widely accepted in the community to include the protease and other common contaminants in the database to avoid false positive matches. Although this approach accounts for unmodified trypsin peptides, the most widely employed trypsin preparations are chemically modified to prevent autolysis and premature activity loss of the protease. In this study we observed numerous spectra of modified trypsin-derived peptides in samples from our laboratory as well as in datasets downloaded from public repositories. In many cases the spectra were assigned to other proteins, often with good statistical significance. We therefore designed a new database search strategy employing an artificial amino acid, which accounts for these peptides with a minimal increase in search space and minimal concomitant loss of statistical significance. Moreover, this approach can be easily implemented into existing workflows for many widely used search engines.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they better capture):
Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)
Citation Count: The total number of citations of the product, the most well-known influence indicator.
PageRank score: An influence indicator based on the PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)
RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)
Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and the window depends on the publication date of the product, i.e., only citations received within the first 3 years after each product's publication are counted.
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).
From version 5.1 onward, the impact indicators are calculated at two levels: for each PID and for each deduplicated product (OpenAIRE id; see below). Previous versions of the dataset only provided the scores at the PID level.
From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).
Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
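The class boundaries above translate directly into percentile bins; the following is a minimal sketch (not the BIP! DB implementation) assigning C1 to C5 from a score column:

import pandas as pd

def impact_class(scores):
    # Fraction of products ranked at or above this score (top score -> smallest value).
    pct = scores.rank(pct=True, ascending=False)
    bins = [0, 0.0001, 0.001, 0.01, 0.10, 1.0]   # top 0.01%, 0.1%, 1%, 10%, rest
    return pd.cut(pct, bins=bins, labels=["C1", "C2", "C3", "C4", "C5"],
                  include_lowest=True)

scores = pd.Series(range(1, 100_001))            # illustrative scores
print(impact_class(scores).value_counts().sort_index())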
Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (by error) to multiple deduplicated records in the OpenAIRE Graph.
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier
From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. We provide finally the "topic_based_impact_classes.txt" file where each line follows the format "identifier
The data used to produce the citation network on which we calculated the provided measures were gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI datasets, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v10 apply a set of filtering rules to remove PIDs with problematic metadata from the dataset.
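For example, one of the per-measure score files could be loaded as sketched below. This assumes tab-separated identifier/score lines and uses a hypothetical file name; both should be adjusted to the actual dump.

```python
import pandas as pd

# Hypothetical file name; the separator and column names are assumptions
scores = pd.read_csv(
    "pagerank_scores.csv.gz",
    sep="\t",
    names=["identifier", "score"],
    compression="gzip",
)
print(scores.nlargest(10, "score"))  # the ten highest-scoring products
```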
References:
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
[2] R. Ghosh, T.-T. Kuo, C.-N. Hsu, S.-D. Lin, and K. Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380.
[3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2020. Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951.
[4] P. Manghi, C. Atzori, M. De Bonis, and A. Bardi. 2020. Entity deduplication in big data graphs for scholarly communication. Data Technologies and Applications.
[5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2019. Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE (early access).
[6] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. (Paul) Hsu, and K. Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243–246. DOI: http://dx.doi.org/10.1145/2740908.2742839
[7] K. Wang et al. 2019. A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data. DOI: 10.3389/fdata.2019.00045
Our academic search engine, BIP! Finder, is built on top of these data. Note also that we provide all calculated scores through BIP! Finder's API.
Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
More details about BIP! DB can be found in our relevant peer-reviewed publication:
Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460
We kindly request that any published research that makes use of BIP! DB cite the above article.
Average estimated yields and associated CV values for current (2018) model runs, based on work done by Harsimran Kaur et al. in 2017. The following is from her thesis: Agro-ecological classes (AECs) of dryland cropping systems in the inland Pacific Northwest have been predicted to become more dynamic with greater use of annual fallow under projected climate change. At the same time, initiatives are being taken by growers either to intensify or diversify their cropping systems using oilseed and grain legume crops. The main objective of this study was to use a mechanistic model (CropSyst) to provide yield and soil water forecasts at regional scales which could compare fallow versus spring crop choices (flex/opportunity crop). Model simulations were based on historic weather data (1981–2010), as well as historic data combined with actual-year weather data, for simulations at pre-planting dates starting in Dec. for representative years. Yield forecasts of spring pea, canola and wheat were compared, via linear regression analysis, to yield simulations using only the weather of the representative year, to assess pre-plant forecasts. Crop yield projections made on the pre-plant forecast date of Feb. 1st had a higher R² against yields simulated using the actual year's weather data, and lower CVs across the region, compared to forecasts based on historic weather data and other pre-season forecast dates (Dec. 1st and Jan. 1st). Therefore, Feb. 1st was considered the most reliable time to predict yield and other relevant outputs, such as available water forecasts, on a regional scale. Regional forecast maps of predicted spring crop yields and CVs showed ranges of 1 to 4367 kg/ha and 11 to 293% for spring canola, 72 to 2646 kg/ha and 11 to 143% for spring pea, and 39 to 5330 kg/ha and 11 to 158% for spring wheat across the study region for a representative year. These data, combined with predicted available water after fallow and following spring crop yield, as well as estimates of winter wheat yield reduction, would collectively serve as information contributing to decisions related to crop intensification and diversification. Resources in this dataset: Resource Title: GeoData catalog record. File Name: Web Page, URL: https://geodata.nal.usda.gov/geonetwork/srv/eng/catalog.search#/metadata/459d2dba-a346-4e54-9750-ef3178c18f38
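The forecast-evaluation step described above (regressing pre-plant forecasts against actual-year simulations and comparing R² and CV) can be sketched as follows; the yield values here are illustrative only, not taken from the dataset:

```python
import numpy as np
from scipy.stats import linregress

# Illustrative yields (kg/ha) for a handful of locations -- not dataset values
actual_weather_yield = np.array([1200.0, 2100.0, 950.0, 3050.0, 1800.0])
feb1_forecast = np.array([1150.0, 2230.0, 900.0, 2900.0, 1750.0])

fit = linregress(feb1_forecast, actual_weather_yield)
r_squared = fit.rvalue ** 2                                    # goodness of the forecast
cv = 100.0 * feb1_forecast.std(ddof=1) / feb1_forecast.mean()  # coefficient of variation, %
print(f"R² = {r_squared:.3f}, CV = {cv:.1f}%")
```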
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive or negative light? What are the most commonly used words and phrases to describe each country? In this dataset I present tweets in which a certain country is mentioned in the hashtags (e.g. #HongKong, #NewZealand). It covers around 150 countries. I've added an additional field called polarity, which holds the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
- The date at which the tweet was created is in the created_at field.
- The search query used to query the Twitter Search Engine is in the search_query field.
- The tweet's full text is in the text field.
- The sentiment is in the polarity field (I've used the VADER model from NLTK to compute this).
There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.
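For reference, here is a minimal sketch of how the polarity field could be computed and duplicate tweet IDs dropped. The per-country file name and the "id" column name are assumptions for the example:

```python
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

tweets = pd.read_csv("HongKong.csv")          # hypothetical per-country file
tweets = tweets.drop_duplicates(subset="id")  # "id" column name is an assumption
tweets["polarity"] = tweets["text"].map(lambda t: sia.polarity_scores(t)["compound"])
print(tweets[["created_at", "text", "polarity"]].head())
```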
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the data lake via AWS, or for other data shenanigans.
Here's an App I built using a live version of this data.
The Small Business Administration maintains the Dynamic Small Business Search (DSBS) database. When a small business registers in the System for Award Management, it has the opportunity to fill out its small business profile; the information provided populates DSBS. DSBS is another tool contracting officers use to identify potential small business contractors for upcoming contracting opportunities. Small businesses can also use DSBS to identify other small businesses for teaming and joint venturing.
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset consists of a review of case studies and descriptions of coral restoration methods from four sources: 1) the primary literature (i.e. published peer-reviewed scientific literature), 2) grey literature (e.g. scientific reports and technical summaries from experts in the field), 3) online descriptions (e.g. blogs and online videos describing projects), and 4) an online survey targeting restoration practitioners (doi:10.5061/dryad.p6r3816).
Only those case studies which actively conducted coral restoration (i.e. at least one stage of scleractinian coral life-history was involved) are included. This excludes indirect coral restoration projects, such as disturbance mitigation (e.g. predator removal, disease control, etc.) and passive restoration interventions (e.g. enforcement of controls against dynamite fishing, or water quality improvement). It also excludes many artificial reefs, in particular where the aim was fisheries enhancement (i.e. fish aggregation devices) or corals were not included in the method. To the best of our abilities, duplication of case studies was avoided across the four separate sources, so that each case in the review and database represents a separate project.
This dataset is currently under embargo until the review manuscript is published.
Methods: More than 40 separate categories of data were recorded from each case study and entered into a database. These included data on (1) the information source, (2) the case study particulars (e.g. location, duration, spatial scale, objectives, etc.), (3) specific details about the methods, (4) coral details (e.g. genus, species, morphology), (5) monitoring details, and (6) the outcomes and conclusions.
Primary literature: Multiple search engines were used to achieve the most complete coverage of the scientific literature. First, the scientific literature was searched using Google Scholar with the keywords “coral* + restoration”. Because the field (and therefore the search results) is dominated by transplantation studies, separate searches were then conducted for other common techniques using “coral* + restoration + [technique name]”. This search was further complemented by using the same keywords in ISI Web of Knowledge (search yield n=738). Studies that fulfilled our criteria for active coral restoration described above were then manually selected (final yield n=221). In cases where a single paper described several different projects or methods, these were split into separate case studies. Finally, prior reviews of coral restoration were consulted to obtain case studies from their reference lists.
Grey literature: While many reports appeared in the Google Scholar literature searches, we also searched The Nature Conservancy (TNC) database of reports for North American coastal restoration projects (http://projects.tnc.org/coastal/). This was supplemented with reports listed in the reference lists of other papers, reports and reviews, or found during the online searches (n=30).
Online records: Small-scale projects conducted without substantial input from researchers, academics, non-governmental organisations (NGOs) or coral reef managers often do not result in formal written accounts of methods. To access this information, we conducted online searches of YouTube, Facebook and Google using the search term “Coral restoration”. The information provided in videos, blog posts and websites describing further projects (n=48) was also used. Due to the unverified nature of such accounts, the data collected from these online-only records were limited compared to the peer-reviewed literature and surveys. At a minimum, the location, the methods used, and the reported outcomes or lessons learned were included in this review.
Online survey: To access information from projects not published elsewhere, we designed an online survey targeting restoration practitioners. The survey consisted of 25 questions querying restoration practitioners about projects they had undertaken, and was conducted under JCU human ethics approval H7218 (following the Australian National Statement on Ethical Conduct in Human Research, 2007). These data (n=63) are included in all calculations within this review, but are not publicly available, to preserve the anonymity of participants. Although we encouraged participants to fill out a separate survey for each case study, it is possible that participants included multiple separate projects in a single survey, which may reduce the real number of case studies reported.
Data analysis: Percentages, counts and other quantifications from the database refer to the total number of case studies with data in that category. Case studies lacking data for the category in question, or lacking appropriate detail (e.g. reporting ‘mixed’ for coral genera), are not included in calculations. Many categories allowed multiple answers (e.g. coral species); these were split into separate records for calculations, as sketched below. For this reason, absolute numbers may exceed the number of case studies in the database; percentages, however, reflect the proportion of case studies in each category. We used the seven objectives outlined in [1] to classify the objective of each case study, with two additional categories (‘scientific research’ and ‘ecological engineering’). We used Tableau to visualise and analyse the database (Desktop Professional Edition, version 10.5, Tableau Software). The data have been made available following the FAIR Guiding Principles for scientific data management and stewardship [2]. The data are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.p6r3816) and can be explored visually at: https://public.tableau.com/views/CoralRestorationDatabase-Visualisation/Coralrestorationmethods?:embed=y&:display_count=yes&publish=yes&:showVizHome=no#1.
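A minimal sketch of this counting convention, with hypothetical column names: multi-answer categories are split ("exploded") into separate records, while percentages stay relative to the case studies that have data for the category:

```python
import pandas as pd

# Hypothetical excerpt of the database; column names are assumptions
db = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "coral_genus": ["Acropora; Pocillopora", "Acropora", None, "Porites"],
})

known = db.dropna(subset=["coral_genus"]).copy()  # drop cases lacking data for the category
known["coral_genus"] = known["coral_genus"].str.split("; ")
records = known.explode("coral_genus")            # one record per answer

# Percentages reflect the proportion of case studies with data in the category
pct = 100 * records.groupby("coral_genus")["case_id"].nunique() / known["case_id"].nunique()
print(pct)
```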
Limitations: While our expanded search enabled us to avoid the bias from the more limited published literature, we acknowledge that using sources that have not undergone rigorous peer-review potentially introduces another bias. Many government reports undergo an informal peer-review; however, survey results and online descriptions may present a subjective account of restoration outcomes. To reduce subjective assessment of case studies, we opted not to interpret results or survey answers, instead only recording what was explicitly stated in each document [3, 4].
Defining restoration: In this review, active restoration methods are methods which reintroduce coral (e.g. coral fragment transplantation, or larval enhancement) or augment coral assemblages (e.g. substrate stabilisation, or algal removal), for the purposes of restoring the reef ecosystem. In the published literature and elsewhere, there are many terms that describe the same intervention. For clarity, we provide the terms we have used in the review, their definitions, and alternative terms (see references). Passive restoration methods such as predator removal (e.g. crown-of-thorns starfish and Drupella control) have been excluded, unless they were conducted in conjunction with active restoration (e.g. macroalgal removal combined with transplantation).
Format: The data are supplied as an Excel file with three separate tabs: 1) peer-reviewed literature, 2) grey literature, and 3) a description of the objectives from Hein et al. 2017. Survey responses have been excluded to preserve the anonymity of the respondents.
This dataset is a database that underpins a 2018 report and a 2019 published review of coral restoration methods from around the world. - Bostrom-Einarsson L, Ceccarelli D, Babcock R.C., Bayraktarov E, Cook N, Harrison P, Hein M, Shaver E, Smith A, Stewart-Sinclair P.J, Vardi T, McLeod I.M. 2018 - Coral restoration in a changing world - A global synthesis of methods and techniques, report to the National Environmental Science Program. Reef and Rainforest Research Centre Ltd, Cairns (63pp.). - The review manuscript is currently under review.
Data Dictionary: The Data Dictionary is embedded in the Excel spreadsheet. Comments are included in the column titles to aid interpretation, and/or refer to additional information tabs. For more information on each column, open the comment indicated by the red triangle (located at the top right of the cell).
References: 1. Hein MY, Willis BL, Beeden R, Birtles A. The need for broader ecological and socioeconomic tools to evaluate the effectiveness of coral restoration programs. Restoration Ecology. 2017;25:873–883. doi:10.1111/rec.12580 2. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3:160018. doi:10.1038/sdata.2016.18 3. Miller RL, Marsh H, Cottrell A, Hamann M. Protecting Migratory Species in the Australian Marine Environment: A Cross-Jurisdictional Analysis of Policy and Management Plans. Front Mar Sci. 2018;5:211. doi:10.3389/fmars.2018.00229 4. Ortega-Argueta A, Baxter G, Hockings M. Compliance of Australian threatened species recovery plans with legislative requirements. Journal of Environmental Management. 2011;92:2054–2060.
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\2018-2021-NESP-TWQ-4\4.3_Best-practice-coral-restoration