Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of "10 blue links" has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive. The Internet Archive has been keeping snapshots, and the respective HTML versions, of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver, for browser automation, to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot.
The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a folder of supporting files. Files are named by concatenating the initial of the search engine (e.g., G for Google) with the capture's timestamp; the filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007, while the first is identified by "G20070330145203".
Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. This graphic represents the diversity of captures by year and search engine (Google and Bing).
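For illustration, such capture identifiers can be parsed back into their parts; the following is a minimal sketch (the "B" initial for Bing is an assumption based on the engines covered):

import re

# Minimal sketch (not part of the dataset) for parsing capture identifiers such as
# "G20070330145203-1": engine initial, 14-digit timestamp, optional sequence number.
CAPTURE_ID = re.compile(r"^(?P<engine>[A-Z])(?P<timestamp>\d{14})(?:-(?P<seq>\d+))?$")

def parse_capture_id(name: str) -> dict:
    """Split a capture identifier into engine initial, timestamp, and sequence."""
    match = CAPTURE_ID.match(name)
    if match is None:
        raise ValueError(f"not a capture identifier: {name!r}")
    return {
        "engine": match.group("engine"),           # e.g. "G" for Google, "B" for Bing (assumed)
        "timestamp": match.group("timestamp"),     # YYYYMMDDHHMMSS as kept by the Internet Archive
        "sequence": int(match.group("seq") or 0),  # 0 for the first capture, 1 for "-1", and so on
    }

print(parse_capture_id("G20070330145203-1"))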
You can check the field descriptions in the documentation: Current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; Historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.
Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.
Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.
Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.
This database is available in JSON format only.
You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data delivered to any other storage or database.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract (our paper)
The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different resource, such as Wikipedia page views, which are open data. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.
Data
personal-name.txt.gz:
The first column is the Wikipedia article id, the second column is the search keyword, the third column is the Wikipedia article title, and the fourth column is the total number of page views from 2008 to 2014.
personal-name_data_google-trends.txt.gz, personal-name_data_wikipedia.txt.gz:
The first column is the period to be collected, the second column is the source (Google or Wikipedia), the third column is the Wikipedia article id, the fourth column is the search keyword, the fifth column is the date, and the sixth column is the value of search trend or page view.
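A minimal sketch for loading both file types with pandas, assuming tab-separated columns in the orders described above (the delimiter is an assumption):

import pandas as pd

# Assumed column layouts, following the descriptions above; the tab delimiter is an assumption.
keywords = pd.read_csv(
    "personal-name.txt.gz", sep="\t", header=None,
    names=["article_id", "keyword", "article_title", "total_page_views"],
)
trends = pd.read_csv(
    "personal-name_data_wikipedia.txt.gz", sep="\t", header=None,
    names=["period", "source", "article_id", "keyword", "date", "value"],
)
print(keywords.head())
print(trends.head())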
Publication
This data set was created for our study. If you make use of this data set, please cite:
Mitsuo Yoshida, Yuki Arase, Takaaki Tsunoda, Mikio Yamamoto. Wikipedia Page View Reflects Web Search Trend. Proceedings of the 2015 ACM Web Science Conference (WebSci '15). no.65, pp.1-2, 2015.
http://dx.doi.org/10.1145/2786451.2786495
http://arxiv.org/abs/1509.02218 (author-created version)
Note
The raw data of Wikipedia page views is available on the following page.
http://dumps.wikimedia.org/other/pagecounts-raw/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles.
Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL of the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; the location is saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
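The extension scheme itself is straightforward to reproduce; the following is a minimal sketch (the fetch_suggestions function is a hypothetical placeholder, not the authors' collection code):

from string import ascii_lowercase

def build_query_inputs(root_term: str) -> list[str]:
    """The root query plus its 26 single-letter extensions (27 inputs in total)."""
    return [root_term] + [f"{root_term} {letter}" for letter in ascii_lowercase]

def fetch_suggestions(query_input: str, search_engine: str) -> list[str]:
    """Hypothetical placeholder: return up to 10 autocomplete suggestions for one input."""
    raise NotImplementedError

root = "democrats"
inputs = build_query_inputs(root)  # ["democrats", "democrats a", ..., "democrats z"]
# Up to 27 inputs x 10 suggestions = 270 suggestions per topic and search engine.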
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
https://dataintelo.com/privacy-and-policy
The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies into search engines is transforming the way search engines operate, offering more personalized and efficient search results, thereby further propelling market growth.
One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.
Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.
The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.
The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.
Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to
For DuckDuckGo login, please visit: DuckDuckGo Login Account
In today’s digital age, privacy has become one of the most valued aspects of online activity. With increasing concerns over data tracking, surveillance, and targeted advertising, users are turning to privacy-first alternatives for everyday browsing. One of the most recognized names in private search is DuckDuckGo. Unlike mainstream search engines, DuckDuckGo emphasizes anonymity and transparency. However, many people wonder: is there such a thing as a DuckDuckGo login account?
In this comprehensive guide, we’ll explore everything you need to know about the DuckDuckGo login account, what it offers (or doesn’t), and how to get the most out of DuckDuckGo’s privacy features.
Does DuckDuckGo Offer a Login Account? To clarify up front: DuckDuckGo does not require or offer a traditional login account like Google or Yahoo. The concept of a DuckDuckGo login account is somewhat misleading if interpreted through the lens of typical internet services.
DuckDuckGo's entire business model is built around privacy. The company does not track users, store personal information, or create user profiles. As a result, there’s no need—or intention—to implement a system that asks users to log in. This stands in stark contrast to other search engines that rely on login-based ecosystems to collect and use personal data for targeted ads.
That said, some users still search for the term DuckDuckGo login account, usually because they’re trying to save settings, sync devices, or use features that may suggest a form of account system. Let’s break down what’s possible and what alternatives exist within DuckDuckGo’s platform.
Saving Settings Without a DuckDuckGo Login Account Even without a traditional DuckDuckGo login account, users can still save their preferences. DuckDuckGo provides two primary ways to retain search settings:
Local Storage (Cookies) When you customize your settings on the DuckDuckGo account homepage, such as theme, region, or safe search options, those preferences are stored in your browser’s local storage. As long as you don’t clear cookies or use incognito mode, these settings will persist.
Cloud Save Feature To cater to users who want to retain settings across multiple devices without a DuckDuckGo login account, DuckDuckGo offers a feature called "Cloud Save." Instead of creating an account with a username or password, you generate a passphrase or unique key. This key can be used to retrieve your saved settings on another device or browser.
While it’s not a conventional login system, it’s the closest DuckDuckGo comes to offering account-like functionality—without compromising privacy.
Why DuckDuckGo Avoids Login Accounts Understanding why there is no DuckDuckGo login account comes down to the company’s core mission: to offer a private, non-tracking search experience. Introducing login accounts would:
Require collecting some user data (e.g., email, password)
Introduce potential tracking mechanisms
Undermine their commitment to full anonymity
By avoiding a login system, DuckDuckGo keeps user trust intact and continues to deliver on its promise of complete privacy. For users who value anonymity, the absence of a DuckDuckGo login account is actually a feature, not a flaw.
DuckDuckGo and Device Syncing One of the most commonly searched reasons behind the term DuckDuckGo login account is the desire to sync settings or preferences across multiple devices. Although DuckDuckGo doesn’t use accounts, the Cloud Save feature mentioned earlier serves this purpose without compromising security or anonymity.
You simply export your settings using a unique passphrase on one device, then import them using the same phrase on another. This offers similar benefits to a synced account—without the need for usernames, passwords, or emails.
DuckDuckGo Privacy Tools Without a Login DuckDuckGo is more than just a search engine. It also offers a range of privacy tools—all without needing a DuckDuckGo login account:
DuckDuckGo Privacy Browser (Mobile): Available for iOS and Android, this browser includes tracking protection, forced HTTPS, and built-in private search.
DuckDuckGo Privacy Essentials (Desktop Extension): For Chrome, Firefox, and Edge, this extension blocks trackers, grades websites on privacy, and enhances encryption.
Email Protection: DuckDuckGo recently launched a service that allows users to create "@duck.com" email addresses that forward to their real email—removing trackers in the process. Users sign up for this using a token or limited identifier, but it still doesn’t constitute a full DuckDuckGo login account.
Is a DuckDuckGo Login Account Needed? For most users, the absence of a DuckDuckGo login account is not only acceptable—it’s ideal. You can:
Use the search engine privately
Customize and save settings
Sync preferences across devices
Block trackers and protect email
—all without an account.
While some people may find the lack of a traditional login unfamiliar at first, it quickly becomes a refreshing break from constant credential requests, data tracking, and login fatigue.
The Future of DuckDuckGo Accounts As of now, DuckDuckGo maintains its position against traditional account systems. However, it’s clear the company is exploring privacy-preserving ways to offer more user features—like Email Protection and Cloud Save. These features may continue to evolve, but the core commitment remains: no tracking, no personal data storage, and no typical DuckDuckGo login account.
Final Thoughts While the term DuckDuckGo login account is frequently searched, it represents a misunderstanding of how the platform operates. Unlike other tech companies that monetize personal data, DuckDuckGo has stayed true to its promise of privacy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.
Arabic news data was collected by web scraping techniques from many famous news sites such as Al-Arabiya, Al-Youm Al-Sabea (Youm7), the news published on the Google search engine and other various sources.
UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.
UltimateArabicPrePros: A file that contains the data mentioned in the first file, but after pre-processing, which reduced it to about 188,000 text documents. Stop words, non-Arabic words, symbols, and numbers have been removed, so this file is ready for direct use in various Arabic natural language processing tasks, such as text classification.
1- Sample: This folder contains samples of the results of the web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:
Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.
2- Dataset Versions: This volume contains four different versions of the original data set, from which the appropriate version can be selected for use in text classification techniques. The first data set (Original) contains the raw data without any pre-processing, so the number of tokens in the first data set is very high. In the second data set (Original_without_Stop) the data was cleaned, for example by removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of tokens is greatly reduced. In the third data set (Original_with_Stem) the data was cleaned, and a text stemming technique was used to remove all prefixes and suffixes that might affect the accuracy of the results and to obtain the word roots. In the fourth version of the dataset (Original_Without_Stop_Stem) all pre-processing techniques, such as data cleaning, stop word removal, and text stemming, were applied, so the number of tokens in the fourth version is the lowest among all releases.
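For orientation, a minimal sketch of this kind of pre-processing pipeline, using NLTK's Arabic stop-word list and ISRI stemmer as assumed stand-ins for the tools actually used:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)
ARABIC_STOPWORDS = set(stopwords.words("arabic"))
STEMMER = ISRIStemmer()

def preprocess(text: str, remove_stopwords: bool = True, stem: bool = True) -> str:
    """Keep Arabic letters only, optionally drop stop words and stem tokens to roots."""
    # Strip symbols, digits, and non-Arabic characters (Arabic Unicode block only).
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in ARABIC_STOPWORDS]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    return " ".join(tokens)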
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
compare how much data is in institutional vs. domain-specific vs. federal platforms
determine which repositories are recommended by top journals that require or recommend the publication of supporting data
ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories.
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review: Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation. See the included README file for descriptions of each individual data file in this dataset.
Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:
(1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing), which take 5 values from 0 (irrelevant) to 4 (perfectly relevant).
(2) The features are basically extracted by us, and are those widely used in the research community.
In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the following columns are features. The larger the value of the relevance label, the more relevant the query-url pair is. A query-url pair is represented by a 136-dimensional feature vector.
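For reference, a minimal sketch for reading such rows, assuming the common LETOR/SVMlight-style encoding ("label qid:<id> 1:<value> 2:<value> ..."); the exact on-disk encoding should be verified against the released files:

import numpy as np

NUM_FEATURES = 136  # each query-url pair is a 136-dimensional feature vector

def parse_row(line: str) -> tuple[int, str, np.ndarray]:
    """Parse one query-url pair: relevance label, query id, feature vector."""
    tokens = line.split()
    label = int(tokens[0])                  # 0 (irrelevant) .. 4 (perfectly relevant)
    qid = tokens[1].split(":", 1)[1]        # "qid:123" -> "123" (assumed prefix)
    features = np.zeros(NUM_FEATURES)
    for token in tokens[2:]:
        index, value = token.split(":", 1)  # "7:0.25" -> feature 7 has value 0.25
        features[int(index) - 1] = float(value)
    return label, qid, features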
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package for our article "The State of Serverless Applications: Collection, Characterization, and Community Consensus" provides everything required to reproduce all results for the following three studies:
Serverless Application Collection
We collect descriptions of serverless applications from open-source projects, academic literature, industrial literature, and scientific computing.
Open-source Applications
As a starting point, we used an existing data set on open-source serverless projects from this study. We removed small and inactive projects based on the number of files, commits, contributors, and watchers. Next, we manually filtered the resulting data set to include only projects that implement serverless applications. We provide a table containing all projects that remained after the filtering alongside the notes from the manual filtering.
Academic Literature Applications
We based our search on an existing community-curated dataset on literature for serverless computing consisting of over 180 peer-reviewed articles. First, we filtered the articles based on title and abstract. In a second iteration, we filtered out any articles that implement only a single function for evaluation purposes or do not include sufficient detail to enable a review. As the authors were familiar with some additional publications describing serverless applications, we contributed them to the community-curated dataset and included them in this study. We provide a table with our notes from the manual filtering.
Scientific Computing Applications
Most of these scientific computing serverless applications are still at an early stage and therefore there is little public data available. One of the authors is employed at the German Aerospace Center (DLR) at the time of writing, which allowed us to collect information about several projects at DLR that are either currently moving to serverless solutions or are planning to do so. Additionally, an application from the German Electron Synchrotron (DESY) could be included. For each of these scientific computing applications, we provide a document containing a description of the project and the names of our contacts that provided information for the characterization of these applications.
Collection of serverless applications
Based on the previously described methodology, we collected a diverse dataset of 89 serverless applications from open-source projects, academic literature, industrial literature, and scientific computing. This dataset can be found in Dataset.xlsx.
Serverless Application Characterization
As previously described, we collected 89 serverless applications from four different sources. Subsequently, two randomly assigned reviewers out of seven available reviewers characterized each application along 22 characteristics in a structured collaborative review sheet. The characteristics and potential values were defined a priori by the authors and iteratively refined, extended, and generalized during the review process. The initial moderate inter-rater agreement was followed by a discussion and consolidation phase, where all differences between the two reviewers were discussed and resolved. The six scientific applications were not publicly available and therefore characterized by a single domain expert, who is either involved in the development of the applications or in direct contact with the development team.
Initial Ratings & Interrater Agreement Calculation
The initial reviews are available as a table, where every application is characterized along the 22 characteristics. A single value indicates that both reviewers assigned the same value, whereas a value of the form "[Reviewer 2] A | [Reviewer 4] B" indicates that, for this characteristic, reviewer two assigned the value A and reviewer four assigned the value B.
Our script for the calculation of the Fleiss' kappa score based on this data is also publicly available. It requires the Python packages pandas and statsmodels. It does not require any input and assumes that the file Initial Characterizations.csv is located in the same folder. It can be executed as follows:
python3 CalculateKappa.py
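For orientation, Fleiss' kappa for a subjects-by-raters table can be computed with statsmodels roughly as follows; this is a minimal sketch with toy data, not the authors' CalculateKappa.py, and the actual CSV layout may differ:

import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical reshaping: one row per (application, characteristic), one column per reviewer.
ratings = pd.DataFrame(
    {"reviewer_a": ["A", "B", "A", "C"], "reviewer_b": ["A", "B", "C", "C"]}
)

# aggregate_raters turns subject x rater labels into subject x category counts.
table, categories = aggregate_raters(ratings.to_numpy())
print(fleiss_kappa(table, method="fleiss"))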
Results Including Unknown Data
In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find.
For six characteristics, many applications were assigned the "Unknown" value, i.e., the reviewers were not able to determine the value of this characteristic. Therefore, we excluded these characteristics from this study. For the remaining characteristics, the percentage of "Unknowns" ranges from 0–19% with two outliers at 25% and 30%. These "Unknowns" were excluded from the percentage values presented in the article. As part of our replication package, we provide the raw results for each characteristic including the "Unknown" percentages in the form of bar charts.
The script for the generation of these bar charts is also part of this replication package. It uses the Python packages pandas, numpy, and matplotlib. It does not require any input and assumes that the file Dataset.csv is located in the same folder. It can be executed as follows:
python3 GenerateResultsIncludingUnknown.py
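A rough sketch of what such a per-characteristic bar chart could look like with pandas and matplotlib; the characteristic name below is hypothetical, and the released GenerateResultsIncludingUnknown.py remains the authoritative version:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("Dataset.csv")

characteristic = "Trigger Type"  # hypothetical characteristic/column name
counts = df[characteristic].fillna("Unknown").value_counts()
share = 100 * counts / counts.sum()  # percentages including the "Unknown" bucket

share.plot.bar()
plt.ylabel("Share of applications [%]")
plt.title(characteristic)
plt.tight_layout()
plt.savefig(f"{characteristic}.png")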
Final Dataset & Figure Generation
In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find. Following this process, we were able to resolve all conflicts, resulting in a collection of 89 applications described by 18 characteristics. This dataset is available here: link
The script to generate all figures shown in the chapter "Serverless Application Characterization" can be found here. It does not require any input but assumes that the file Dataset.csv is located in the same folder. It uses the Python packages pandas, numpy, and matplotlib. It can be executed as follows:
python3 GenerateFigures.py
Comparison Study
To identify existing surveys and datasets that also investigate one of our characteristics, we conducted a literature search using Google as our search engine, as we were mostly looking for grey literature. We used the following search term:
("serverless" OR "faas") AND ("dataset" OR "survey" OR "report") after: 2018-01-01
This search term looks for any combination of either serverless or faas alongside any of the terms dataset, survey, or report. We further limited the search to any articles after 2017, as serverless is a fast-moving field and therefore any older studies are likely outdated already. This search term resulted in a total of 173 search results. In order to validate if using only a single search engine is sufficient, and if the search term is broad enough, we
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ag_news_subset', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exact pattern matching algorithms are popular and widely used in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems, and operating systems. The focus of these algorithms is to achieve time efficiency according to the application, but not memory consumption. In this work, we propose a novel idea to achieve both time efficiency and low memory consumption by splitting the query string for searching in the corpus. For a given text, the proposed algorithm splits the query pattern into two equal halves and considers the second (right) half as a query string for searching in the corpus. Once a match is found for the second half, the proposed algorithm applies a brute-force procedure to find the remaining match by referring to the location of the right half. Experimental results on different S1 Dataset text databases, namely Arabic, English, Chinese, Italian, and French, show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
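A minimal sketch of the described idea (split the pattern, scan the text for its right half, then verify the remaining left half by brute force at each hit); this illustrates the strategy and is not the authors' implementation:

def split_half_search(text: str, pattern: str) -> list[int]:
    """Return start positions of pattern in text, searching by its right half first."""
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    matches = []
    pos = text.find(right, mid)          # the right half cannot start before index mid
    while pos != -1:
        start = pos - mid
        if text[start:pos] == left:      # brute-force check of the remaining (left) half
            matches.append(start)
        pos = text.find(right, pos + 1)
    return matches

print(split_half_search("abracadabra", "cada"))  # [4]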
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bridges-Rail in the United States
According to The National Bridge Inspection Standards published in the Code of Federal Regulations (23 CFR 650.3), a bridge is: A structure including supports erected over a depression or an obstruction, such as water, highway, or railway, and having a track or passageway for carrying traffic or other moving loads. Each bridge was captured as a point which was placed in the center of the "main span" (highest and longest span). For bridges that cross navigable waterways, this was typically the part of the bridge over the navigation channel. If no "main span" was discernable using the imagery sources available, or if multiple non-contiguous main spans were discernable, the point was placed in the center of the overall structure. Bridges that are sourced from the National Bridge Inventory (NBI) that cross state boundaries are an exception. Bridges that cross state boundaries are represented in the NBI by two records. The points for the two records have been located so as to be within the state indicated by the NBI's [STATE_CODE] attribute. In some cases, following these rules did not place the point at the location at which the bridge crosses what the user may judge as the most important feature intersected. For example, a given bridge may be many miles long, crossing nothing more than low lying ground for most of its length but crossing a major interstate at its far end. Due to the fact that bridges are often high narrow structures crossing depressions that may or may not be too narrow to be represented in the DEM used to orthorectify a given source of imagery, alignment with ortho imagery is highly variable. In particular, apparent bridge location in ortho imagery is highly dependent on collection angle. During verification, TechniGraphics used imagery from the following sources: NGA HSIP 133 City, State or Local; NAIP; DOQQ imagery. In cases where "bridge sway" or "tall structure lean" was evident, TGS attempted to compensate for these factors when capturing the bridge location. For instances in which the bridge was not visible in imagery, it was captured using topographic maps at the intersection of the water and rail line. TGS processed 784 entities previously with the HSIP Bridges-Roads (STRAHNET Option - HSIP 133 Cities and Gulf Coast). These entities were added into this dataset after processing. No entities were included in this dataset for American Samoa, Guam, Hawaii, the Commonwealth of the Northern Mariana Islands, or the Virgin Islands because there are no main line railways in these areas. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, leading and trailing spaces were trimmed from all text fields. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is given by the publication date which is 09/02/2009. A more precise measure of currentness cannot be provided since this is dependent on the NBI and the source of imagery used during processing.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
911 Public Safety Answering Point (PSAP) service area boundaries in the United States
According to the National Emergency Number Association (NENA), a Public Safety Answering Point (PSAP) is a facility equipped and staffed to receive 9-1-1 calls. The service area is the geographic area within which a 911 call placed using a landline is answered at the associated PSAP. This dataset only includes primary PSAPs. Secondary PSAPs, backup PSAPs, and wireless PSAPs have been excluded from this dataset. Primary PSAPs receive calls directly, whereas secondary PSAPs receive calls that have been transferred by a primary PSAP. Backup PSAPs provide service in cases where another PSAP is inoperable. Most military bases have their own emergency telephone systems. To connect to such a system from within a military base, it may be necessary to dial a number other than 9-1-1. Due to the sensitive nature of military installations, TGS did not actively research these systems. If civilian authorities in surrounding areas volunteered information about these systems, or if adding a military PSAP was necessary to fill a hole in civilian provided data, TGS included it in this dataset. Otherwise, military installations are depicted as being covered by one or more adjoining civilian emergency telephone systems. In some cases, areas are covered by more than one PSAP boundary. In these cases, any of the applicable PSAPs may take a 911 call. Where a specific call is routed may depend on how busy the applicable PSAPs are (i.e., load balancing), operational status (i.e., redundancy), or time of day / day of week. If an area does not have 911 service, TGS included that area in the dataset along with the address and phone number of their dispatch center. These are areas where someone must dial a 7 or 10 digit number to get emergency services. These records can be identified by a "Y" in the [NON911EMNO] field. This indicates that dialing 911 inside one of these areas does not connect one with emergency services. This dataset was constructed by gathering information about PSAPs from state level officials. In some cases, this was geospatial information; in other cases, it was tabular. This information was supplemented with a list of PSAPs from the Federal Communications Commission (FCC). Each PSAP was researched to verify its tabular information. In cases where the source data was not geospatial, each PSAP was researched to determine its service area in terms of existing boundaries (e.g., city and county boundaries). In some cases, existing boundaries had to be modified to reflect coverage areas (e.g., "entire county north of Country Road 30"). However, there may be cases where minor deviations from existing boundaries are not reflected in this dataset, such as the case where a particular PSAP's coverage area includes an entire county plus the homes and businesses along a road which is partly in another county. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.
Homeland Security Use Cases: Use cases describe how the data may be used and help to define and clarify requirements.
1) A disaster has struck, or is predicted for, a locality. The PSAP that may be affected must be identified and verified to be operational.
2) In the event that the local PSAP is inoperable, adjacent PSAP locations could be identified and utilized.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Fashion Magazine Library Management: Operators of a large fashion magazine library can use the VOGUE_PK model to catalog their extensive collection. It can help to classify different editions by issue date, identify styles from specific stylists or designers, and even recognize featured models. This would simplify the process of finding specific issues or fashion styles.
Style Tracking and Analysis: Fashion researchers, analysts, and enthusiasts could use this model to track and analyze the evolution of styles by a particular designer or stylist over time. By identifying the designer or stylist in multiple issues, users can study trends, predict future fashion movements, or create comprehensive style portfolios.
Education and Training: Fashion design students or professionals could use this model as a learning tool to study and analyze the distinct characteristics of various famous designers and stylists' work in different issue dates.
Image-Based Fashion Search Engines: The "VOGUE_PK" model can be instrumental in constructing a powerful image-based search engine. Users could upload an image and receive similar styles, designers, models, and the specific stylist involved in those similar styles.
Content Creation: Fashion content creators, such as bloggers and journalists, can use the model to easily identify the key details about images they're using in articles, posts, or other content. The model can help to ensure that designer, model, stylist, and issue date are correctly attributed.
Most bottom-up proteomics experiments share two features: the use of trypsin to digest proteins for mass spectrometry and the statistics-driven matching of the measured peptide fragment spectra against in silico generated spectra derived from a protein database. While this extremely powerful approach, in combination with latest-generation mass spectrometers, facilitates very deep proteome coverage, the assumptions made have to be met to generate meaningful results. One of these assumptions is that the measured spectra indeed have a match in the search space, since the search engine will always report the best match. However, one of the most abundant proteins in the sample, the protease, is often not represented in the employed database. It is therefore widely accepted in the community to include the protease and other common contaminants in the database to avoid false positive matches. Although this approach accounts for unmodified trypsin peptides, the most widely employed trypsin preparations are chemically modified to prevent autolysis and premature activity loss of the protease. In this study we observed numerous spectra of modified trypsin-derived peptides in samples from our laboratory as well as in datasets downloaded from public repositories. In many cases the spectra were assigned to other proteins, often with good statistical significance. We therefore designed a new database search strategy employing an artificial amino acid, which accounts for these peptides with a minimal increase in search space and minimal concomitant loss of statistical significance. Moreover, this approach can be easily implemented into existing workflows for many widely used search engines.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This readme file was generated on 2024-01-15 by Salvatore Romano
GENERAL INFORMATION
Title of Dataset:
A dataset to assess Microsoft Copilot Answers in the Context of Swiss, Bavarian and Hesse Elections.
Author/Principal Investigator Information
Name: Salvatore Romano
ORCID: 0000-0003-0856-4989
Institution: Universitat Oberta de Catalunya, AID4So.
Address: Rambla del Poblenou, 154. 08018 Barcelona.
Email: salvatore@aiforensics.org
Author/Associate or Co-investigator Information
Name: Riccardo Angius
ORCID: 0000-0003-0291-3332
Institution: Ai Forensics
Address: Paris, France.
Email: riccardo@aiforensics.org
Date of data collection:
from 2023-09-21 to 2023-10-02.
Geographic location of data collection:
Switzerland and Germany.
Information about funding sources that supported the collection of the data:
The data collection and analysis was supported by AlgorithmWatch's DataSkop project, funded by Germany’s Federal Ministry of Education and Research (BMBF) as part of the program “Mensch-Technik-Interaktion” (human-technology interaction). dataskop.net
In Switzerland, the investigation was realized with the support of Stiftung Mercator Schweiz.
AI Forensics contribution was supported in part by the Open Society Foundations.
AI Forensics data collection infrastructure is supported by the Bright Initiative.
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data:
This publication is licensed under a Creative Commons Attribution 4.0 International License.
https://creativecommons.org/licenses/by/4.0/deed.en
Links to publications that cite or use the data:
https://aiforensics.org//uploads/AIF_AW_Bing_Chat_Elections_Report_ca7200fe8d.pdf
Links to other publicly accessible locations of the data:
NA
Links/relationships to ancillary data sets:
NA
Was data derived from another source?
NA
If yes, list source(s):
Recommended citation for this dataset:
S. Romano, R. Angius, N. Kerby, P. Bouchaud, J. Amidei, A. Kaltenbrunner. 2024. A dataset to assess Microsoft Copilot Answers in the Context of Swiss, Bavarian and Hesse Elections. https://aiforensics.org//uploads/AIF_AW_Bing_Chat_Elections_Report_ca7200fe8d.pdf
DATA & FILE OVERVIEW
File List:
Microsof-Copilot-Answers_in-Swiss-Bavarian-Hess-Elections.csv
The only dataset for this research. It includes rows with prompts and responses from Microsoft Copilot, along with associated metadata for each entry.
Relationship between files, if important:
NA
Additional related data collected that was not included in the current data package:
NA
Are there multiple versions of the dataset?
NA
If yes, name of file(s) that was updated:
Why was the file updated?
When was the file updated?
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data:
In our algorithmic auditing research, we opted for a sock-puppet audit methodology (Sandvig et al., 2014). This method aligns with the growing interdisciplinary focus on algorithm audits, which prioritize fairness, accountability, and transparency to uncover biases in algorithmic systems (Bandy, 2021). Sock-puppet auditing offers a fully controlled environment in which to understand the behavior of the system.
Every sample was collected by running a new browser instance connected to the internet via a network of VPNs and residential IPs based in Switzerland and Germany, then accessing Microsoft Copilot through its official URL. Every time, the settings for Language and Country/Region were set to match those of potential voters from the respective regions (English, German, French, or Italian, and Switzerland or Germany). We did not simulate any form of user history or additional personalization. Importantly, Microsoft Copilot's default settings remained unchanged, ensuring that all interactions occurred with the "Conversation Style" set to "Balanced".
Sandvig, C.; Hamilton, K.; Karahalios, K.; and Langbort, C. 2014. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry, 22(2014): 4349–4357.
Bandy, J. 2021. Problematic machine behavior: A systematic literature review of algorithm audits. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1): 1–34.
Methods for processing the data:
The process involved analyzing the HTML code of the web pages that were accessed. During this examination, key metadata were identified and extracted from the HTML structure. Once this information was successfully extracted, the rest of the HTML page, which primarily consisted of code and elements not pertinent to the needed information, was discarded. This approach ensured that only the most relevant and useful data was retained, while all unnecessary and extraneous HTML components were efficiently removed, streamlining the data collection and analysis process.
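As a rough illustration of this kind of extraction step (the tag names and class attributes below are hypothetical placeholders, not the actual Copilot page structure):

from bs4 import BeautifulSoup

def extract_answer(html: str) -> dict:
    """Pull the answer text and its cited links out of a saved page; selectors are hypothetical."""
    soup = BeautifulSoup(html, "html.parser")
    answer_node = soup.find("div", attrs={"class": "answer"})       # hypothetical selector
    citations = soup.find_all("a", attrs={"class": "attribution"})  # hypothetical selector
    return {
        "answer": answer_node.get_text(" ", strip=True) if answer_node else None,
        "attribution_links": [a.get("href") for a in citations],
    }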
Instrument- or software-specific information needed to interpret the data:
NA
Standards and calibration information, if appropriate:
NA
Environmental/experimental conditions:
NA
Describe any quality-assurance procedures performed on the data:
NA
People involved with sample collection, processing, analysis and/or submission:
Salvatore Romano, Riccardo Angius, Natalie Kerby, Paul Bouchaud, Jacopo Amidei, Andreas Kaltenbrunner.
DATA-SPECIFIC INFORMATION FOR:
Microsof-Copilot-Answers_in-Swiss-Bavarian-Hess-Elections.csv
Number of variables:
33
Number of cases/rows:
5562
Variable List:
prompt - (object) Text of the prompt.
answer - (object) Text of the answer.
country - (object) Country information.
language - (object) Language of the text.
input_conversation_id - (object) Identifier for the conversation.
conversation_group_ids - (object) Group IDs for the conversation.
conversation_group_names - (object) Group names for the conversation.
experiment_id - (object) Identifier for the experiment group.
experiment_name - (object) Name of the experiment group.
begin - (object) Start time.
end - (object) End time.
datetime - (int64) Datetime stamp.
week - (int64) Week number.
attributions - (object) Link quoted in the text.
attribution_links - (object) Links for attributions.
search_query - (object) Search query used by the chatbot.
unlabelled - (int64) Unlabelled flag.
exploratory_sample - (int64) Exploratory sample flag.
very_relevant - (int64) Very relevant flag.
needs_review - (int64) Needs review flag.
misleading_factual_error - (int64) Misleading factual error flag.
nonsense_factual_error - (int64) Nonsense factual error flag.
rejects_question_framing - (int64) Rejects question framing flag.
deflection - (int64) Deflection flag.
shield - (int64) Shield flag.
wrong_answer_language - (int64) Wrong answer language flag.
political_imbalance - (int64) Political imbalance flag.
refusal - (int64) Refusal flag.
factual_error - (int64) Factual error flag.
evasion - (int64) Evasion flag.
absolutely_accurate - (int64) Absolutely accurate flag.
macrocategory - (object) Macro-category of the content.
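For orientation, the snippet below shows one way to load the file with pandas and tally the annotation flags. The column names follow the variable list above; the file location in the working directory is an assumption.

```python
# Minimal loading sketch, assuming the CSV sits in the working directory.
import pandas as pd

df = pd.read_csv("Microsof-Copilot-Answers_in-Swiss-Bavarian-Hess-Elections.csv")
print(df.shape)  # expected: (5562, 33)

flag_cols = ["misleading_factual_error", "nonsense_factual_error", "deflection",
             "shield", "refusal", "evasion", "absolutely_accurate"]
print(df[flag_cols].sum().sort_values(ascending=False))  # how often each flag was set
print(df.groupby(["country", "language"]).size())        # samples per audit condition
```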
Missing data codes:
NA
Specialized formats or other abbreviations used:
NA
https://creativecommons.org/publicdomain/zero/1.0/
Microsoft is an American company that develops and distributes software and services such as the Bing search engine, cloud solutions, and the Windows operating system.
Market capitalization of Microsoft (MSFT)
Market cap: $3.085 Trillion USD
As of February 2025, Microsoft has a market cap of $3.085 Trillion USD, making it the world's 2nd most valuable company by market cap according to our data. Market capitalization, commonly called market cap, is the total market value of a publicly traded company's outstanding shares (share price multiplied by the number of shares outstanding) and is commonly used to measure how much a company is worth.
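As a worked example of that definition, with an assumed share count and share price (illustrative figures only, not taken from a filing):

```python
# Illustrative market-cap arithmetic; both inputs are assumptions for the example.
shares_outstanding = 7.43e9          # assumed number of shares outstanding
share_price = 415.0                  # assumed price per share, USD
market_cap = shares_outstanding * share_price
print(f"${market_cap / 1e12:.3f} Trillion USD")  # ~ $3.08 Trillion, in line with the figure above
```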
Revenue for Microsoft (MSFT)
Revenue in 2024 (TTM): $254.19 Billion USD
According to Microsoft's latest financial reports, the company's current revenue (TTM) is $254.19 Billion USD. In 2023 the company had a revenue of $227.58 Billion USD, an increase over its 2022 revenue of $204.09 Billion USD. Revenue is the total amount of income that a company generates from the sale of goods or services; unlike earnings, no expenses are subtracted.
Earnings for Microsoft (MSFT)
Earnings in 2024 (TTM): $110.77 Billion USD
According to Microsoft's latest financial reports, the company's current earnings (TTM) are $110.77 Billion USD. In 2023 the company had earnings of $101.21 Billion USD, an increase over its 2022 earnings of $82.58 Billion USD. The earnings displayed on this page are earnings before interest and taxes, or simply EBIT.
End of Day market cap according to different sources
On Feb 2nd, 2025 the market cap of Microsoft was reported to be:
$3.085 Trillion USD by Nasdaq
$3.085 Trillion USD by CompaniesMarketCap
$3.085 Trillion USD by Yahoo Finance
Geography: USA
Time period: March 1986 – February 2025
Unit of analysis: Microsoft Stock Data 2025
Variable | Description |
---|---|
date | The trading date. |
open | The price at market open. |
high | The highest price for that day. |
low | The lowest price for that day. |
close | The price at market close, adjusted for splits. |
adj_close | The closing price after adjustments for all applicable splits and dividend distributions. Data is adjusted using appropriate split and dividend multipliers, adhering to Center for Research in Security Prices (CRSP) standards. |
volume | The number of shares traded on that day. |
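As a usage sketch, the snippet below assumes the table above ships as a CSV file named "MSFT.csv" (hypothetical name) and derives daily returns from adj_close, which already folds in splits and dividends.

```python
# Minimal sketch for loading the price table and computing total returns;
# the file name "MSFT.csv" is an assumption.
import pandas as pd

df = pd.read_csv("MSFT.csv", parse_dates=["date"]).set_index("date").sort_index()

# adj_close accounts for splits and dividend distributions, so returns come straight from it
df["daily_return"] = df["adj_close"].pct_change()
print(df["daily_return"].describe())
print(df.loc["2024", ["close", "adj_close", "volume"]].head())  # spot-check a recent year
```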
This dataset belongs to me. I’m sharing it here for free. You may do with it as you wish.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
scicite is a dataset of labeled scholarly citations extracted from scientific articles spanning multiple fields, including computer science, biomedicine, and ecology. It is organized into easily digestible columns, including string, sectionName, label, isKeyCitation, label2, and more, so that relevant pieces of information, from sources to key citations to the start/end indices of each citation, can be pinpointed quickly.
This dataset consists of three CSV files, each containing different elements related to scholarly citations gathered from scientific articles: train.csv, test.csv and validation.csv. These can be used in a variety of ways in order to gain insight into the research process and improve its accuracy and efficiency.
Extracting useful information from citations: The labels attached to each citation can help extract specific information about the sources cited or other data included for research purposes. Additionally, isKeyCitation indicates whether the cited source is a key citation, which researchers or practitioners could examine in greater detail.
Identifying relationships between citations: scicite's sectionName column identifies the part of a paper (e.g. introduction or abstract) in which a citation appears, enabling the identification of potential relationships between these sections and the references found within them, and thus a better understanding of the connections scholars have made in earlier research.
Improving accuracy in data gathering: With the string, citeStart and citeEnd columns, together with the source labels, one can identify whether certain references are repeated multiple times and double-check accuracy through the start/end indices associated with each citation.
Validation purposes: Finally, the dataset can be used to validate documents written by scholars for peer review, where similar sections found in prior, unrelated documents can serve as reference points that should match, signaling the correctness of the original authors' citations.
- Developing a search engine to quickly find citations relevant to specific topics and research areas.
- Creating algorithms that can predict key citations and streamline the research process by automatically including only the most important references in a paper.
- Designing AI systems that can accurately classify, analyze and summarize different scholarly works based on citation frequency, source type, and assigned label.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:---|:---|
| string | The string of text associated with the citation. (String) |
| sectionName | The name of the section the citation is found in. (String) |
| label | The label associated with the citation. (String) |
| isKeyCitation | A boolean value indicating whether the citation is a key citation. (Boolean) |
| label2 | The second label associated with the citation. (String) |
| citeEnd | The end index of the citation in the text. (Integer) |
| citeStart | The start index of the citation in the text. (Integer) |
| source | The source of the citation. (String) |
...
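A minimal loading sketch, assuming validation.csv is available locally with the columns listed above (citeStart/citeEnd may be missing for some rows):

```python
# Sketch: load the validation split and recover one cited span from its indices.
import pandas as pd

val = pd.read_csv("validation.csv")
print(val["label"].value_counts())   # distribution of citation-intent labels
print(val["isKeyCitation"].mean())   # share of key citations

# Slice the cited span out of the surrounding text using the start/end indices
row = val.dropna(subset=["citeStart", "citeEnd"]).iloc[0]
print(row["string"][int(row["citeStart"]):int(row["citeEnd"])])
```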
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset consists of a review of case studies and descriptions of coral restoration methods from four sources: 1) the primary literature (i.e. published peer-reviewed scientific literature), 2) grey literature (e.g. scientific reports and technical summaries from experts in the field), 3) online descriptions (e.g. blogs and online videos describing projects), and 4) an online survey targeting restoration practitioners (doi:10.5061/dryad.p6r3816).
Included are only those case studies which actively conducted coral restoration (i.e. at least one stage of scleractinian coral life-history was involved). This excludes indirect coral restoration projects, such as disturbance mitigation (e.g. predator removal, disease control etc.) and passive restoration interventions (e.g. enforcement of control against dynamite fishing or water quality improvement). It also excludes many artificial reefs, in particular if the aim was fisheries enhancement (i.e. fish aggregation devices), and if corals were not included in the method. To the best of our abilities, duplication of case studies was avoided across the four separate sources, so that each case in the review and database represents a separate project.
This dataset is currently under embargo until the review manuscript is published.
Methods: More than 40 separate categories of data were recorded from each case study and entered into a database. These included data on (1) the information source, (2) the case study particulars (e.g. location, duration, spatial scale, objectives, etc.), (3) specific details about the methods, (4) coral details (e.g. genus, species, morphology), (5) monitoring details, and (6) the outcomes and conclusions.
Primary literature Multiple search engines were used to achieve the most complete coverage of the scientific literature. First, the scientific literature was searched using Google Scholar with the keywords “coral* + restoration”. Because the field (and therefore the search results) is dominated by transplantation studies, separate searches were then conducted for other common techniques using “coral* + restoration + [technique name]”. This search was further complemented by using the same keywords in ISI Web of Knowledge (search yield n=738). Studies were then manually selected that fulfilled our criteria for active coral restoration described above (final yield n=221). In those cases where a single paper describes several different projects or methods, these were split into separate case studies. Finally, prior reviews of coral restoration were consulted to obtain case studies from their reference lists.
Grey literature While many reports appeared in the Google Scholar literature searches, a search of The Nature Conservancy (TNC) database of reports for North American coastal restoration projects (http://projects.tnc.org/coastal/) was also conducted. This was supplemented with reports listed in the reference lists of other papers, reports and reviews, and found during the online searches (n=30).
Online records Small-scale projects conducted without substantial input from researchers, academics, non-governmental organisations (NGOs) or coral reef managers often do not result in formal written accounts of methods. To access this information, we conducted online searches of YouTube, Facebook and Google using the search term “Coral restoration”. Information provided in videos, blog posts and websites describing further projects (n=48) was also used. Due to the unverified nature of such accounts, the data collected from these online-only records were limited compared to the peer-reviewed literature and surveys. At a minimum, the location, the methods used, and the reported outcomes or lessons learned were included in this review.
Online survey To access information from projects not published elsewhere, an online survey targeting restoration practitioners was designed. The survey, conducted under JCU human ethics approval H7218 (following the Australian National Statement on Ethical Conduct in Human Research, 2007), consisted of 25 questions querying restoration practitioners about projects they had undertaken. These data (n=63) are included in all calculations within this review, but are not publicly available, to preserve the anonymity of participants. Although we encouraged participants to fill out a separate survey for each case study, it is possible that participants included multiple separate projects in a single survey, which may reduce the real number of case studies reported.
Data analysis Percentages, counts and other quantifications from the database refer to the total number of case studies with data in that category. Case studies where data were lacking for the category in question, or lacked appropriate detail (e.g. reporting ‘mixed’ for coral genera), are not included in calculations. Many categories allowed multiple answers (e.g. coral species); these were split into separate records for calculations (e.g. coral species n). For this reason, absolute numbers may exceed the number of case studies in the database. However, percentages reflect the proportion of case studies in each category. We used the seven objectives outlined in [1] to classify the objective of each case study, with an additional two categories (‘scientific research’ and ‘ecological engineering’). We used Tableau to visualise and analyse the database (Desktop Professional Edition, version 10.5, Tableau Software). The data have been made available following the FAIR Guiding Principles for scientific data management and stewardship [2]. The data are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.p6r3816) and can be explored visually at: https://public.tableau.com/views/CoralRestorationDatabase-Visualisation/Coralrestorationmethods?:embed=y&:display_count=yes&publish=yes&:showVizHome=no#1.
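As an illustration of this tallying logic, the sketch below assumes the database has been exported to a CSV with a hypothetical semicolon-separated "coral_genus" column; it excludes missing or ‘mixed’ entries and splits multi-valued answers before counting, so absolute counts can exceed the number of case studies.

```python
# Illustrative sketch of the per-category tallying; file and column names are assumptions.
import pandas as pd

db = pd.read_csv("coral_restoration_case_studies.csv")

# Case studies lacking data or appropriate detail for the category are excluded
genus = db["coral_genus"].dropna()
genus = genus[genus.str.lower() != "mixed"]

# Multiple answers per case study are split into separate records before counting
exploded = genus.str.split(";").explode().str.strip()
counts = exploded.value_counts()
print(counts)
print((counts / len(genus) * 100).round(1))  # percentages relative to case studies with data
```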
Limitations: While our expanded search enabled us to avoid the bias from the more limited published literature, we acknowledge that using sources that have not undergone rigorous peer-review potentially introduces another bias. Many government reports undergo an informal peer-review; however, survey results and online descriptions may present a subjective account of restoration outcomes. To reduce subjective assessment of case studies, we opted not to interpret results or survey answers, instead only recording what was explicitly stated in each document [3, 4].
Defining restoration In this review, active restoration methods are methods which reintroduce coral (e.g. coral fragment transplantation, or larval enhancement) or augment coral assemblages (e.g. substrate stabilisation, or algal removal), for the purposes of restoring the reef ecosystem. In the published literature and elsewhere, there are many terms that describe the same intervention. For clarity, we provide the terms we have used in the review, their definitions and alternative terms (see references). Passive restoration methods such as predator removal (e.g. crown-of-thorns starfish and Drupella control) have been excluded, unless they were conducted in conjunction with active restoration (e.g. macroalgal removal combined with transplantation).
Format: The data is supplied as an Excel file with three separate tabs for 1) peer-reviewed literature, 2) grey literature, and 3) a description of the objectives from Hein et al. 2017. Survey responses have been excluded to preserve the anonymity of the respondents.
This dataset is a database that underpins a 2018 report and 2019 published review of coral restoration methods from around the world. - Bostrom-Einarsson L, Ceccarelli D, Babcock R.C., Bayraktarov E, Cook N, Harrison P, Hein M, Shaver E, Smith A, Stewart-Sinclair P.J, Vardi T, McLeod I.M. 2018 - Coral restoration in a changing world - A global synthesis of methods and techniques, report to the National Environmental Science Program. Reef and Rainforest Research Centre Ltd, Cairns (63pp.). - Review manuscript is currently under review.
Data Dictionary: The Data Dictionary is embedded in the Excel spreadsheet. Comments are included in the column titles to aid interpretation and/or refer to additional information tabs. For more information on each column, open the comment marked by the red triangle (located at the top right of the cell).
References:
1. Hein MY, Willis BL, Beeden R, Birtles A. The need for broader ecological and socioeconomic tools to evaluate the effectiveness of coral restoration programs. Restoration Ecology. Wiley/Blackwell; 2017;25: 873–883. doi:10.1111/rec.12580
2. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. Nature Publishing Group; 2016;3: 160018. doi:10.1038/sdata.2016.18
3. Miller RL, Marsh H, Cottrell A, Hamann M. Protecting Migratory Species in the Australian Marine Environment: A Cross-Jurisdictional Analysis of Policy and Management Plans. Front Mar Sci. Frontiers; 2018;5: 211. doi:10.3389/fmars.2018.00229
4. Ortega-Argueta A, Baxter G, Hockings M. Compliance of Australian threatened species recovery plans with legislative requirements. Journal of Environmental Management. Elsevier; 2011;92: 2054–2060.
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\2018-2021-NESP-TWQ-4\4.3_Best-practice-coral-restoration