Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive. The Internet Archive has been keeping snapshots and the respective HTML versions of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver for browser automation to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot. The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a files folder. For file naming, we concatenate the initial of the search engine (G) with the capture's timestamp. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007; the first is identified by "G20070330145203". Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. The accompanying graphic represents the diversity of captures by year and search engine (Google and Bing).
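The extraction procedure above (visit a capture, validate it, save the HTML, take a full-page screenshot, and name the files by engine initial plus timestamp) can be sketched as follows. This is a minimal illustration, not the authors' script: the Wayback Machine URL pattern, the example query, and the validity heuristic are assumptions.

```python
import os

def capture_name(timestamp, seq=None):
    """Search-engine initial + capture timestamp, with a '-N' suffix for repeats."""
    return f"G{timestamp}" + (f"-{seq}" if seq is not None else "")

def save_capture(timestamp, query="news", out_dir="captures", seq=None):
    """Visit a Wayback Machine capture of a Google SERP and save its HTML
    plus a full-page screenshot, mirroring the procedure described above."""
    # Selenium is imported lazily; requires a local chromedriver install.
    from selenium import webdriver

    url = f"https://web.archive.org/web/{timestamp}/https://www.google.com/search?q={query}"
    os.makedirs(out_dir, exist_ok=True)
    name = capture_name(timestamp, seq)

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Simple validity heuristic (an assumption): the archive should have
        # served a snapshot rather than redirecting away.
        if "web.archive.org" not in driver.current_url:
            return None
        with open(os.path.join(out_dir, f"{name}.html"), "w", encoding="utf-8") as f:
            f.write(driver.page_source)
        # Grow the window to the full page height before the screenshot.
        height = driver.execute_script("return document.body.scrollHeight")
        driver.set_window_size(1366, max(height, 768))
        driver.save_screenshot(os.path.join(out_dir, f"{name}.png"))
        return name
    finally:
        driver.quit()
```

For example, `save_capture("20070330145203", seq=1)` would store `G20070330145203-1.html` and `G20070330145203-1.png`, matching the naming scheme described above.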
The transition from analog to digital archives and the recent explosion of online content offer researchers novel ways of engaging with data. The crucial question for ensuring a balance between the supply and demand side of data is whether this trend connects to existing scholarly practices and to the average search skills of researchers. To gain insight into this process, a survey was conducted among nearly three hundred (N = 288) humanities scholars in the Netherlands and Belgium with the aim of answering the following questions: 1) To what extent are digital databases and archives used? 2) What are the preferences in search functionalities? 3) Are there differences in search strategies between novices and experts in information retrieval? Our results show that while scholars actively engage in research online, they mainly search for text and images. General search systems such as Google and JSTOR are predominant, while large-scale collections such as Europeana are rarely consulted. Searching with keywords is the dominant search strategy, and advanced search options are rarely used. When comparing novice and more experienced searchers, the former tend to have a narrower selection of search engines and mostly use keywords. Our overall findings indicate that Google is the key player among available search engines. This dominant use illustrates the paradoxical attitude of scholars toward Google: while transparency of provenance and selection are deemed key academic requirements, the workings of the Google algorithm remain unclear. We conclude that Google introduces a black box into digital scholarly practices, indicating that scholars will become increasingly dependent on such black-boxed algorithms. This calls for a reconsideration of the academic principles of provenance and context.
According to our latest research, the global Next Generation Search Engines market size reached USD 16.2 billion in 2024, with a robust year-on-year growth driven by rapid technological advancements and escalating demand for intelligent search solutions across industries. The market is expected to witness a CAGR of 18.7% during the forecast period from 2025 to 2033, propelling the market to a projected value of USD 82.3 billion by 2033. The accelerating adoption of artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) within search technologies is a key growth factor, as organizations seek more accurate, context-aware, and personalized information retrieval solutions.
One of the most significant growth drivers for the Next Generation Search Engines market is the exponential increase in digital content and data generation worldwide. Enterprises and consumers alike are producing vast amounts of unstructured data daily, from documents and emails to social media posts and multimedia files. Traditional search engines often struggle to deliver relevant results from such complex datasets. Next generation search engines, powered by AI and ML algorithms, are uniquely positioned to address this challenge by providing semantic understanding, contextual relevance, and intent-driven results. This capability is especially critical for industries like healthcare, BFSI, and e-commerce, where timely and precise information retrieval can directly impact decision-making, operational efficiency, and customer satisfaction.
Another major factor fueling the growth of the Next Generation Search Engines market is the proliferation of mobile devices and the evolution of user interaction paradigms. As consumers increasingly rely on smartphones, tablets, and voice assistants, there is a growing demand for search solutions that support voice and visual queries, in addition to traditional text-based searches. Technologies such as voice search and visual search are gaining traction, enabling users to interact with search engines more naturally and intuitively. This shift is prompting enterprises to invest in advanced search platforms that can seamlessly integrate with diverse devices and channels, enhancing user engagement and accessibility. The integration of NLP further empowers these platforms to understand complex queries, colloquial language, and regional dialects, making search experiences more inclusive and effective.
Furthermore, the rise of enterprise digital transformation initiatives is accelerating the adoption of next generation search technologies across various sectors. Organizations are increasingly seeking to unlock the value of their internal data assets by deploying enterprise search solutions that can index, analyze, and retrieve information from multiple sources, including databases, intranets, cloud storage, and third-party applications. These advanced search engines not only improve knowledge management and collaboration but also support compliance, security, and data governance requirements. As businesses continue to embrace hybrid and remote work models, the need for efficient, secure, and scalable search capabilities becomes even more pronounced, driving sustained investment in this market.
Regionally, North America currently dominates the Next Generation Search Engines market, owing to the early adoption of AI-driven technologies, strong presence of leading technology vendors, and high digital literacy rates. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid digitalization, expanding internet penetration, and increasing investments in AI research and development. Europe is also witnessing steady growth, supported by robust regulatory frameworks and growing demand for advanced search solutions in sectors such as BFSI, healthcare, and education. Latin America and the Middle East & Africa are gradually catching up, as enterprises in these regions recognize the value of next generation search engines in enhancing operational efficiency and customer experience.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of the ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. Each AllSides balanced news roundup features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL of the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
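A minimal sketch of loading the CSV with pandas and checking the label distribution reported above (left 10,273 / right 7,222 / center 4,252). The column name bias_rating is an assumption, as the description does not name the label column.

```python
import pandas as pd

def bias_distribution(path="allsides_balanced_news_headlines-texts.csv",
                      label_col="bias_rating"):
    """Return the number of articles per bias label in the AllSides file.

    NOTE: label_col is a hypothetical column name; check the actual header.
    """
    df = pd.read_csv(path)
    return df[label_col].value_counts()
```

The returned `Series` is indexed by label, so `bias_distribution()["left"]` would give the count of left-tagged articles.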
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; the location is saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
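The expansion step above (one root query plus its 26 single-letter extensions, each sent to an autocomplete endpoint without performing a search) can be sketched as follows. This is not the authors' collection script: the "suggestqueries" URL is an unofficial, undocumented Google endpoint and may change, so treat the fetch function as an assumption.

```python
import json
import string
import urllib.parse
import urllib.request

def expand_root(root):
    """The root term plus its 26 single-letter extensions ('democrats a', ...)."""
    return [root] + [f"{root} {letter}" for letter in string.ascii_lowercase]

def google_suggestions(query):
    """Fetch autocomplete suggestions for one query input (no search performed).

    Uses the unofficial suggest endpoint, which returns [query, [suggestion, ...]].
    """
    url = ("https://suggestqueries.google.com/complete/search?client=firefox&q="
           + urllib.parse.quote(query))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))[1]
```

With 27 query inputs per root term and up to ten suggestions per input, this yields the up to 270 suggestions per topic and search engine described above.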
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available in the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which collects all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
Nowadays, web portals play an essential role in searching and retrieving information in several fields of knowledge: they are ever more technologically advanced and designed to support the storage of a huge amount of information in natural language originating from the queries launched by users worldwide. A good example is given by the WorldWideScience search engine. The database is available at . It is based on a similar gateway, Science.gov, which is the major path to U.S. government science information, as it pulls together Web-based resources from various agencies. The information in the database is intended to be of high quality and authority, as well as the most current available from the participating countries in the Alliance, so users will find that the results will be more refined than those from a general search of Google. It covers the fields of medicine, agriculture, the environment, and energy, as well as basic sciences. Most of the information may be obtained free of charge (the database itself may be used free of charge) and is considered "open domain." As of this writing, there are about 60 countries participating in WorldWideScience.org, providing access to 50+ databases and information portals. Not all content is in English. (Bronson, 2009) Given this scenario, we focused on building a corpus constituted by the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal: the aim is to retrieve information related to social media, which as of today represents a considerable source of data more and more widely used for research ends. This project includes eight months of query logs registered between July 2017 and February 2018, for a total of 445,827 queries.
The analysis mainly concentrates on the semantics of the queries received from the portal clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic, is evolving and follows – as well as reflects – the cultural changes of our modern society.
OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.
The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.
OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:
Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.
AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.
Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.
Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.
Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.
OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:
100B+ Images: Access an extensive database of over 100 billion images.
Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.
Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.
Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data. Purpose As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to
establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
compare how much data is in institutional vs. domain-specific vs. federal platforms
determine which repositories are recommended by top journals that require or recommend the publication of supporting data
ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.
We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.
Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo.
Finally, we compared scholarly journals' suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA researchers published in 2012 and 2016, combining search results from ARIS, Scopus, and the Forest Service's TreeSearch with the USDA websites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The author instructions of the top 50 journals were consulted to see whether they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories.
Data are provided for journals based on the 2012 and 2016 studies of where USDA employees publish their research, ranked by number of articles, and include the 2015/2016 Impact Factor, author guidelines, whether supplemental data are requested, whether supplemental data are reviewed, whether open data (supplemental or in a repository) are required, and the recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results.
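The thresholds used in this evaluation (a term comprising at least 1% or 5% of the total collection, and returning more than 100 or more than 500 results) can be expressed as a small helper. This is purely illustrative; the function name and the example counts are hypothetical.

```python
def evaluate_terms(term_counts, total_datasets):
    """Flag each search term against the study's thresholds.

    term_counts: {search term: number of results in one repository}
    total_datasets: total number of datasets in that repository
    """
    report = {}
    for term, n in term_counts.items():
        share = n / total_datasets
        report[term] = {
            "share": share,            # fraction of the total collection
            "over_1pct": share >= 0.01,
            "over_5pct": share >= 0.05,
            "over_100": n > 100,
            "over_500": n > 500,
        }
    return report
```

For instance, a hypothetical repository of 10,000 datasets in which "agriculture" returns 600 results would be flagged at both the 5% and the 500-result thresholds.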
We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
Over half of the top 50 ag-related journals from our list require or encourage open data for their published authors.
There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Dataset Card for Dataset Name
Dataset Details
Dataset Description
Dataset Name: Google Search Trends Top Rising Search Terms Description: The Google Search Trends Top Rising Search Terms dataset provides valuable insights into the most rapidly growing search queries on the Google search engine. It offers a comprehensive collection of trending search… See the full description on the dataset page: https://huggingface.co/datasets/hoshangc/google_search_terms_training_data.
AG is a collection of more than 1 million news articles, gathered from more than 2,000 news sources by ComeToMyHead in more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html. The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
To identify confident spectra from histone peptides containing PTMs, we present a method in which one kind of modification is searched at a time. We then combine the identifications of multiple search engines to obtain confident results. We find that two search engines, pFind and Mascot, identify most of the confident results. This study will be beneficial to those who are interested in histone proteomics analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package for our article The State of Serverless Applications: Collection, Characterization, and Community Consensus provides everything required to reproduce all results for the following three studies:
Serverless Application Collection
We collect descriptions of serverless applications from open-source projects, academic literature, industrial literature, and scientific computing.
Open-source Applications
As a starting point, we used an existing data set on open-source serverless projects from this study. We removed small and inactive projects based on the number of files, commits, contributors, and watchers. Next, we manually filtered the resulting data set to include only projects that implement serverless applications. We provide a table containing all projects that remained after the filtering alongside the notes from the manual filtering.
Academic Literature Applications
We based our search on an existing community-curated dataset on literature for serverless computing consisting of over 180 peer-reviewed articles. First, we filtered the articles based on title and abstract. In a second iteration, we filtered out any articles that implement only a single function for evaluation purposes or do not include sufficient detail to enable a review. As the authors were familiar with some additional publications describing serverless applications, we contributed them to the community-curated dataset and included them in this study. We provide a table with our notes from the manual filtering.
Scientific Computing Applications
Most of these scientific computing serverless applications are still at an early stage, and therefore little public data is available. One of the authors was employed at the German Aerospace Center (DLR) at the time of writing, which allowed us to collect information about several projects at DLR that are either currently moving to serverless solutions or are planning to do so. Additionally, an application from the German Electron Synchrotron (DESY) could be included. For each of these scientific computing applications, we provide a document containing a description of the project and the names of our contacts who provided information for the characterization of these applications.
Collection of serverless applications
Based on the previously described methodology, we collected a diverse dataset of 89 serverless applications from open-source projects, academic literature, industrial literature, and scientific computing. This dataset can be found in Dataset.xlsx.
Serverless Application Characterization
As previously described, we collected 89 serverless applications from four different sources. Subsequently, two randomly assigned reviewers out of seven available reviewers characterized each application along 22 characteristics in a structured collaborative review sheet. The characteristics and potential values were defined a priori by the authors and iteratively refined, extended, and generalized during the review process. The initial moderate inter-rater agreement was followed by a discussion and consolidation phase, where all differences between the two reviewers were discussed and resolved. The six scientific applications were not publicly available and therefore characterized by a single domain expert, who is either involved in the development of the applications or in direct contact with the development team.
Initial Ratings & Interrater Agreement Calculation
The initial reviews are available as a table, where every application is characterized along the 22 characteristics. A single value indicates that both reviewers assigned the same value, whereas a value of the form [Reviewer 2] A | [Reviewer 4] B indicates that reviewer two assigned the value A and reviewer four assigned the value B.
Our script for the calculation of the Fleiss' kappa score based on this data is also publicly available. It requires the Python packages pandas and statsmodels. It does not require any input and assumes that the file Initial Characterizations.csv is located in the same folder. It can be executed as follows:
python3 CalculateKappa.py
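The core of such a kappa calculation can be sketched with statsmodels' inter-rater utilities. This is a simplified stand-in for CalculateKappa.py, assuming the two-reviewer layout described above; the function name and input format are illustrative.

```python
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def kappa_for_characteristic(ratings):
    """Fleiss' kappa for one characteristic.

    ratings: one (reviewer_1_value, reviewer_2_value) pair per application,
    e.g. [("HTTP", "HTTP"), ("Queue", "HTTP"), ...].
    """
    # aggregate_raters turns raw labels into a subjects x categories count
    # table, which is the input format fleiss_kappa expects.
    table, _categories = aggregate_raters(ratings)
    return fleiss_kappa(table)
```

Perfect agreement between the two reviewers yields a kappa of 1.0, while conflicting assignments pull the score down, which is what motivated the consolidation phase described below.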
Results Including Unknown Data
In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find.
For six characteristics, many applications were assigned the ''Unknown'' value, i.e., the reviewers were not able to determine the value of this characteristic. Therefore, we excluded these characteristics from this study. For the remaining characteristics, the percentage of ''Unknowns'' ranges from 0–19% with two outliers at 25% and 30%. These ''Unknowns'' were excluded from the percentage values presented in the article. As part of our replication package, we provide the raw results for each characteristic including the ''Unknown'' percentages in the form of bar charts.
The script for the generation of these bar charts is also part of this replication package. It uses the Python packages pandas, numpy, and matplotlib. It does not require any input and assumes that the file Dataset.csv is located in the same folder. It can be executed as follows:
python3 GenerateResultsIncludingUnknown.py
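A hedged sketch of what the chart generation might look like: one bar chart per characteristic with the "Unknown" share included. The column names and output file names here are illustrative assumptions, not the real Dataset.csv schema:

```python
# One bar chart per characteristic, with "Unknown" counted like any other value.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

def plot_characteristic(df, characteristic, out_file=None):
    counts = df[characteristic].value_counts(dropna=False)
    share = counts / counts.sum() * 100  # percentages, "Unknown" included
    ax = share.plot.bar()
    ax.set_ylabel("Share of applications (%)")
    ax.set_title(characteristic)
    plt.tight_layout()
    plt.savefig(out_file or f"{characteristic}.png")
    plt.close()
```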
Final Dataset & Figure Generation
Following the discussion and consolidation phase described above, we were able to resolve all conflicts, resulting in a collection of 89 applications described by 18 characteristics. This dataset is available here: link
The script to generate all figures shown in the chapter "Serverless Application Characterization" can be found here. It does not require any input but assumes that the file Dataset.csv is located in the same folder. It uses the Python packages pandas, numpy, and matplotlib. It can be executed as follows:
python3 GenerateFigures.py
Comparison Study
To identify existing surveys and datasets that also investigate one of our characteristics, we conducted a literature search using Google as our search engine, as we were mostly looking for grey literature. We used the following search term:
("serverless" OR "faas") AND ("dataset" OR "survey" OR "report") after: 2018-01-01
This search term looks for any combination of either serverless or faas alongside any of the terms dataset, survey, or report. We further limited the search to any articles after 2017, as serverless is a fast-moving field and therefore any older studies are likely outdated already. This search term resulted in a total of 173 search results. In order to validate if using only a single search engine is sufficient, and if the search term is broad enough, we
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a comprehensive collection of over 250,000 pharmaceutical products available in India, including details like medicine name, price (INR), manufacturer, packaging, and active compositions.
Each entry reflects structured real-world pharmaceutical product data, useful for analyzing trends in medicine pricing, formulations, discontinued products, and market competition. The dataset was cleaned to remove duplicates, extract quantities from packaging labels, and enrich fields like medicine form and composition structure.
Columns Included:
- id: Unique ID for each medicine
- name: Brand name of the drug
- price_inr: Retail price in Indian Rupees
- is_discontinued: Whether the product is active or discontinued
- manufacturer_name: Drug manufacturing company
- packaging: Original packaging info (e.g., "strip of 10 tablets")
- pack_quantity: Number or volume extracted from packaging
- pack_unit: Unit of measurement (e.g., tablets, ml)
- active_ingredient_1 & active_ingredient_2: Composition of the medicine
- medicine_form: Extracted form such as Tablet, Syrup, Injection, etc.
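The quantity extraction described above can be done with a simple pattern match. This is a hedged sketch of the idea, not the dataset compilers' actual cleaning code; the regex and function name are assumptions:

```python
# Toy extraction of pack_quantity and pack_unit from a packaging label
# such as "strip of 10 tablets". Real-world labels need more cases.
import re

def parse_packaging(packaging):
    """Return (quantity, unit) parsed from a packaging string, or (None, None)."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", packaging)
    if not match:
        return None, None
    return float(match.group(1)), match.group(2).lower()
```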
Possible Use Cases:
- Analyzing drug price variations across manufacturers
- Identifying top manufacturers or most common drug compositions
- Drug recommendation or search engine (based on active ingredients)
- Research in pharmacoeconomics, generic vs. branded pricing
Disclaimer: This dataset is compiled for educational and analytical use only. It does not provide medical advice or endorsements.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: This study aimed to investigate the quality and readability of online English health information about dental sensitivity and how patients evaluate and utilize this web-based information.
Methods: Health information was obtained from three search engines and assessed for credibility and readability. We conducted searches in "incognito" mode to reduce the possibility of bias. Quality assessment utilized JAMA benchmarks, the DISCERN tool, and HONcode. Readability was analyzed using the SMOG, FRE, and FKGL indices.
Results: Out of 600 websites, 90 were included, 62.2% of them affiliated with dental or medical centers; among these websites, 80% related exclusively to dental implant treatments. Regarding JAMA benchmarks, currency was the most commonly achieved criterion, and 87.8% of websites fell into the "moderate quality" category. Word and sentence counts ranged widely, with means of 815.7 (±435.4) and 60.2 (±33.3), respectively. FKGL averaged 8.6 (±1.6), SMOG scores averaged 7.6 (±1.1), and the FRE scale showed a mean of 58.28 (±9.1), with "fairly difficult" being the most common category.
Conclusion: The overall evaluation using DISCERN indicated a moderate quality level, with a notable absence of referencing. JAMA benchmarks revealed general non-adherence, as none of the websites met all four criteria. Only one website was HONcode certified, suggesting a lack of reliable sources for web-based health information accuracy. Readability assessments showed varying results, with the majority being "fairly difficult". Although readability did not differ significantly across affiliations, a wide range of word and sentence counts was observed between them.
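For reference, the three readability indices named above follow standard published formulas. This sketch takes pre-computed text statistics as inputs; syllable counting is the nontrivial part and is usually delegated to a dedicated library:

```python
# Standard readability formulas (Flesch-Kincaid, Flesch Reading Ease, SMOG).
import math

def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: higher means harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fre(words, sentences, syllables):
    """Flesch Reading Ease: 50-60 corresponds to 'fairly difficult'."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def smog(polysyllables, sentences):
    """SMOG grade; intended for samples of 30 or more sentences."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```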
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle, and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.
Database Management Systems
As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way consumers access information through applications, which further illustrates the importance of the software.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Graph drawing, involving the automatic layout of graphs, is vital for clear data visualization and interpretation but poses challenges due to the optimization of a multi-metric objective function, an area where current search-based methods seek improvement. In this paper, we investigate the performance of the Jaya algorithm for automatic graph layout with straight lines. The Jaya algorithm has not previously been used in the field of graph drawing. Unlike most population-based methods, the Jaya algorithm is parameter-less in that it requires no algorithm-specific control parameters; only the population size and number of iterations need to be specified, which makes it easy for researchers to apply in the field. To improve the Jaya algorithm's performance, we applied Latin Hypercube Sampling to initialize the population of individuals so that they widely cover the search space. We developed a visualization tool that simplifies the integration of search methods, allowing for easy performance testing of algorithms on graphs with weighted aesthetic metrics. We benchmarked the Jaya algorithm and its enhanced version against Hill Climbing and Simulated Annealing, commonly used graph-drawing search algorithms with a limited number of parameters, to demonstrate the Jaya algorithm's effectiveness in the field. We conducted experiments on synthetic datasets with varying numbers of nodes and edges generated using the Erdős–Rényi model, as well as real-world graph datasets, and evaluated the quality of the generated layouts and the performance of the methods based on the number of function evaluations. We also conducted a scalability experiment to evaluate the Jaya algorithm's ability to handle large-scale graphs. Our results showed that the Jaya algorithm significantly outperforms Hill Climbing and Simulated Annealing in terms of the quality of the generated graph layouts and the speed at which the layouts were produced.
Using improved population sampling generated better layouts than the original Jaya algorithm with the same number of function evaluations. Moreover, the Jaya algorithm was able to draw layouts for graphs with 500 nodes in a reasonable time.
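The parameter-less Jaya update rule can be sketched in a few lines. This is a toy minimization of the sphere function, not the paper's implementation; in the graph-drawing setting, the weighted aesthetic metrics would replace the objective used here:

```python
# Jaya update rule (Rao, 2016): move toward the best individual and away from
# the worst, with only population size and iteration count to configure.
import random

def jaya(objective, dim, pop_size=20, iterations=200, bounds=(-5.0, 5.0)):
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iterations):
        scores = [objective(x) for x in pop]
        best = pop[scores.index(min(scores))]
        worst = pop[scores.index(max(scores))]
        for k, x in enumerate(pop):
            r1, r2 = random.random(), random.random()
            cand = [min(hi, max(lo, xj + r1 * (bj - abs(xj)) - r2 * (wj - abs(xj))))
                    for xj, bj, wj in zip(x, best, worst)]
            if objective(cand) < objective(x):  # greedy acceptance
                pop[k] = cand
    return min(pop, key=objective)

sphere = lambda x: sum(v * v for v in x)  # toy stand-in for layout metrics
```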
This submission includes the raw data analyzed and search results described in our manuscript “Proteome-Scale Recombinant Standards And A Robust High-Speed Search Engine To Advance Cross-Linking MS-Based Interactomics”. In this study, we develop a strategy to generate a well-controlled XL-MS standard by systematically mixing and cross-linking recombinant proteins. The standard can be split into independent datasets, each of which has the MS2-level complexity of a typical proteome-wide XL-MS experiment. The raw datasets included in this submission were used to (1) guide the development of Scout, a machine learning-based search engine for XL-MS with MS-cleavable cross-linkers (batch 1), (2) test different LC-MS acquisition methods (batch 2), and (3) directly compare Scout to widely used XL-MS search engines (batches 3 and 4).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current dataset consists of 200 search results extracted from the Google and Bing engines (100 from Google and 100 from Bing). The search terms were selected from the 10 most-searched keywords of 2021 based on data provided by Google Trends. The remaining sheets include the performance of the websites according to three technical evaluation aspects: SEO, speed, and security. The performance dataset was developed using the CheckBot crawling tool. The whole dataset can help information retrieval scientists compare the two engines in terms of their position/ranking and their performance related to these factors.
For more information about the structure of the dataset, please contact the Information Management Lab of the University of West Attica.
Contact Persons: Vasilis Ntararas (lb17032@uniwa.gr) , Georgios Ntimo (lb17100@uniwa.gr) and Ioannis C. Drivas (idrivas@uniwa.gr)
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation dates of the tweets range from 12/07/2020 to 25/07/2020. Will update on a monthly cadence.
- The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
- The date at which the tweet was created can be found in the created_at field.
- The search query used to query the Twitter search engine can be found in the search_query field.
- The tweet's full text can be found in the text field.
- The sentiment can be found in the polarity field (I've used the VADER model from NLTK to compute this).
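The polarity field was computed with NLTK's VADER model (SentimentIntensityAnalyzer). As a dependency-free illustration of the idea, here is a toy lexicon scorer; the real VADER lexicon has thousands of weighted entries plus negation and intensifier handling:

```python
# Toy stand-in for VADER-style sentiment scoring: sum lexicon weights,
# then squash into [-1, 1] like VADER's compound score.
LEXICON = {"love": 2.0, "loving": 2.0, "great": 1.5, "bad": -1.5, "hate": -2.0}

def toy_polarity(text):
    words = [w.strip("#!,.").lower() for w in text.split()]
    score = sum(LEXICON.get(w, 0.0) for w in words)
    return max(-1.0, min(1.0, score / 4.0))
```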
There may be slight duplication in tweet IDs before 22/07/2020. I have since fixed this bug.
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the data lake via AWS or for other data shenanigans.
Here's an App I built using a live version of this data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Isobaric labeling-based proteomics is widely applied in deep proteome quantification. Among the platforms for isobaric labeled proteomic data analysis, the commercial software Proteome Discoverer (PD), which incorporates the search engine CHIMERYS, is widely used, while FragPipe (FP) is relatively new, free for noncommercial purposes, and integrates the engine MSFragger. Here, we compared PD and FP on three public proteomic data sets labeled using 6plex, 10plex, and 16plex tandem mass tags. Our results showed the protein abundances generated by the two tools are highly correlated. PD quantified more proteins (10.02%, 15.44%, 8.19%) than FP with comparable NA ratios (0.00% vs. 0.00%, 0.85% vs. 0.38%, and 11.74% vs. 10.52%) in the three data sets. Using the 16plex data set, PD and FP outputs showed high consistency in quantifying technical replicates, batch effects, and functional enrichment in differentially expressed proteins. However, FP saved 93.93%, 96.65%, and 96.41% of processing time compared to PD for analyzing the three data sets, respectively. In conclusion, while PD is a well-maintained commercial package that integrates various additional functions and can quantify more proteins, FP is freely available and achieves similar output with a shorter computational time. Our results will guide users in choosing the most suitable quantification software for their needs.
The sharp increase of pressure at the edge of a high confinement mode (H-mode) plasma, the pedestal, strongly impacts overall plasma performance. Predicting the pedestal is a necessity to control and optimize tokamak operations. An experimental data-driven machine learning (ML) approach is presented that predicts the pedestal heights and widths of electron density (ne) and electron temperature (Te) profiles as well as the separatrix ne from externally controllable parameters such as the plasma shape, heating method and power, and gas puff rate and integrated gas puff. The OMFIT framework was used with DIII-D data to efficiently, robustly, and automatically build a database of pedestal parameters to train machine learning models. Database creation was enabled by the search engine tool for DIII-D data, TokSearch, which parallelizes data fetching, enabling fast searches through basic signals of thousands of DIII-D shots and selection of relevant time intervals. Principal Component Analysis (PCA) separated the database into three clusters that represent classes of plasma shapes that are regularly used in DIII-D. The most important parameters for setting the pedestal structure were plasma current (Ip), toroidal magnetic field (Bφ), neutral beam heating power (PNBI) and shaping quantities. The Deep Jointly Informed Neural Networks (DJINN) algorithm was applied to identify suitable neural network (NN) architectures that appropriately capture the features of the pedestal database. Separate NNs were implemented for each pedestal parameter, and ensembling methods were used to improve the prediction accuracy and allowed estimation of the prediction uncertainty. The pedestal predictions of the test dataset lie within the measurement uncertainties of the pedestal parameters. The NN outperformed simple Linear Regression (LR) analysis, indicating non-linear dependencies in the pedestal structure. 
The presented achievements illustrate a promising path for future research, using feature extraction to infer experimental trends and thereby improve pedestal models as well as deploying NN for a fast pedestal prediction in DIII-D scenario development.