56 datasets found
  1. Evolution of Web search engine interfaces through SERP screenshots and HTML...

    • rdm.inesctec.pt
    Updated Jul 26, 2021
    Cite
    (2021). Evolution of Web search engine interfaces through SERP screenshots and HTML complete pages for 20 years - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2021-003
    Explore at:
    Dataset updated
    Jul 26, 2021
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive, which has been keeping snapshots and the respective HTML versions of webpages over time; its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver, for browser automation, to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot.

    The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a folder of supporting files. For file naming, we concatenate the initial of the search engine (e.g., G for Google) with the capture's timestamp. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007; the first is identified by "G20070330145203".

    Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. This graphic represents the diversity of captures by year and search engine (Google and Bing).
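    The file-naming convention described above (search-engine initial, a 14-digit timestamp, and an optional "-N" suffix when the timestamp repeats) can be sketched in Python. This is an illustrative helper, not code shipped with the dataset:

```python
import re

def capture_name(engine_initial, timestamp, sequence=0):
    """Build a capture identifier such as 'G20070330145203', or
    'G20070330145203-1' when the same timestamp occurs again."""
    name = f"{engine_initial}{timestamp}"
    return name if sequence == 0 else f"{name}-{sequence}"

def parse_capture_name(name):
    """Split an identifier back into (engine initial, timestamp, sequence)."""
    m = re.fullmatch(r"([A-Z])(\d{14})(?:-(\d+))?", name)
    if not m:
        raise ValueError(f"not a capture name: {name}")
    engine, ts, seq = m.groups()
    return engine, ts, int(seq) if seq else 0
```

    For instance, `parse_capture_name("G20070330145203-1")` yields `("G", "20070330145203", 1)`, i.e. the second Google capture for that timestamp.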

  2. DataForSEO Google Full (Keywords+SERP) database, historical data available

    • datarade.ai
    .json, .csv
    Updated Aug 17, 2023
    Cite
    DataForSEO (2023). DataForSEO Google Full (Keywords+SERP) database, historical data available [Dataset]. https://datarade.ai/data-products/dataforseo-google-full-keywords-serp-database-historical-d-dataforseo
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Authors
    DataForSEO
    Area covered
    South Africa, Burkina Faso, Costa Rica, Paraguay, United Kingdom, Côte d'Ivoire, Cyprus, Sweden, Bolivia (Plurinational State of), Portugal
    Description

    You can check the field descriptions in the documentation: current Full database: https://docs.dataforseo.com/v3/databases/google/full/?bash; historical Full database: https://docs.dataforseo.com/v3/databases/google/history/full/?bash.

    Full Google Database is a combination of the Advanced Google SERP Database and Google Keyword Database.

    Google SERP Database offers millions of SERPs collected in 67 regions with most of Google’s advanced SERP features, including featured snippets, knowledge graphs, people also ask sections, top stories, and more.

    Google Keyword Database encompasses billions of search terms enriched with related Google Ads data: search volume trends, CPC, competition, and more.

    This database is available in JSON format only.

    You don’t have to download fresh data dumps in JSON – we can deliver data straight to your storage or database. We send terabytes of data to dozens of customers every month using Amazon S3, Google Cloud Storage, Microsoft Azure Blob, Elasticsearch, and Google BigQuery. Let us know if you’d like to get your data to any other storage or database.

  3. Search Engineing Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Search Engineing Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/search-engine-marketing-market
    Explore at:
    csv, pdf, pptx (available download formats)
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Search Engine Market Outlook



    The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies is transforming the way search engines operate, offering more personalized and efficient search results, thereby further propelling market growth.
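    The figures quoted above are internally consistent: compounding USD 124 billion at the stated 8.5% CAGR over the nine years from 2023 to 2032 lands at roughly USD 258 billion. A quick arithmetic check, using only values from the text:

```python
start_usd_bn = 124        # 2023 market size stated in the report
cagr = 0.085              # stated compound annual growth rate
years = 2032 - 2023       # nine compounding periods

projected = start_usd_bn * (1 + cagr) ** years
print(round(projected))   # -> 258, matching the projected USD 258 billion
```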



    One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.



    Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.



    The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.



    The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.



    Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to

  4. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • data.niaid.nih.gov
    Updated Mar 1, 2023
    Cite
    Haak, Fabian (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Schaer, Philipp
    Haak, Fabian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of the ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news feature offers three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further include headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times (approximately half of the total number of topics). The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, whose location is saved in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
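    The expansion scheme described above — each root query plus its 26 single-letter extensions, with up to ten autocomplete suggestions per input — can be sketched as pure string generation (the actual scraping of the Google and Bing autocomplete endpoints is not reproduced here):

```python
import string

def query_inputs(root_term):
    """Root query plus the 26 letter-extended variants, e.g.
    'democrats' -> ['democrats', 'democrats a', ..., 'democrats z']."""
    return [root_term] + [f"{root_term} {c}" for c in string.ascii_lowercase]

inputs = query_inputs("democrats")
# 27 query inputs x up to 10 suggestions each
# = up to 270 suggestions per topic and search engine, as stated above
max_suggestions = len(inputs) * 10
```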

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available at the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper that scrapes all available AllSides news articles and gathers the available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.

  5. Data from: Inventory of online public databases and repositories holding...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and as a baseline for future studies of ag research data.

    Purpose

    As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
    • compare how much data is in institutional vs. domain-specific vs. federal platforms;
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories, using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether they could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research, ranked by number of articles, including 2015/2016 Impact Factor, author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation

    We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results

    A summary of the major findings from our data review:

    • Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
    • There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    • Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

    See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:

    • Journals. File Name: Journals.csv
    • Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • TDWG presentation. File Name: TDWG_Presentation.pptx
    • Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • General repositories containing ag data. File Name: general_repos_1.csv
    • README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
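    The repository-evaluation criteria described above (a term's share of the total collection, and the 100- and 500-result thresholds) reduce to simple arithmetic. A minimal sketch with illustrative field names, not the study's actual analysis code:

```python
def evaluate_term(term_hits, total_datasets):
    """Summarize one search term's footprint in a repository, using the
    share and result-count thresholds named in the evaluation."""
    share = term_hits / total_datasets
    return {
        "share_pct": round(100 * share, 2),
        "at_least_1pct": share >= 0.01,
        "at_least_5pct": share >= 0.05,
        "over_100_results": term_hits > 100,
        "over_500_results": term_hits > 500,
    }
```

    For example, a term returning 600 datasets in a 10,000-dataset repository clears both the 5%-of-collection and 500-result bars, the combination that singled out GBIF, ICPSR, and ORNL DAAC in the findings.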

  6. Just Google It - Digital Research Practices of Humanities Scholars - Dataset...

    • b2find.dkrz.de
    • b2find.eudat.eu
    Updated Jul 2, 2013
    + more versions
    Cite
    (2013). Just Google It - Digital Research Practices of Humanities Scholars - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/8c2af070-9cc3-5fed-b8bd-a7cec6e94c09
    Explore at:
    Dataset updated
    Jul 2, 2013
    Description

    The transition from analog to digital archives and the recent explosion of online content offer researchers novel ways of engaging with data. The crucial question for ensuring a balance between the supply and demand sides of data is whether this trend connects to existing scholarly practices and to the average search skills of researchers. To gain insight into this process, a survey was conducted among nearly three hundred (N = 288) humanities scholars in the Netherlands and Belgium with the aim of answering the following questions: 1) To what extent are digital databases and archives used? 2) What are the preferences in search functionalities? 3) Are there differences in search strategies between novices and experts of information retrieval? Our results show that while scholars actively engage in research online, they mainly search for text and images. General search systems such as Google and JSTOR are predominant, while large-scale collections such as Europeana are rarely consulted. Searching with keywords is the dominant search strategy and advanced search options are rarely used. When comparing novice and more experienced searchers, the former tend to have a narrower selection of search engines and mostly use keywords. Our overall findings indicate that Google is the key player among available search engines. This dominant use illustrates the paradoxical attitude of scholars toward Google: while transparency of provenance and selection are deemed key academic requirements, the workings of the Google algorithm remain unclear. We conclude that Google introduces a black box into digital scholarly practices, indicating scholars will become increasingly dependent on such black-boxed algorithms. This calls for a reconsideration of the academic principles of provenance and context.

  7. China Consumer Interest from Baidu Search Index Analytics | Online Search...

    • datarade.ai
    .json, .csv
    Updated Apr 1, 2024
    Cite
    Datago Technology Limited (2024). China Consumer Interest from Baidu Search Index Analytics | Online Search Trends Data | 3000+ Global Consumer Bands | Daily Update [Dataset]. https://datarade.ai/data-products/china-consumer-interest-from-baidu-search-index-analytics-o-datago-technology-limited
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Apr 1, 2024
    Dataset authored and provided by
    Datago Technology Limited
    Area covered
    China
    Description

    Baidu Search Index is a big data analytics tool developed by Baidu to track changes in keyword search popularity within its search engine. By analyzing trends in the Baidu Search Index for specific keywords, users can effectively monitor public interest in topics, companies, or brands.

    As an ecosystem partner of Baidu Index, Datago has direct access to keyword search index data from Baidu's database, leveraging this information to build the BSIA-Consumer. This database encompasses popular brands that are actively searched by Chinese consumers, along with their commonly used names. By tracking Baidu Index search trends for these keywords, Datago precisely maps them to their corresponding publicly listed stocks.

    The database covers over 1,100 consumer stocks and 3,000+ brand keywords across China, the United States, Europe, and Japan, with a particular focus on popular sectors like luxury goods and vehicles. Through its analysis of Chinese consumer search interest, this database offers investors a unique perspective on market sentiment, consumer preferences, and brand influence, including:

    • Brand Influence Tracking – By leveraging Baidu Search Index data, investors can assess the level of consumer interest in various brands, helping to evaluate their influence and trends within the Chinese market.

    • Consumer Stock Mapping – BSIA-consumer provides an accurate linkage between brand keywords and their associated consumer stocks, enabling investor analysis driven by consumer interest.

    • Coverage of Popular Consumer Goods – BSIA-consumer focuses specifically on trending sectors like luxury goods and vehicles, offering valuable insights into these industries.

    • Coverage: 1000+ consumer stocks

    • History: 2016-01-01

    • Update Frequency: Daily

  8. Dataset used to guide the development of Scout and benchmark XL-MS search...

    • data.niaid.nih.gov
    xml
    Updated Aug 3, 2024
    Cite
    Max Ruwolt; Fan Liu (2024). Dataset used to guide the development of Scout and benchmark XL-MS search engines [Dataset]. https://data.niaid.nih.gov/resources?id=pxd052022
    Explore at:
    xml (available download formats)
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    Leibniz-Forschungsinstitut für Molekulare Pharmakologie
    Authors
    Max Ruwolt; Fan Liu
    Variables measured
    Proteomics
    Description

    This submission includes the raw data analyzed and the search results described in our manuscript “Proteome-Scale Recombinant Standards And A Robust High-Speed Search Engine To Advance Cross-Linking MS-Based Interactomics”. In this study, we develop a strategy to generate a well-controlled XL-MS standard by systematically mixing and cross-linking recombinant proteins. The standard can be split into independent datasets, each of which has the MS2-level complexity of a typical proteome-wide XL-MS experiment. The raw datasets included in this submission were used to (1) guide the development of Scout, a machine learning-based search engine for XL-MS with MS-cleavable cross-linkers (batch 1), (2) test different LC-MS acquisition methods (batch 2), and (3) directly compare Scout to widely used XL-MS search engines (batches 3 and 4).

  9. Semantic Query Analysis from the Global Science Gateway - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 12, 2024
    + more versions
    Cite
    (2024). Semantic Query Analysis from the Global Science Gateway - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2cf68914-a4ff-535e-89bc-9b86b2ca555c
    Explore at:
    Dataset updated
    Oct 12, 2024
    Description

    Nowadays web portals play an essential role in searching and retrieving information in several fields of knowledge: they are ever more technologically advanced and designed to support the storage of a huge amount of natural-language information originating from the queries launched by users worldwide. A good example is given by the WorldWideScience search engine:

    The database is available at . It is based on a similar gateway, Science.gov, which is the major path to U.S. government science information, as it pulls together Web-based resources from various agencies. The information in the database is intended to be of high quality and authority, as well as the most current available from the participating countries in the Alliance, so users will find that the results will be more refined than those from a general search of Google. It covers the fields of medicine, agriculture, the environment, and energy, as well as basic sciences. Most of the information may be obtained free of charge (the database itself may be used free of charge) and is considered “open domain.” As of this writing, there are about 60 countries participating in WorldWideScience.org, providing access to 50+ databases and information portals. Not all content is in English. (Bronson, 2009)

    Given this scenario, we focused on building a corpus constituted by the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal. The aim is to retrieve information related to social media, which as of today represents a considerable source of data more and more widely used for research ends. This project includes eight months of query logs registered between July 2017 and February 2018, for a total of 445,827 queries. The analysis mainly concentrates on the semantics of the queries received from the portal's clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic, is evolving, and follows, as well as reflects, the cultural changes of our modern society.

  10. h

    healthsearchqa

    • huggingface.co
    Updated Mar 7, 2024
    + more versions
    Cite
    AISC Team D2 (2024). healthsearchqa [Dataset]. https://huggingface.co/datasets/aisc-team-d2/healthsearchqa
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 7, 2024
    Dataset authored and provided by
    AISC Team D2
    License

    https://choosealicense.com/licenses/unknown/

    Description

    HealthSearchQA

    Dataset of consumer health questions released by Google for the Med-PaLM paper (arXiv preprint). From the paper: We curated our own additional dataset consisting of 3,173 commonly searched consumer questions, referred to as HealthSearchQA. The dataset was curated using seed medical conditions and their associated symptoms. We used the seed data to retrieve publicly-available commonly searched questions generated by a search engine, which were displayed to all users… See the full description on the dataset page: https://huggingface.co/datasets/aisc-team-d2/healthsearchqa.

  11. Data from: Analysis of the Quantitative Impact of Social Networks General...

    • figshare.com
    • produccioncientifica.ucm.es
    doc
    Updated Oct 14, 2022
    Cite
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz (2022). Analysis of the Quantitative Impact of Social Networks General Data.doc [Dataset]. http://doi.org/10.6084/m9.figshare.21329421.v1
    Explore at:
    docAvailable download formats
    Dataset updated
    Oct 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union". Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate, understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content? To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country, since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic. In all cases, cybermedia owned by a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; in some cases they correspond to classic information companies (both newspapers and televisions) while in others they are digital natives, without this circumstance affecting the nature of the research proposed.
    We then examined the web traffic data of said cybermedia. The period corresponding to the months of October, November and December 2021 and January, February and March 2022 was selected. We believe that this six-month stretch allows possible one-off monthly variations to be smoothed out, reinforcing the precision of the data obtained. To obtain this data, we used the SimilarWeb tool, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic coming from desktops and laptops, without taking into account traffic from mobile devices, which is currently impossible to determine with existing measurement tools on the market. It includes:

    • General web traffic data: average visit duration, pages per visit, and bounce rate
    • Web traffic origin by country
    • Percentage of traffic generated from social media over total web traffic
    • Distribution of web traffic generated from social networks
    • Comparison of web traffic generated from social networks with direct and search procedures
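As a minimal sketch of the kind of computation involved (with invented numbers, not the study's data), the social-network share of total web traffic can be derived from channel-level visit counts like these:

```python
# Hypothetical channel breakdown for one cybermedia outlet.
# The figures are illustrative only, not taken from the study.
visits_by_channel = {
    "direct": 420_000,
    "search": 910_000,   # SEO-driven traffic
    "social": 150_000,
    "referral": 70_000,
}

total = sum(visits_by_channel.values())

# Percentage of total web traffic contributed by each channel.
shares = {ch: round(100 * v / total, 2) for ch, v in visits_by_channel.items()}

print(shares["social"])  # share of traffic coming from social networks
```

With these numbers, the social-network share comes out below both direct and search traffic, which is the comparison the second research question asks about.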

  12. f

    Data from: Comparative Evaluation of Proteome Discoverer and FragPipe for...

    • acs.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Tianen He; Youqi Liu; Yan Zhou; Lu Li; He Wang; Shanjun Chen; Jinlong Gao; Wenhao Jiang; Yi Yu; Weigang Ge; Hui-Yin Chang; Ziquan Fan; Alexey I. Nesvizhskii; Tiannan Guo; Yaoting Sun (2023). Comparative Evaluation of Proteome Discoverer and FragPipe for the TMT-Based Proteome Quantification [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00390.s002
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tianen He; Youqi Liu; Yan Zhou; Lu Li; He Wang; Shanjun Chen; Jinlong Gao; Wenhao Jiang; Yi Yu; Weigang Ge; Hui-Yin Chang; Ziquan Fan; Alexey I. Nesvizhskii; Tiannan Guo; Yaoting Sun
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Isobaric labeling-based proteomics is widely applied in deep proteome quantification. Among the platforms for isobaric labeled proteomic data analysis, the commercial software Proteome Discoverer (PD) is widely used, incorporating the search engine CHIMERYS, while FragPipe (FP) is relatively new, free for noncommercial purposes, and integrates the engine MSFragger. Here, we compared PD and FP over three public proteomic data sets labeled using 6plex, 10plex, and 16plex tandem mass tags. Our results showed the protein abundances generated by the two software tools are highly correlated. PD quantified more proteins (10.02%, 15.44%, 8.19%) than FP with comparable NA ratios (0.00% vs. 0.00%, 0.85% vs. 0.38%, and 11.74% vs. 10.52%) in the three data sets. Using the 16plex data set, PD and FP outputs showed high consistency in quantifying technical replicates, batch effects, and functional enrichment in differentially expressed proteins. However, FP saved 93.93%, 96.65%, and 96.41% of processing time compared to PD for analyzing the three data sets, respectively. In conclusion, while PD is a well-maintained commercial tool integrating various additional functions and can quantify more proteins, FP is freely available and achieves similar output with a shorter computational time. Our results will guide users in choosing the most suitable quantification software for their needs.

  13. DBpedia Ontology

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). DBpedia Ontology [Dataset]. https://www.kaggle.com/datasets/thedevastator/dbpedia-ontology-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DBpedia Ontology

    Text Classification Dataset with 14 Classes

    By dbpedia_14 (From Huggingface) [source]

    About this dataset

    The DBpedia Ontology Classification Dataset, known as dbpedia_14, is a comprehensive and meticulously constructed dataset containing a vast collection of text samples. These samples have been expertly classified into 14 distinct and non-overlapping classes. The dataset draws its information from the highly reliable and up-to-date DBpedia 2014 knowledge base, ensuring the accuracy and relevance of the data.

    Each text sample in this extensive dataset consists of various components that provide valuable insights into its content. These components include a title, which succinctly summarizes the main topic or subject matter of the text sample, and content that comprehensively covers all relevant information related to a specific topic.

    To facilitate effective training of machine learning models for text classification tasks, each text sample is further associated with a corresponding label. This categorical label serves as an essential element for supervised learning algorithms to classify new instances accurately.

    Furthermore, this exceptional dataset is part of the larger DBpedia Ontology Classification Dataset with 14 Classes (dbpedia_14). It offers numerous possibilities for researchers, practitioners, and enthusiasts alike to conduct in-depth analyses ranging from sentiment analysis to topic modeling.

    Aspiring data scientists will find great value in utilizing this well-organized dataset for training their machine learning models. Although specific details about train.csv and test.csv files are not provided here due to their dynamic nature, they play pivotal roles during model training and testing processes by respectively providing labeled training samples and unseen test samples.

    Lastly, it's worth mentioning that users can refer to the included classes.txt file within this dataset for an exhaustive list of all 14 classes used in classifying these diverse text samples accurately.

    Overall, with its wealth of carefully curated textual data across multiple domains and precise class labels assigned based on well-defined categories derived from DBpedia 2014 knowledge base, the DBpedia Ontology Classification Dataset (dbpedia_14) proves instrumental in advancing research efforts related to natural language processing (NLP), text classification, and other related fields

    Research Ideas

    • Text classification: The DBpedia Ontology Classification Dataset can be used to train machine learning models for text classification tasks. With 14 different classes, the dataset is suitable for various classification tasks such as sentiment analysis, topic classification, or intent detection.
    • Ontology development: The dataset can also be used to improve or expand existing ontologies. By analyzing the text samples and their assigned labels, researchers can identify missing or incorrect relationships between concepts in the ontology and make improvements accordingly.
    • Semantic search engine: The DBpedia knowledge base is widely used in semantic search engines that aim to provide more accurate and relevant search results by understanding the meaning of user queries and matching them with structured data. This dataset can help in training models for improving the performance of these semantic search engines by enhancing their ability to classify and categorize information accurately based on user queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:--------------|:---------------------------------------------------------------------------------------------------------|
    | label | The class label assigned to each text sample. (Categorical) |
    | title | The heading or name given to each text sample, providing some context or overview of its content. (Text) |

    File: test.csv

    | Column name | Description |
    |:--------------|:-----------------------...

  14. The State of Serverless Applications: Collection, Characterization, and...

    • zenodo.org
    zip
    Updated Aug 12, 2021
    Cite
    Simon Eismann; Joel Scheuner; Erwin van Eyk; Maximilian Schwinger; Johannes Grohmann; Nikolas Herbst; Cristina Abad (2021). The State of Serverless Applications: Collection, Characterization, and Community Consensus - Replication Package [Dataset]. http://doi.org/10.5281/zenodo.5185055
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 12, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simon Eismann; Joel Scheuner; Erwin van Eyk; Maximilian Schwinger; Johannes Grohmann; Nikolas Herbst; Cristina Abad
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The replication package for our article The State of Serverless Applications: Collection, Characterization, and Community Consensus provides everything required to reproduce all results for the following three studies:

    • Serverless Application Collection
    • Serverless Application Characterization
    • Comparison Study

    Serverless Application Collection

    We collect descriptions of serverless applications from open-source projects, academic literature, industrial literature, and scientific computing.

    Open-source Applications

    As a starting point, we used an existing data set on open-source serverless projects from this study. We removed small and inactive projects based on the number of files, commits, contributors, and watchers. Next, we manually filtered the resulting data set to include only projects that implement serverless applications. We provide a table containing all projects that remained after the filtering alongside the notes from the manual filtering.

    Academic Literature Applications

    We based our search on an existing community-curated dataset on literature for serverless computing consisting of over 180 peer-reviewed articles. First, we filtered the articles based on title and abstract. In a second iteration, we filtered out any articles that implement only a single function for evaluation purposes or do not include sufficient detail to enable a review. As the authors were familiar with some additional publications describing serverless applications, we contributed them to the community-curated dataset and included them in this study. We provide a table with our notes from the manual filtering.

    Scientific Computing Applications

    Most of these scientific computing serverless applications are still at an early stage and therefore there is little public data available. One of the authors is employed at the German Aerospace Center (DLR) at the time of writing, which allowed us to collect information about several projects at DLR that are either currently moving to serverless solutions or are planning to do so. Additionally, an application from the German Electron Synchrotron (DESY) could be included. For each of these scientific computing applications, we provide a document containing a description of the project and the names of our contacts that provided information for the characterization of these applications.

    • SC1 Copernicus Sentinel-1 for near-real-time water monitoring
    • SC2 Reprocessing Sentinel 5 Precursor data with ProEO
    • SC3 High-Performance Data Analytics for Earth Observation
    • SC4 Tandem-L exploitation platform
    • SC5 Global Urban Footprint
    • SC6 DESY - High Throughput Data Taking

    Collection of serverless applications

    Based on the previously described methodology, we collected a diverse dataset of 89 serverless applications from open-source projects, academic literature, industrial literature, and scientific computing. This dataset can be found in Dataset.xlsx.

    Serverless Application Characterization

    As previously described, we collected 89 serverless applications from four different sources. Subsequently, two randomly assigned reviewers out of seven available reviewers characterized each application along 22 characteristics in a structured collaborative review sheet. The characteristics and potential values were defined a priori by the authors and iteratively refined, extended, and generalized during the review process. The initial moderate inter-rater agreement was followed by a discussion and consolidation phase, where all differences between the two reviewers were discussed and resolved. The six scientific applications were not publicly available and therefore characterized by a single domain expert, who is either involved in the development of the applications or in direct contact with the development team.

    Initial Ratings & Interrater Agreement Calculation

    The initial reviews are available as a table, where every application is characterized along the 22 characteristics. A single value indicates that both reviewers assigned the same value, whereas a value of the form [Reviewer 2] A | [Reviewer 4] B indicates that for this characteristic, reviewer two assigned the value A, whereas reviewer four assigned the value B.

    Our script for the calculation of the Fleiss' kappa score based on this data is also publicly available. It requires the Python packages pandas and statsmodels. It does not require any input and assumes that the file Initial Characterizations.csv is located in the same folder. It can be executed as follows:

    python3 CalculateKappa.py
    
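For illustration, the standard Fleiss' kappa computation performed by such a script can be sketched as a small NumPy-only function (the toy ratings below are invented; the actual CalculateKappa.py uses pandas and statsmodels and reads Initial Characterizations.csv):

```python
import numpy as np

def fleiss_kappa(table: np.ndarray) -> float:
    """Fleiss' kappa for a subjects x categories table of rating counts.

    table[i, j] = number of raters who assigned category j to subject i.
    Every subject must be rated by the same number of raters.
    """
    table = np.asarray(table, dtype=float)
    n_raters = table.sum(axis=1)[0]
    # Per-subject agreement: fraction of rater pairs that agree.
    p_i = ((table ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                       # observed agreement
    p_j = table.sum(axis=0) / table.sum()    # category proportions
    p_e = (p_j ** 2).sum()                   # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Two reviewers, four applications, two possible values [A, B]:
# each row holds the count of reviewers per value.
ratings = np.array([
    [2, 0],  # both reviewers chose A
    [0, 2],  # both chose B
    [2, 0],
    [1, 1],  # the reviewers disagreed
])
print(round(fleiss_kappa(ratings), 3))
```

Perfect agreement yields a kappa of 1.0; the single disagreement above pulls it down toward the "moderate agreement" range reported before consolidation.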

    Results Including Unknown Data

    In the following discussion and consolidation phase, the reviewers compared their notes and tried to reach a consensus for the characteristics with conflicting assignments. In a few cases, the two reviewers had different interpretations of a characteristic. These conflicts were discussed among all authors to ensure that characteristic interpretations were consistent. However, for most conflicts, the consolidation was a quick process as the most frequent type of conflict was that one reviewer found additional documentation that the other reviewer did not find.

    For six characteristics, many applications were assigned the ''Unknown'' value, i.e., the reviewers were not able to determine the value of this characteristic. Therefore, we excluded these characteristics from this study. For the remaining characteristics, the percentage of ''Unknowns'' ranges from 0–19% with two outliers at 25% and 30%. These ''Unknowns'' were excluded from the percentage values presented in the article. As part of our replication package, we provide the raw results for each characteristic including the ''Unknown'' percentages in the form of bar charts.

    The script for the generation of these bar charts is also part of this replication package. It uses the Python packages pandas, numpy, and matplotlib. It does not require any input and assumes that the file Dataset.csv is located in the same folder. It can be executed as follows:

    python3 GenerateResultsIncludingUnknown.py
    
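The exclusion of ''Unknown'' values from the reported percentages can be sketched as follows (the value counts are invented for illustration, not taken from the dataset):

```python
from collections import Counter

# Hypothetical assignments for one characteristic across 89 applications
# (invented values, not the study's data).
values = ["AWS"] * 40 + ["Azure"] * 20 + ["Unknown"] * 29

counts = Counter(values)
unknown_pct = 100 * counts["Unknown"] / len(values)

# Percentages as reported in the article: "Unknown" entries are excluded
# from the denominator.
known_total = len(values) - counts["Unknown"]
known_pct = {v: round(100 * c / known_total, 1)
             for v, c in counts.items() if v != "Unknown"}

print(unknown_pct, known_pct)
```

The bar charts in the replication package show the raw distribution including the ''Unknown'' share, while the article's percentages correspond to `known_pct`.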

    Final Dataset & Figure Generation

    Following the discussion and consolidation phase described above, we were able to resolve all conflicts, resulting in a collection of 89 applications described by 18 characteristics. This dataset is available here: link

    The script to generate all figures shown in the chapter "Serverless Application Characterization" can be found here. It does not require any input but assumes that the file Dataset.csv is located in the same folder. It uses the Python packages pandas, numpy, and matplotlib. It can be executed as follows:

    python3 GenerateFigures.py
    

    Comparison Study

    To identify existing surveys and datasets that also investigate one of our characteristics, we conducted a literature search using Google as our search engine, as we were mostly looking for grey literature. We used the following search term:

    ("serverless" OR "faas") AND ("dataset" OR "survey" OR "report") after: 2018-01-01
    

    This search term looks for any combination of either serverless or faas alongside any of the terms dataset, survey, or report. We further limited the search to any articles after 2017, as serverless is a fast-moving field and therefore any older studies are likely outdated already. This search term resulted in a total of 173 search results. In order to validate if using only a single search engine is sufficient, and if the search term is broad enough, we

  15. f

    Data from: Database Search Strategies for Proteomic Data Sets Generated by...

    • acs.figshare.com
    text/x-perl
    Updated May 30, 2023
    Cite
    Steve M. M. Sweet; Andrew W. Jones; Debbie L. Cunningham; John K. Heath; Andrew J. Creese; Helen J. Cooper (2023). Database Search Strategies for Proteomic Data Sets Generated by Electron Capture Dissociation Mass Spectrometry [Dataset]. http://doi.org/10.1021/pr9008282.s001
    Explore at:
    text/x-perlAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Steve M. M. Sweet; Andrew W. Jones; Debbie L. Cunningham; John K. Heath; Andrew J. Creese; Helen J. Cooper
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Large data sets of electron capture dissociation (ECD) mass spectra from proteomic experiments are rich in information; however, extracting that information in an optimal manner is not straightforward. Protein database search engines currently available are designed for low resolution CID data, from which Fourier transform ion cyclotron resonance (FT-ICR) ECD data differs significantly. ECD mass spectra contain both z-prime and z-dot fragment ions (and c-prime and c-dot); ECD mass spectra contain abundant peaks derived from neutral losses from charge-reduced precursor ions; FT-ICR ECD spectra are acquired with a larger precursor m/z isolation window than their low-resolution CID counterparts. Here, we consider three distinct stages of postacquisition analysis: (1) processing of ECD mass spectra prior to the database search; (2) the database search step itself and (3) postsearch processing of results. We demonstrate that each of these steps has an effect on the number of peptides identified, with the postsearch processing of results having the largest effect. We compare two commonly used search engines: Mascot and OMSSA. Using an ECD data set of modest size (3341 mass spectra) from a complex sample (mouse whole cell lysate), we demonstrate that search results can be improved from 630 identifications (19% identification success rate) to 1643 identifications (49% identification success rate). We focus in particular on improving identification rates for doubly charged precursors, which are typically low for ECD fragmentation. We compare our presearch processing algorithm with a similar algorithm recently developed for electron transfer dissociation (ETD) data.

  16. h

    msmarco-msmarco-distilbert-base-v3

    • huggingface.co
    Updated Nov 19, 2014
    + more versions
    Cite
    Sentence Transformers (2014). msmarco-msmarco-distilbert-base-v3 [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 19, 2014
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO with hard negatives from msmarco-distilbert-base-v3

    MS MARCO is a large scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
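The hard-negative mining step described above can be sketched with toy embeddings (random vectors standing in for real model embeddings such as those from msmarco-distilbert-base-v3, and k reduced from 50 to 5 for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a query embedding and a corpus of passage embeddings.
query_emb = rng.normal(size=16)
passage_embs = rng.normal(size=(100, 16))   # 100 candidate passages
gold_idx = 7                                 # index of the gold positive

# Cosine similarity between the query and every passage.
sims = passage_embs @ query_emb
sims /= np.linalg.norm(passage_embs, axis=1) * np.linalg.norm(query_emb)

# Take the top-k most similar passages, skipping the gold positive:
# these are the "hard negatives" used for training.
k = 5
order = np.argsort(-sims)
hard_negatives = [i for i in order if i != gold_idx][:k]

print(hard_negatives)
```

In the real dataset this mining is repeated per query with 13 different embedding models, and the union of their top-50 lists forms the negative pool.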

      Related Datasets
    

    These are the datasets generated using the 13 different models:

    msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3.

  17. R

    Food_new Dataset

    • universe.roboflow.com
    zip
    Updated Jul 16, 2024
    Cite
    Allergen30 (2024). Food_new Dataset [Dataset]. https://universe.roboflow.com/allergen30/food_new-uuulf
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Allergen30
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Food Bounding Boxes
    Description

    Allergen30

    About Allergen30

    Allergen30 was created by Mayank Mishra, Nikunj Bansal, Tanmay Sarkar and Tanupriya Choudhury with the goal of building a robust detection model that can assist people in avoiding possible allergic reactions.

    It contains more than 6,000 images of 30 commonly used food items which can cause an adverse reaction within a human body. This dataset is one of the first research attempts in training a deep learning based computer vision model to detect the presence of such food items from images. It also serves as a benchmark for evaluating the efficacy of object detection methods in learning the otherwise difficult visual cues related to food items.

    Description of class labels

    There are multiple food items pertaining to specific food intolerances which can trigger an allergic reaction. Such food intolerances primarily include Lactose, Histamine, Gluten, Salicylate, Caffeine and Ovomucoid intolerance. (Figure: Food intolerance, https://github.com/mmayank74567/mmayank74567.github.io/blob/master/images/FoodIntol.png?raw=true)

    The following table contains the description relating to the 30 class labels in our dataset.

    | S. No. | Allergen | Food label | Description |
    |:---|:---|:---|:---|
    | 1 | Ovomucoid | egg | Images of egg with yolk (e.g. sunny side up eggs) |
    | 2 | Ovomucoid | whole_egg_boiled | Images of soft and hard boiled eggs |
    | 3 | Lactose/Histamine | milk | Images of milk in a glass |
    | 4 | Lactose | icecream | Images of icecream scoops |
    | 5 | Lactose | cheese | Images of swiss cheese |
    | 6 | Lactose/Caffeine | milk_based_beverage | Images of tea/coffee with milk in a cup/glass |
    | 7 | Lactose/Caffeine | chocolate | Images of chocolate bars |
    | 8 | Caffeine | non_milk_based_beverage | Images of soft drinks and tea/coffee without milk in a cup/glass |
    | 9 | Histamine | cooked_meat | Images of cooked meat |
    | 10 | Histamine | raw_meat | Images of raw meat |
    | 11 | Histamine | alcohol | Images of alcohol bottles |
    | 12 | Histamine | alcohol_glass | Images of wine glasses with alcohol |
    | 13 | Histamine | spinach | Images of spinach bundle |
    | 14 | Histamine | avocado | Images of avocado sliced in half |
    | 15 | Histamine | eggplant | Images of eggplant |
    | 16 | Salicylate | blueberry | Images of blueberry |
    | 17 | Salicylate | blackberry | Images of blackberry |
    | 18 | Salicylate | strawberry | Images of strawberry |
    | 19 | Salicylate | pineapple | Images of pineapple |
    | 20 | Salicylate | capsicum | Images of bell pepper |
    | 21 | Salicylate | mushroom | Images of mushrooms |
    | 22 | Salicylate | dates | Images of dates |
    | 23 | Salicylate | almonds | Images of almonds |
    | 24 | Salicylate | pistachios | Images of pistachios |
    | 25 | Salicylate | tomato | Images of tomato and tomato slices |
    | 26 | Gluten | roti | Images of roti |
    | 27 | Gluten | pasta | Images of one serving of penne pasta |
    | 28 | Gluten | bread | Images of bread slices |
    | 29 | Gluten | bread_loaf | Images of bread loaf |
    | 30 | Gluten | pizza | Images of pizza and pizza slices |

    Data collection

    We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.

    Fair use

    This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).

    Citation

  18. e

    Frequency lists of pivot words and GSE counts - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Dec 23, 2022
    Cite
    (2022). Frequency lists of pivot words and GSE counts - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/86055e51-574d-5c42-821d-b0293d389864
    Explore at:
    Dataset updated
    Dec 23, 2022
    Description

    The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled. Pivot words are words with special characteristics that are used to estimate the amount of words in corpora. Pivot words used for the estimation of the number of words indexed by GSE should meet the following criteria: 1) frequency of occurrence between 10 and 100; 2) do not coincide with regular words in another language; 3) longer than 6 letters; 4) not of international origin; 5) not foreign loanwords; 6) not proper names of any kind; 7) not headword forms; 8) with only basic Latin letters; 9) not specific to a particular domain or time period; 10) do not coincide with variants of other words when diacritics are removed; 11) when commonly misspelled, do not coincide with words in other languages. The low frequency of pivot words is crucial for treating the count of document matches reported by a GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published in https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf.
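The mechanical subset of these criteria (frequency window, word length, basic Latin letters only) can be sketched as a filter; the candidate words and frequencies below are invented, and the judgement-based criteria (loanwords, proper names, etc.) are not automated:

```python
import re

def is_pivot_candidate(word: str, freq: int) -> bool:
    """Check a lowercase word against the mechanical pivot-word criteria:
    frequency of occurrence 10-100, longer than 6 letters, and only
    basic Latin letters (no diacritics)."""
    return (10 <= freq <= 100
            and len(word) > 6
            and re.fullmatch(r"[a-z]+", word) is not None)

# Invented candidate words with invented corpus frequencies.
candidates = {"žodynas": 42, "pakalne": 57, "pakalnutems": 12, "namas": 80}
pivots = sorted(w for w, f in candidates.items() if is_pivot_candidate(w, f))
print(pivots)
```

Here "žodynas" fails the basic-Latin criterion (it contains "ž") and "namas" fails the length criterion, so only the two longer all-ASCII words survive.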

  19. World - Twitter Sentiment By Country

    • kaggle.com
    zip
    Updated Nov 10, 2020
    Cite
    William Jiang (2020). World - Twitter Sentiment By Country [Dataset]. https://www.kaggle.com/wjia26/twittersentimentbycountry
    Explore at:
    zip (787579784 bytes). Available download formats
    Dataset updated
    Nov 10, 2020
    Authors
    William Jiang
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    World
    Description


    Introduction

    Ever wondered what people are saying about certain countries, and whether it's in a positive or negative light? What are the most commonly used phrases and words to describe a country? This dataset presents tweets where a country is mentioned in the hashtags (e.g. #HongKong, #NewZealand). It covers around 150 countries. I've added an additional field called polarity, which holds the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!

    Content

    Each row represents a tweet. Creation dates of tweets range from 12/07/2020 to 25/07/2020, and the dataset will be updated on a monthly cadence.
    - The country can be derived from the file_name field (this field is very Tableau-friendly when it comes to plotting maps).
    - The date at which the tweet was created is in the created_at field.
    - The search query used to query the Twitter Search Engine is in the search_query field.
    - The full tweet text is in the text field.
    - The sentiment is in the polarity field (computed with the VADER model from NLTK).
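    As a small illustration of working with these fields, the sketch below averages the polarity column per file_name (one file per country, per the description) using only the standard library. Column names follow the description above; the sample rows are fabricated.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def mean_polarity_by_country(csv_text):
    """Average the `polarity` column per `file_name` (one file per country)."""
    polarity = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        polarity[row["file_name"]].append(float(row["polarity"]))
    return {country: mean(values) for country, values in polarity.items()}

# Fabricated sample in the column layout described above.
sample = """file_name,created_at,search_query,text,polarity
HongKong,2020-07-12,#HongKong,hello,0.5
HongKong,2020-07-13,#HongKong,world,0.1
NewZealand,2020-07-12,#NewZealand,kia ora,0.8
"""
print(mean_polarity_by_country(sample))
```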

    Notes

    There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.
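    A simple first-occurrence de-duplication pass for the affected date range could look like the following; the field name "id" is an assumption about the schema, so adjust it to the actual column name.

```python
def dedupe_tweets(rows, key="id"):
    """Keep only the first occurrence of each tweet id.

    `rows` is an iterable of dicts; `key` names the tweet-id field
    (assumed to be "id" here).
    """
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:  # first time we see this id
            seen.add(row[key])
            unique.append(row)
    return unique

rows = [
    {"id": "1", "text": "first copy"},
    {"id": "2", "text": "other tweet"},
    {"id": "1", "text": "duplicate"},
]
print([r["id"] for r in dedupe_tweets(rows)])  # → ['1', '2']
```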

    Acknowledgements

    Thanks to the tweepy package for making the data extraction via Twitter API so easy.

    Shameless Plug

    Feel free to check out my blog if you want to learn how I built the data lake via AWS, or for other data shenanigans.

    Here's an app I built using a live version of this data.

  20. 1M+ Car Images | AI Training Data | Object Detection Data | Annotated imagery data | Global Coverage

    • datarade.ai
    Cite
    Data Seeds, 1M+ Car Images | AI Training Data | Object Detection Data | Annotated imagery data | Global Coverage [Dataset]. https://datarade.ai/data-products/750k-car-images-ai-training-data-object-detection-data-data-seeds
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt. Available download formats
    Dataset authored and provided by
    Data Seeds
    Area covered
    Åland Islands, Indonesia, Pitcairn, Libya, Azerbaijan, Bonaire, Zambia, Poland, Tajikistan, Palau
    Description

    This dataset features over 1,000,000 high-quality images of cars, sourced globally from photographers, enthusiasts, and automotive content creators. Optimized for AI and machine learning applications, it provides richly annotated and visually diverse automotive imagery suitable for a wide array of use cases in mobility, computer vision, and retail.

    Key Features: 1. Comprehensive Metadata: each image includes full EXIF data and detailed annotations such as car make, model, year, body type, view angle (front, rear, side, interior), and condition (e.g., showroom, on-road, vintage, damaged). Ideal for training in classification, detection, OCR for license plates, and damage assessment.

    2. Unique Sourcing Capabilities: the dataset is built from images submitted through a proprietary gamified photography platform with auto-themed competitions. Custom datasets can be delivered within 72 hours targeting specific brands, regions, lighting conditions, or functional contexts (e.g., race cars, commercial vehicles, taxis).

    3. Global Diversity: contributors from over 100 countries ensure broad coverage of car types, manufacturing regions, driving orientations, and environmental settings, from luxury sedans in urban Europe to pickups in rural America and tuk-tuks in Southeast Asia.

    4. High-Quality Imagery: images range from standard to ultra-HD and include professional-grade automotive photography, dealership shots, roadside captures, and street-level scenes. A mix of static and dynamic compositions supports diverse model training.

    5. Popularity Scores: each image includes a popularity score derived from GuruShots competition performance, offering valuable signals for consumer appeal, aesthetic evaluation, and trend modeling.

    6. AI-Ready Design: this dataset is structured for use in applications like vehicle detection, make/model recognition, automated insurance assessment, smart parking systems, and visual search. It's compatible with all major ML frameworks and edge-device deployments.

    7. Licensing & Compliance: fully compliant with privacy and automotive content use standards, offering transparent and flexible licensing for commercial and academic use.
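    As an illustration of consuming annotations like those described above, the sketch below filters hypothetical image records by make, view angle, and popularity. The field names mirror the description, but the real delivery schema may differ.

```python
def filter_images(records, make=None, view_angle=None, min_popularity=0.0):
    """Select image records matching the requested annotation fields.

    Field names (make, view_angle, popularity) follow the annotations
    described above; they are assumptions, not a documented schema.
    """
    return [
        r for r in records
        if (make is None or r["make"] == make)
        and (view_angle is None or r["view_angle"] == view_angle)
        and r.get("popularity", 0.0) >= min_popularity
    ]

# Fabricated records for illustration.
records = [
    {"make": "Toyota", "view_angle": "front", "popularity": 0.9},
    {"make": "Toyota", "view_angle": "rear", "popularity": 0.4},
    {"make": "Ford", "view_angle": "front", "popularity": 0.7},
]
print(len(filter_images(records, make="Toyota", view_angle="front")))  # → 1
```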

    Use Cases:
    1. Training AI for vehicle recognition in smart city, surveillance, and autonomous driving systems.
    2. Powering car search engines, automotive e-commerce platforms, and dealership inventory tools.
    3. Supporting damage detection, condition grading, and automated insurance workflows.
    4. Enhancing mobility research, traffic analytics, and vision-based safety systems.

    This dataset delivers a large-scale, high-fidelity foundation for AI innovation in transportation, automotive tech, and intelligent infrastructure. Custom dataset curation and region-specific filters are available. Contact us to learn more!
