https://www.nist.gov/open/license
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training on click logs or on labels from shallow pools (such as the pooling in the TREC Million Query Track, or the evaluation of search engines based on early precision). Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. As in previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
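As a minimal sketch of working with the relevance judgments, TREC collections distribute qrels as plain-text files in the standard four-column format "query_id iteration doc_id relevance"; the file name below is a hypothetical placeholder:

from collections import defaultdict

qrels = defaultdict(dict)
with open("2023.qrels.pass.txt") as f:  # hypothetical file name
    for line in f:
        qid, _iteration, docid, rel = line.split()
        qrels[qid][docid] = int(rel)

# Count queries with at least one positively labeled document/passage.
print(sum(1 for docs in qrels.values() if any(r > 0 for r in docs.values())))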
This dataset was curated for Search Engine Optimization (SEO) analysis tasks, including categorization and spam detection. It covers 12 diverse topics: basketball, books, cats, gardening, history, movies, music, recipes, sports, technology, travel, and weather. Some topics have hierarchical relationships, such as sports and basketball, while others are closely related (e.g., movies and music) or unrelated (e.g., basketball and gardening), with varying degrees of overlap among them. For each topic, approximately 300 search queries were generated using large language models (LLMs) like GPT, Llama, and Claude. The top 10 URLs from the Google Search Console’s search engine results page (SERP) were retrieved for each query.
Dataset from Visual Product Recognition Challenge 2023 served on AI-Crowd platform https://www.aicrowd.com/challenges/visual-product-recognition-challenge-2023
The organizers provided only test datasets; for training, you could use any dataset, e.g., Products-10K, or your own. You can get all the information about the dataset on the same page. Here is the main description.
Test set format
The test set contains two files: gallery.csv and queries.csv.
gallery.csv defines the database of images from marketplaces. Each row contains the following information:
- product_id - unique int32 identifier of the product image, used in the result-ranking NumPy array;
- img_path - path to the product image in the "data" folder.
queries.csv defines a set of user images that will be used as queries to search the database. Each row contains the following information:
- product_id - unique int32 identifier of the user image, used in the result-ranking NumPy array;
- img_path - path to the user image in the "data" folder;
- bbox_x, bbox_y, bbox_w, bbox_h - bounding box coordinates of the product in the user image.
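A minimal sketch of reading the two test-set files and cropping a query image to its product bounding box, using the columns described above (whether img_path already includes the "data" prefix is an assumption):

import pandas as pd
from PIL import Image

gallery = pd.read_csv("gallery.csv")   # columns: product_id, img_path
queries = pd.read_csv("queries.csv")   # plus bbox_x, bbox_y, bbox_w, bbox_h

row = queries.iloc[0]
img = Image.open(row.img_path)  # assumed resolvable from the working directory
# bbox is given as (x, y, width, height); PIL's crop expects (left, top, right, bottom).
crop = img.crop((row.bbox_x, row.bbox_y,
                 row.bbox_x + row.bbox_w, row.bbox_y + row.bbox_h))
crop.save("query_0_crop.jpg")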
The confluence of Search and Recommendation (S&R) services is a vital aspect of online content platforms like Kuaishou and TikTok. Integrating S&R modeling is a highly intuitive approach adopted by industry practitioners. However, there is a noticeable lack of research in this area within academia, primarily due to the absence of publicly available datasets, and a substantial gap has consequently emerged between academia and industry. To bridge this gap, we introduce KuaiSAR, the first large-scale, real-world dataset of integrated Search And Recommendation behaviors, collected from Kuaishou, a leading short-video app in China with over 300 million daily active users. Previous research in this field has predominantly employed publicly available datasets that are semi-synthetic and simulated, with artificially fabricated search behaviors. Distinct from previous datasets, KuaiSAR records genuine user behaviors, the occurrence of each interaction within either the search or the recommendation service, and users' transitions between the two services. This work aids the joint modeling of S&R and the utilization of search data for recommenders (and recommendation data for search engines). Additionally, due to the diverse feedback labels of user-video interactions, KuaiSAR also supports a wide range of other tasks, including intent recommendation, multi-task learning, and long sequential multi-behavior modeling. We believe this dataset will facilitate innovative research and enrich our understanding of S&R service integration in real-world applications.
Dataset Name: Google Search Trends Top Rising Search Terms Description: The Google Search Trends Top Rising Search Terms dataset provides valuable insights into the most rapidly growing search queries on the Google search engine. It offers a comprehensive collection of trending search… See the full description on the dataset page: https://huggingface.co/datasets/hoshangc/google_search_terms_training_data.
https://dataintelo.com/privacy-and-policy
The search engine market size was valued at approximately USD 124 billion in 2023 and is projected to reach USD 258 billion by 2032, witnessing a robust CAGR of 8.5% during the forecast period. This growth is largely attributed to the increasing reliance on digital platforms and the internet across various sectors, which has necessitated the use of search engines for data retrieval and information dissemination. With the proliferation of smartphones and the expansion of internet access globally, search engines have become indispensable tools for both businesses and consumers, driving the market's upward trajectory. The integration of artificial intelligence and machine learning technologies into search engines is transforming the way search engines operate, offering more personalized and efficient search results, thereby further propelling market growth.
One of the primary growth factors in the search engine market is the ever-increasing digitalization across industries. As businesses continue to transition from traditional modes of operation to digital platforms, the need for search engines to navigate and manage data becomes paramount. This shift is particularly evident in industries such as retail, BFSI, and healthcare, where vast amounts of data are generated and require efficient management and retrieval systems. The integration of AI and machine learning into search engine algorithms has enhanced their ability to process and interpret large datasets, thereby improving the accuracy and relevance of search results. This technological advancement not only improves user experience but also enhances the competitive edge of businesses, further fueling market growth.
Another significant growth factor is the expanding e-commerce sector, which relies heavily on search engines to connect consumers with products and services. With the rise of e-commerce giants and online marketplaces, consumers are increasingly using search engines to find the best prices, reviews, and availability of products, leading to a surge in search engine usage. Additionally, the implementation of voice search technology and the growing popularity of smart home devices have introduced new dynamics to search engine functionality. Consumers are now able to conduct searches verbally, which has necessitated the adaptation of search engines to incorporate natural language processing capabilities, further driving market growth.
The advertising and marketing sectors are also contributing significantly to the growth of the search engine market. Businesses are leveraging search engines as a primary tool for online advertising, given their wide reach and ability to target specific audiences. Pay-per-click advertising and search engine optimization strategies have become integral components of digital marketing campaigns, enabling businesses to enhance their visibility and engagement with potential customers. The measurable nature of these advertising techniques allows businesses to assess the effectiveness of their campaigns and make data-driven decisions, thereby increasing their reliance on search engines and contributing to overall market growth.
The evolution of search engines is closely tied to the development of AI Enterprise Search, which is revolutionizing how businesses access and utilize information. AI Enterprise Search leverages artificial intelligence to provide more accurate and contextually relevant search results, making it an invaluable tool for organizations that manage large volumes of data. By understanding user intent and learning from past interactions, AI Enterprise Search systems can deliver personalized experiences that enhance productivity and decision-making. This capability is particularly beneficial in sectors such as finance and healthcare, where quick access to precise information is crucial. As businesses continue to digitize and data volumes grow, the demand for AI Enterprise Search solutions is expected to increase, further driving the growth of the search engine market.
Regionally, North America holds a significant share of the search engine market, driven by the presence of major technology companies and a well-established digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation in emerging economies such as China and India, where increasing internet penetration and smartphone adoption are driving demand for search engines. Additionally, government initiatives to
Deep Learning Hard (DL-HARD) is an annotated dataset designed to more effectively evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine.
DL-HARD contains 50 queries from the official 2019/2020 evaluation benchmark, half of which are newly and independently assessed. Overall, DL-HARD is a new resource that promotes research on neural ranking methods by focusing on challenging and complex queries.
https://dataintelo.com/privacy-and-policy
The global crawler based search engine market size was estimated to be USD 25 billion in 2023 and is projected to reach USD 75 billion by 2032, growing at a compound annual growth rate (CAGR) of 12.5% during the forecast period. This growth is driven by the increasing need for sophisticated search engine solutions in various industries such as e-commerce, BFSI, and healthcare. The demand for efficient data retrieval and the rising importance of search engine optimization (SEO) are significant factors fueling market expansion.
One of the primary growth factors for the crawler based search engine market is the exponential growth of data generated across different platforms. With the advent of big data and the Internet of Things (IoT), the amount of structured and unstructured data has surged, necessitating advanced search solutions that can efficiently index and retrieve relevant information. This has led to the adoption of crawler-based search engines, which are capable of handling large volumes of data and providing accurate search results quickly. Furthermore, the increasing reliance on digital platforms for business operations and customer interactions is also pushing companies to invest in robust search engine technologies.
Another contributing factor to the market's growth is the rising importance of personalized search experiences. Modern consumers expect search engines to understand their preferences and deliver highly relevant results. Crawler-based search engines utilize advanced algorithms and artificial intelligence (AI) techniques to analyze user behavior and preferences, thereby offering personalized search experiences. This not only enhances user satisfaction but also boosts engagement and retention rates, making these search engines an attractive investment for businesses across various sectors.
Moreover, the growing emphasis on search engine optimization (SEO) and digital marketing strategies has further bolstered the demand for crawler-based search engines. Businesses are increasingly leveraging these search engines to optimize their online presence and improve their search engine rankings. By crawling and indexing web pages efficiently, these search engines enable businesses to gain insights into their website performance and make data-driven decisions to enhance their SEO strategies. This, in turn, drives market growth as companies strive to stay competitive in the digital landscape.
Insight Engines are becoming increasingly vital in the realm of data management and retrieval. These engines are designed to provide users with deeper insights by analyzing large datasets and delivering contextual information. As businesses generate vast amounts of data, Insight Engines help in transforming this data into actionable insights, enabling organizations to make informed decisions. They leverage advanced technologies such as natural language processing and machine learning to understand user queries and provide precise answers. This capability is particularly beneficial for industries that rely heavily on data-driven strategies, as it enhances the ability to uncover hidden patterns and trends within data.
Regionally, North America holds a significant share of the crawler-based search engine market, primarily due to the presence of major technology companies and the rapid adoption of advanced search solutions in the region. The Asia Pacific region is also expected to witness substantial growth during the forecast period, driven by the increasing digitization efforts and the rising number of internet users in countries like China and India. Additionally, Europe and Latin America are anticipated to contribute to market growth, supported by the growing emphasis on digital transformation and data-driven decision-making in these regions.
The crawler-based search engine market can be segmented by component into software, hardware, and services. The software segment dominates the market, driven by the continuous advancements in search engine algorithms and the integration of artificial intelligence (AI) and machine learning (ML) technologies. Search engines are becoming more sophisticated, capable of understanding natural language queries and providing more accurate and relevant search results. The demand for such advanced software solutions is increasing as businesses seek to enhance their search capabilities and deliver better user experiences.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset gathers the most crucial SEO statistics for the year, providing an overview of the dominant trends and best practices in the field of search engine optimization. Aimed at digital marketing professionals, site owners, and SEO analysts, this collection of information serves as a guide to navigate the evolving SEO landscape with confidence and accuracy.
Mode of Data Production: The statistics have been carefully selected and compiled from a variety of credible and recognized sources in the SEO industry, including research reports, web traffic data analytics, and consumer and marketing professional surveys. Each statistic was checked for reliability and relevance to current trends.
Categories Included:
- User search behaviour: Statistics on the evolution of search modes, including voice and mobile search.
- Mobile Optimisation: Data on the importance of site optimization for mobile devices.
- Importance of Backlinks: Insights on the role of backlinks in SEO ranking and the need to prioritize quality.
- Content quality: Statistics highlighting the importance of relevant and engaging content for SEO.
- Search engine algorithms: Information on the impact of algorithm updates on SEO strategies.
Usefulness of the Data: This dataset is designed to help users quickly understand current SEO dynamics and apply that knowledge in optimizing their digital marketing strategies. It provides a solid foundation for benchmarking, strategic planning, and informed decision-making in the field of SEO.
Update and Accessibility: To ensure relevance and timeliness, the dataset will be regularly updated with new information and emerging trends in the SEO world.
MMSearch 🔥: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Official repository for the paper "MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines". 🌟 For more details, please refer to the project page with dataset exploration and visualization tools: https://mmsearch.github.io/. [🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset] [🏆 Leaderboard] [🔍 Visualization]
💥 News
[2024.09.25] 🌟 The evaluation code now… See the full description on the dataset page: https://huggingface.co/datasets/CaraJ/MMSearch.
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training and 7,600 testing samples.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('ag_news_subset', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
According to our latest research, the Quantum-Enhanced Neural Search Engine market size reached USD 1.82 billion globally in 2024, reflecting the rapid adoption of quantum computing and advanced neural network architectures in enterprise search solutions. The market is projected to grow at a robust CAGR of 28.7% from 2025 to 2033, culminating in a forecasted market size of USD 15.46 billion by the end of 2033. This remarkable trajectory is primarily driven by the demand for highly efficient, accurate, and context-aware search engines capable of processing vast and complex datasets across industries.
Several key growth factors are propelling the quantum-enhanced neural search engine market forward. The exponential increase in unstructured data, combined with the limitations of classical search algorithms, has created a significant need for more sophisticated search technologies. Quantum computing, when integrated with neural search algorithms, delivers unparalleled computational power and speed, enabling real-time semantic understanding and contextual relevance in search results. Organizations across sectors such as healthcare, finance, and e-commerce are investing heavily in these technologies to improve data-driven decision-making, enhance user experiences, and maintain a competitive edge in the digital era. The synergy between quantum computing and neural networks is unlocking new possibilities for natural language processing, image recognition, and predictive analytics, further fueling market growth.
Another significant driver is the growing adoption of artificial intelligence and machine learning across enterprise operations. As businesses transition towards digital transformation, the need for intelligent search capabilities that can extract actionable insights from massive datasets becomes increasingly critical. Quantum-enhanced neural search engines offer a transformative leap in search efficiency, delivering faster and more accurate results than traditional systems. This is particularly valuable for industries dealing with sensitive or time-critical information, such as BFSI and healthcare, where the ability to retrieve relevant data instantaneously can have a direct impact on operational efficiency and customer satisfaction. Additionally, the scalability and adaptability of these solutions make them attractive to both large enterprises and SMEs, supporting widespread market penetration.
The ongoing advancements in quantum hardware and software ecosystems are also contributing to the market’s expansion. Major technology players and startups alike are investing in the development of quantum processors, quantum-safe algorithms, and hybrid quantum-classical architectures tailored for search applications. As quantum computing becomes more accessible through cloud-based platforms, organizations of all sizes can leverage its power without the need for significant upfront infrastructure investments. This democratization of quantum technology is expected to accelerate adoption rates, drive innovation in search engine design, and lower barriers to entry for new market participants. Furthermore, collaborative efforts between academia, industry, and government agencies are fostering a vibrant ecosystem that supports research, standardization, and commercialization of quantum-enhanced neural search solutions.
From a regional perspective, North America currently leads the quantum-enhanced neural search engine market, accounting for the largest share in 2024, primarily due to its advanced technological infrastructure, significant R&D investments, and early adoption by key industry players. Europe follows closely, supported by robust governmental initiatives and a strong presence of quantum research institutions. The Asia Pacific region is witnessing the fastest growth, driven by increasing digitalization, expanding tech startups, and supportive regulatory frameworks, particularly in countries like China, Japan, and South Korea. Latin America and the Middle East & Africa are also emerging as promising markets, with growing interest in quantum technologies and AI-driven solutions to address local industry challenges. Each region presents unique opportunities and challenges, shaping the competitive landscape and influencing market dynamics over the forecast period.
https://crawlfeeds.com/privacy_policy
This dataset is a sample extraction of product listings from Zoro.com, a leading industrial supply e-commerce platform. It provides structured product-level data that can be used for market research, price comparison engines, product matching models, and e-commerce analytics.
The sample includes a variety of products across tools, hardware, safety equipment, and industrial supplies — with clean, structured fields suitable for both analysis and model training.
Also available: Grainger Product Datasets – structured data from a top industrial supplier.
Submit your custom data requests via the Zoro products page or contact us directly at contact@crawlfeeds.com.
Ideal for previewing before requesting larger or full Zoro datasets
Building product comparison or search engines
Price intelligence and competitor monitoring
Product classification and attribute extraction
Training data for e-commerce AI models
This is a sample of a much larger dataset extracted from Zoro.com.
👉 Contact us to access full datasets or request custom category extractions.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.
We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.
Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo.
Finally, we compared scholarly journals' suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA published in 2012 and 2016, combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories.
Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results.
We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions of the Trip Database health web search engine.
It provides:
- Approximately 5.2 million user interactions
- An IR evaluation benchmark
- Training data for deep learning IR models
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they best capture):
Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)
Citation Count: The total number of citations of the product, the most well-known influence indicator.
PageRank score: An influence indicator based on the PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)
RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)
Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and the time window depends on the publication date of the product, i.e., only citations 3 years after each product's publication are counted.
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).
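To make the definitions above concrete, here is a minimal sketch (ours, not the official implementation) that computes the Citation Count, PageRank score, and 3-year Incubation Citation Count on a hypothetical toy citation network:

import networkx as nx

# Toy data: an edge (a, b) means "a cites b"; years are publication dates.
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p1"), ("p4", "p3")]
year = {"p1": 2010, "p2": 2012, "p3": 2015, "p4": 2020}

G = nx.DiGraph(edges)

# Citation Count: the in-degree of each node in the citation network.
citation_count = {n: G.in_degree(n) for n in G}

# PageRank score: network centrality of each product.
pagerank = nx.pagerank(G, alpha=0.85)

# Incubation Citation Count: only citations arriving within 3 years
# of the cited product's publication are counted.
incubation_cc = {
    n: sum(1 for citing, cited in G.in_edges(n)
           if year[citing] - year[cited] <= 3)
    for n in G
}

print(citation_count, pagerank, incubation_cc, sep="\n")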
From version 5.1 onward, the impact indicators are calculated at two levels: the PID level and the level of deduplicated products identified by OpenAIRE ids (see below).
Previous versions of the dataset only provided the scores at the PID level.
From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).
Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
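As an illustration only (an assumption of ours, not the official code producing the files), the C1-C5 classes above could be assigned from a score distribution along these lines:

import numpy as np

def impact_class(scores):
    """Map each score to C1 (top 0.01%) ... C5 (bottom 90%)."""
    ranks = scores.argsort().argsort()        # 0 = smallest score
    pct = 100.0 * (ranks + 1) / len(scores)   # percentile rank in (0, 100]
    labels = []
    for p in pct:
        top = 100.0 - p                       # distance from the very top
        if top < 0.01:
            labels.append("C1")
        elif top < 0.1:
            labels.append("C2")
        elif top < 1.0:
            labels.append("C3")
        elif top < 10.0:
            labels.append("C4")
        else:
            labels.append("C5")
    return labels

scores = np.random.default_rng(0).random(10_000)
print(impact_class(scores)[:5])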
Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (by error) to multiple deduplicated records in the OpenAIRE Graph.
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier
From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. We provide finally the "topic_based_impact_classes.txt" file where each line follows the format "identifier
The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI dataset, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources have been considered. In addition, versions later than v.10 leverage the filtering rules described here to remove from the dataset PIDs with problematic metadata.
References:
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
[2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
[3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
[4] P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).
[5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)
[6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839
[7] K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045
Find our Academic Search Engine built on top of these data here. Note that we also provide all calculated scores through BIP! Finder's API.
Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
More details about BIP! DB can be found in our relevant peer-reviewed publication:
Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460
We kindly request that any published research that makes use of BIP! DB cite the above article.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Table S1. The list of selected activity types in PubChem.
Knowledge about the general structure of the hyperlink graph is important for designing ranking methods for search engines. To improve the rankings that search engines calculate for different websites, search engine optimization agencies focus on the linkage structure of their clients' sites. An extreme form of ranking manipulation manifests in spam networks, where pages and websites publishing dubious content try to increase their ratings by setting a massive number of links to other pages in order to receive backlinks in return. The WDC Hyperlink Graph at the first-level-subdomain level has been extracted from the Common Crawl 2012 web corpus and covers 95 million first-level subdomains, linked by almost 2 billion connections derived from the hyperlinks of the pages contained in those subdomains.
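As a hedged illustration (the file name and edge-list format below are assumptions, not part of the WDC distribution), a linkage-structure scan over such a graph might flag subdomains whose out-links vastly outnumber their backlinks, a crude signal of the link-spam behaviour described above:

from collections import Counter

out_deg, in_deg = Counter(), Counter()
with open("hyperlink_graph_edges.tsv") as f:  # hypothetical: one "src<TAB>dst" edge per line
    for line in f:
        src, dst = line.rstrip("\n").split("\t")
        out_deg[src] += 1
        in_deg[dst] += 1

# Flag subdomains with very many out-links but few backlinks.
suspicious = [n for n, d in out_deg.items()
              if d > 1000 and d > 50 * (in_deg[n] + 1)]
print(len(suspicious), "subdomains with spam-like linkage profiles")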
The research focus in the field of remotely sensed imagery has shifted from the collection and warehousing of data, tasks for which a mature technology already exists, to the auto-extraction of information and knowledge discovery from this valuable resource, tasks for which technology is still under active development. In particular, intelligent algorithms for the analysis of very large rasters, either high resolution images or medium resolution global datasets, which are becoming more and more prevalent, are lacking. We propose to develop the Geospatial Pattern Analysis Toolbox (GeoPAT), a computationally efficient, scalable, and robust suite of algorithms that supports GIS processes such as segmentation, unsupervised/supervised classification of segments, query and retrieval, and change detection in giga-pixel and larger rasters. At the core of the technology that underpins GeoPAT is the novel concept of pattern-based image analysis. Unlike pixel-based or object-based (OBIA) image analysis, GeoPAT partitions an image into overlapping square scenes containing 1,000 to 100,000 pixels and performs further processing on those scenes using pattern signatures and pattern similarity, concepts first developed in the field of Content-Based Image Retrieval. This fusion of methods from two different areas of research yields an orders-of-magnitude performance boost when applied to very large images, without sacrificing quality of the output.
GeoPAT v.1.0 already exists as a GRASS GIS add-on that has been developed and tested on medium resolution continental-scale datasets, including the National Land Cover Dataset and the National Elevation Dataset. The proposed project will develop GeoPAT v.2.0, a much improved and extended version of the present software. We estimate an overall entry TRL for GeoPAT v.1.0 of 3-4 and a planned exit TRL for GeoPAT v.2.0 of 5-6. Moreover, several new important functionalities will be added. Proposed improvements include conversion of GeoPAT from a GRASS add-on into stand-alone software capable of being integrated with other systems, full implementation of a web-based interface, new modules extending its applicability to high resolution images/rasters and medium resolution climate data, extension to the spatio-temporal domain, support for hierarchical search and segmentation, development of improved pattern signatures and similarity measures, parallelization of the code, and implementation of a divide-and-conquer strategy to speed up selected modules.
The proposed technology will contribute to a wide range of Earth Science investigations and missions by enabling the extraction of information from diverse types of very large datasets. Analyzing an entire dataset without sub-dividing it due to software limitations offers the important advantages of uniformity and consistency. We propose to demonstrate GeoPAT technology on two specific applications. The first is a web-based, real-time visual search engine for local physiography that supports query-by-example on the entire global-extent SRTM 90 m resolution dataset: the user selects a region where a process of interest is known to occur, and the search engine identifies other areas around the world with similar physiographic character and thus the potential for a similar process. The second is monitoring urban areas in their entirety at high resolution, including mapping of impervious surfaces and identification of settlements for improved disaggregation of census data.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The main file contains an entry (N=28530) per search result in all collected pages. It comprises the following columns:
Manually annotated abstracts resulting from the searches.
The zip contains one HTML file per collected search engine result page (N=2853). See the filename column in the main dataset.