MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AI Search Providers Benchmark Dataset
📊 Dataset Structure
Each entry contains:
- id: Unique identifier for the QA pair
- question: The query text
- expected_answer: The correct answer
- category: Topic category
- area: Broader area classification (News/Knowledge)
🎯 Categories
The dataset covers various domains including:
Entertainment, Sports, Technology, General News, Finance, Architecture, Arts, Astronomy, Auto (Automotive), E-sports, Fashion, False Premise
📈… See the full description on the dataset page: https://huggingface.co/datasets/junzhang1207/search-dataset.
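The entry layout listed above can be sketched as a small typed record with a validation step. This is only an illustration: the field names come from the list above, while the `QAEntry` and `parse_entry` names, the assumption that `area` is exactly "News" or "Knowledge", and the sample record are all hypothetical and not taken from the dataset itself.

```python
from dataclasses import dataclass

# Assumption: the "(News/Knowledge)" note means these are the only two areas.
ALLOWED_AREAS = {"News", "Knowledge"}
REQUIRED_FIELDS = ("id", "question", "expected_answer", "category", "area")


@dataclass(frozen=True)
class QAEntry:
    id: str
    question: str
    expected_answer: str
    category: str
    area: str


def parse_entry(raw: dict) -> QAEntry:
    """Build a QAEntry, rejecting missing fields or an unknown area."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if raw["area"] not in ALLOWED_AREAS:
        raise ValueError(f"unexpected area: {raw['area']!r}")
    return QAEntry(**{name: str(raw[name]) for name in REQUIRED_FIELDS})


# Hypothetical record, invented for illustration only.
sample = {
    "id": "example-001",
    "question": "Which planet has the most known moons?",
    "expected_answer": "Saturn",
    "category": "Astronomy",
    "area": "Knowledge",
}
entry = parse_entry(sample)
print(entry.category, entry.area)
```

Validating records up front like this makes it easier to group benchmark results by `category` or `area` later without guarding against malformed rows.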
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files represent the data and accompanying documents of an independent research study by a student researcher examining the searchability and usability of machine learning dataset metadata.
The purpose of this exploratory study was to understand how machine learning (ML) practitioners are searching for and evaluating datasets for use in their work. This research will help inform development of the ML dataset metadata standard Croissant, which is actively being developed by the Croissant MLCommons working group, so it can aid ML practitioners' workflows and promote best practices like Responsible Artificial Intelligence (RAI).
The study consisted of a pre-interview Qualtrics survey ("Survey_questions_pre_interview.pdf") that focused on ranking various metadata elements on a Likert importance scale.
The interview consisted of open questions ("Interview_script_and_questions.pdf") on a range of topics from search of datasets to interoperability to AI used in dataset search. Additionally, participants were asked to share their screen at one point and recall a recent dataset search they had performed.
The resulting survey dataset ("Survey_p1.csv") and interview transcript ("Interview_p1.txt") are presented in open standard formats for accessibility. Identifying data has been removed, so some columns and rows referenced in the files may be missing.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation
FinDER is a benchmark dataset designed for evaluating Retrieval-Augmented Generation (RAG) in financial question answering. It consists of 5,703 expert-annotated query–evidence–answer triplets derived from real-world 10-K filings and ambiguous financial queries submitted by industry professionals. This dataset captures the domain-specific challenges of financial QA, including short… See the full description on the dataset page: https://huggingface.co/datasets/Linq-AI-Research/FinDER.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Chest Finder is a dataset for object detection tasks - it contains Chests annotations for 245 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.archivemarketresearch.com/privacy-policy
The U.S. AI Training Dataset Market was valued at USD 590.4 million in 2023 and is projected to reach USD 1880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market covers the generation, selection, and organization of datasets used to train artificial intelligence. These datasets contain the information that machine learning algorithms need to learn and make inferences. Uses include developing and improving AI solutions in business fields such as transport, medical analysis, language processing, and financial metrics. Applications include training models for tasks such as image classification, predictive modeling, and natural language interfaces. Emerging trends include a shift toward larger, higher-quality, more diverse, and annotated data to improve model performance; synthetic data generation to address data shortages; and attention to data confidentiality and ethical issues in dataset management. Furthermore, advances in artificial intelligence and machine learning are driving noticeable growth in the building and use of datasets. Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that gives Google real-time access to Reddit's data and uses Google AI to enhance Reddit's search capabilities. In February 2024, Microsoft announced an investment of around USD 2.1 billion in Mistral AI to expedite the growth and deployment of large language models. The U.S. giant is expected to support Mistral AI with Azure AI supercomputing infrastructure to provide scale and performance for AI training and inference workloads.
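As a rough sanity check on the figures quoted above, the standard compound annual growth rate formula can be applied to the two market sizes. This sketch assumes the growth window is the nine years from 2023 to 2032; under that assumption the implied rate comes out near 13.7% rather than the stated 18.0%, which suggests the report may define its forecast period or base value differently. The function name `cagr` is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


START_USD_M = 590.4    # 2023 valuation from the text, in USD million
END_USD_M = 1880.70    # 2032 projection from the text, in USD million
YEARS = 2032 - 2023    # assumption: growth is measured over these nine years

implied = cagr(START_USD_M, END_USD_M, YEARS)
print(f"Implied CAGR over {YEARS} years: {implied:.1%}")

# For comparison: where the market would land in 2032 if it really grew
# at the stated 18.0% per year from the 2023 base.
projected = START_USD_M * (1 + 0.18) ** YEARS
print(f"2032 value at 18.0% per year: USD {projected:.1f} million")
```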
Introducing a comprehensive and openly accessible dataset designed for researchers and data scientists in the field of artificial intelligence. This dataset encompasses a collection of over 4,000 AI tools, meticulously categorized into more than 50 distinct categories. This valuable resource has been generously shared by its owner, TasticAI, and is freely available for various purposes such as research, benchmarking, market surveys, and more.
Dataset Overview: The dataset provides an extensive repository of AI tools, each accompanied by a wealth of information to facilitate your research endeavors. Here is a brief overview of the key components:
- AI Tool Name: Each AI tool is listed with its name, providing an easy reference point for users to identify specific tools within the dataset.
- Description: A concise one-line description is provided for each AI tool. This description offers a quick glimpse into the tool's purpose and functionality.
- AI Tool Category: The dataset is thoughtfully organized into more than 50 distinct categories, ensuring that you can easily locate AI tools that align with your research interests or project needs. Whether you are working on natural language processing, computer vision, machine learning, or other AI subfields, you will find a dedicated category.
- Images: Visual representation is crucial for understanding and identifying AI tools. To aid your exploration, the dataset includes images associated with each tool, allowing for quick recognition and visual association.
- Website Links: Accessing more detailed information about a specific AI tool is effortless, as direct links to the tool's respective website or documentation are provided. This feature enables researchers and data scientists to delve deeper into the tools that pique their interest.
Utilization and Benefits: This openly shared dataset serves as a valuable resource for various purposes:
- Research: Researchers can use this dataset to identify AI tools relevant to their studies, facilitating faster literature reviews, comparative analyses, and the exploration of cutting-edge technologies.
- Benchmarking: The extensive collection of AI tools allows for comprehensive benchmarking, enabling you to evaluate and compare tools within specific categories or across categories.
- Market Surveys: Data scientists and market analysts can utilize this dataset to gain insights into the AI tool landscape, helping them identify emerging trends and opportunities within the AI market.
- Educational Purposes: Educators and students can leverage this dataset for teaching and learning about AI tools, their applications, and the categorization of AI technologies.
Conclusion: In summary, this openly shared dataset from TasticAI, featuring over 4,000 AI tools categorized into more than 50 categories, represents a valuable asset for researchers, data scientists, and anyone interested in the field of artificial intelligence. Its easy accessibility, detailed information, and versatile applications make it an indispensable resource for advancing AI research, benchmarking, market analysis, and more. Explore the dataset at https://tasticai.com and unlock the potential of this rich collection of AI tools for your projects and studies.
Uploaded new content for Washington's Fund Finder tool (updated 4-05-2019).
Dataset Card for Dataset Name
Dataset Summary
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Supported Tasks and Leaderboards
information-retrieval, semantic-search
Languages
English
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
docstring) pairs
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details, please refer to our paper: Nihal, R. A., et al. "UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios." ICPR 2024 (Accepted), arXiv preprint arXiv (2024).
and the GitHub repo: https://github.com/Ragib-Amin-Nihal/C2A
We encourage users to cite this paper when using the dataset for their research or applications.
The C2A (Combination to Application) Dataset is a resource designed to advance human detection in disaster scenarios using UAV imagery. This dataset addresses a critical gap in the field of computer vision and disaster response by providing a large-scale, diverse collection of synthetic images that combine real disaster scenes with human poses.
Context: In the wake of natural disasters and emergencies, rapid and accurate human detection is crucial for effective search and rescue operations. UAVs (Unmanned Aerial Vehicles) have emerged as powerful tools in these scenarios, but their effectiveness is limited by the lack of specialized datasets for training AI models. The C2A dataset aims to bridge this gap, enabling the development of more robust and accurate human detection systems for disaster response.
Sources: The C2A dataset is a synthetic combination of two primary sources:
1. Disaster Backgrounds: Sourced from the AIDER (Aerial Image Dataset for Emergency Response Applications) dataset, providing authentic disaster scene imagery.
2. Human Poses: Derived from the LSP/MPII-MPHB (Multiple Poses Human Body) dataset, offering a wide range of human body positions.
Key Features:
- 10,215 high-resolution images
- Over 360,000 annotated human instances
- 5 human pose categories: Bent, Kneeling, Lying, Sitting, and Upright
- 4 disaster scenario types: Fire/Smoke, Flood, Collapsed Building/Rubble, and Traffic Accidents
- Image resolutions ranging from 123x152 to 5184x3456 pixels
- Bounding box annotations for each human instance
Inspiration: This dataset was inspired by the pressing need to improve the capabilities of AI-assisted search and rescue operations. By providing a diverse and challenging set of images that closely mimic real-world disaster scenarios, we aim to:
1. Enhance the accuracy of human detection algorithms in complex environments
2. Improve the generalization of models across various disaster types and human poses
3. Accelerate the development of AI systems that can assist first responders and save lives
Applications: The C2A dataset is designed for researchers and practitioners in:
- Computer Vision and Machine Learning
- Disaster Response and Emergency Management
- UAV/Drone Technology
- Search and Rescue Operations
- Humanitarian Aid and Crisis Response
We hope this dataset will inspire innovative approaches to human detection in challenging environments and contribute to the development of technologies that can make a real difference in disaster response efforts.
The number of times during the month someone searched the name of a State of Iowa Office with a Google My Business profile using Google Search or while on Google Maps.
OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.
The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.
OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:
Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.
AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.
Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.
Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.
Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.
OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:
100B+ Images: Access an extensive database of over 100 billion images.
Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.
Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.
Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
Overview
This dataset contains 7k leaderboard conversation votes collected from Search Arena between March 18, 2025 and April 13, 2025. All entries have been redacted for PII and sensitive user information to ensure privacy. Each data point includes:
- Two model responses (messages_a and messages_b)
- The human vote result
- A timestamp
- Full system metadata, LLM + web search trace, and post-processed metadata for controlled experiments (conv_meta)
To reproduce the leaderboard results… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/search-arena-v1-7k.
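To make the shape of such pairwise vote data concrete, here is a minimal sketch that tallies a plain win fraction per model. The field names `model_a`, `model_b`, and `winner` (with values "model_a", "model_b", or "tie") and the sample rows are assumptions made for illustration; the description above only names `messages_a`, `messages_b`, and `conv_meta`. A raw win fraction is also not the leaderboard's own scoring method, which is described on the dataset page.

```python
from collections import Counter


def win_rates(votes):
    """Return each model's share of battles won; ties count for neither side."""
    wins = Counter()
    battles = Counter()
    for vote in votes:
        model_a, model_b = vote["model_a"], vote["model_b"]
        battles[model_a] += 1
        battles[model_b] += 1
        if vote["winner"] == "model_a":
            wins[model_a] += 1
        elif vote["winner"] == "model_b":
            wins[model_b] += 1
    return {model: wins[model] / battles[model] for model in battles}


# Hypothetical votes with placeholder model names, invented for illustration.
votes = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"model_a": "model-y", "model_b": "model-x", "winner": "model_b"},
    {"model_a": "model-x", "model_b": "model-y", "winner": "tie"},
]
rates = win_rates(votes)
print(rates)
```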
https://www.factori.ai/privacy-policy
Our proprietary People Data is a mobile user dataset that connects anonymous IDs to a wide range of attributes, including demographics, device ownership, audience segments, key locations, and more. This rich dataset allows our partner brands to gain a comprehensive view of consumers based on their personas, enabling them to derive actionable insights swiftly.
Reach
Our extensive data reach covers a variety of categories, encompassing user demographics, Mobile Advertising IDs (MAID), device details, locations, affluence, interests, traveled countries, and more.
Data Export Methodology
We dynamically collect and provide the most updated data and insights through the best-suited method at appropriate intervals, whether daily, weekly, monthly, or quarterly.
Our People Data caters to various business needs, offering valuable insights for consumer analysis, data enrichment, sales forecasting, and retail analytics, empowering brands to make informed decisions and optimize their strategies.
Overview
This dataset contains all in-the-wild conversations crowdsourced from Search Arena between March 18, 2025 and May 8, 2025. It includes 24,069 multi-turn conversations with search-LLMs across diverse intents, languages, and topics, alongside 12,652 human preference votes. The dataset spans approximately 11,000 users across 136 countries, 13 publicly released models, around 90 languages (including 11% multilingual prompts), and over 5,000 multi-turn sessions. While user… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/search-arena-24k.
KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS
BOLIN DING, YINTAO YU, BO ZHAO, CINDY XIDE LIN, JIAWEI HAN, AND CHENGXIANG ZHAI
Abstract. We study the problem of keyword search in a data cube with text-rich dimension(s) (so-called text cube). The text cube is built on a multidimensional text database, where each row is associated with some text data (e.g., a document) and other structural dimensions (attributes). A cell in the text cube aggregates a set of documents with matching attribute values in a subset of dimensions. A cell document is the concatenation of all documents in a cell. Given a keyword query, our goal is to find the top-k most relevant cells (ranked according to the relevance scores of cell documents w.r.t. the given query) in the text cube. We define a keyword-based query language and apply an IR-style relevance model for scoring and ranking cell documents in the text cube. We propose two efficient approaches to find the top-k answers. The proposed approaches support a general class of IR-style relevance scoring formulas that satisfy certain basic and common properties. One of them uses more time for pre-processing and less time for answering online queries; the other is more efficient in pre-processing and consumes more time for online queries. Experimental studies on the ASRS dataset are conducted to verify the efficiency and effectiveness of the proposed approaches.
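The problem statement in the abstract can be illustrated with a brute-force sketch: enumerate every cell, concatenate its documents into a cell document, score that document against the query, and keep the k best. This is not either of the paper's two efficient approaches. The scoring function is a simple length-normalized term frequency chosen as a stand-in for an IR-style relevance model, the all-documents cell (empty dimension subset) is left out for brevity, and the rows, dimension names, and function names are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_cells(rows, dims):
    """Map each cell, keyed by (dimension, value) pairs over a non-empty
    subset of dims, to the list of documents it aggregates."""
    cells = defaultdict(list)
    for attrs, text in rows:
        for size in range(1, len(dims) + 1):
            for subset in combinations(dims, size):
                key = tuple((dim, attrs[dim]) for dim in subset)
                cells[key].append(text)
    return cells


def relevance(cell_docs, query_terms):
    """Stand-in score: query-term occurrences divided by cell-document length."""
    tokens = " ".join(cell_docs).lower().split()
    if not tokens:
        return 0.0
    return sum(tokens.count(term) for term in query_terms) / len(tokens)


def top_k_cells(rows, dims, query, k):
    """Score every cell exhaustively and return the k highest (score, cell) pairs."""
    terms = query.lower().split()
    cells = build_cells(rows, dims)
    scored = ((relevance(docs, terms), key) for key, docs in cells.items())
    return heapq.nlargest(k, scored)


# Hypothetical rows: (structural attributes, document text).
rows = [
    ({"phase": "landing", "weather": "fog"}, "runway incursion during landing"),
    ({"phase": "landing", "weather": "clear"}, "hard landing gear warning"),
    ({"phase": "takeoff", "weather": "fog"}, "engine warning light on takeoff roll"),
]
top = top_k_cells(rows, ["phase", "weather"], "engine warning", k=2)
for score, cell in top:
    print(f"{score:.3f}", cell)
```

Because the number of cells grows exponentially with the number of dimensions, exhaustive scoring like this only works on toy inputs, which is the motivation for the trade-off between pre-processing time and online query time that the abstract describes.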
Use the Records Search to do the following: Search for records, such as agreements with other government agencies, maps and other documents from the Public Works, Transportation, and Planning, Building and Development departments.
Provides DigitalGov Search customers with their search results in JSON. Signing in with a .gov or .mil email is required.
Interactive map of King County parks