100+ datasets found
  1. search-dataset

    • huggingface.co
    Cite
    John, search-dataset [Dataset]. https://huggingface.co/datasets/junzhang1207/search-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    John
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AI Search Providers Benchmark Dataset

      📊 Dataset Structure

    Each entry contains:

    • id: Unique identifier for the QA pair
    • question: The query text
    • expected_answer: The correct answer
    • category: Topic category
    • area: Broader area classification (News/Knowledge)

      🎯 Categories

    The dataset covers various domains including:

    Entertainment, Sports, Technology, General News, Finance, Architecture, Arts, Astronomy, Auto (Automotive), E-sports, Fashion, False Premise

      📈… See the full description on the dataset page: https://huggingface.co/datasets/junzhang1207/search-dataset.
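
    As a rough illustration of the entry layout described for this dataset, the Python sketch below models one record with the five listed fields (id, question, expected_answer, category, area). The class name `SearchBenchmarkEntry`, the helper `group_by_category`, the string type chosen for `id`, and the placeholder records are all assumptions made for illustration; they are not taken from the dataset itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchBenchmarkEntry:
    """One QA pair, using the five field names from the description."""
    id: str               # assumed to be a string; the real type is not stated
    question: str
    expected_answer: str
    category: str
    area: str             # broader classification, e.g. "News" or "Knowledge"


def group_by_category(entries):
    """Group entries by topic category (hypothetical helper)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.category].append(entry)
    return dict(groups)


# Placeholder records, invented for this sketch only.
samples = [
    SearchBenchmarkEntry("ex-1", "placeholder question 1", "placeholder answer 1", "Sports", "News"),
    SearchBenchmarkEntry("ex-2", "placeholder question 2", "placeholder answer 2", "Finance", "Knowledge"),
    SearchBenchmarkEntry("ex-3", "placeholder question 3", "placeholder answer 3", "Finance", "News"),
]

groups = group_by_category(samples)
for category, members in sorted(groups.items()):
    print(category, [entry.id for entry in members])
```

    Grouping by `category` (or by `area`) is one plausible way to report per-domain accuracy when comparing search providers against `expected_answer`.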
    
  2. Understanding machine learning dataset search behaviors: A survey

    • zenodo.org
    csv, pdf, txt
    Updated May 7, 2025
    Cite
    Joe Edgerton (2025). Understanding machine learning dataset search behaviors: A survey [Dataset]. http://doi.org/10.5281/zenodo.15359924
    Available download formats: pdf, txt, csv
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joe Edgerton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 7, 2025
    Description

    These files represent the data and accompanying documents of an independent research study by a student researcher examining the searchability and usability of machine learning dataset metadata.

    The purpose of this exploratory study was to understand how machine learning (ML) practitioners are searching for and evaluating datasets for use in their work. This research will help inform development of the ML dataset metadata standard Croissant, which is actively being developed by the Croissant MLCommons working group, so it can aid ML practitioners' workflows and promote best practices like Responsible Artificial Intelligence (RAI).

    The study consisted of a pre-interview Qualtrics survey ("Survey_questions_pre_interview.pdf") that focused on ranking various metadata elements on a Likert importance scale.

    The interview consisted of open questions ("Interview_script_and_questions.pdf") on a range of topics from search of datasets to interoperability to AI used in dataset search. Additionally, participants were asked to share their screen at one point and recall a recent dataset search they had performed.

    The resulting survey dataset ("Survey_p1.csv") and interview transcript ("Interview_p1.txt") are presented in open standard formats for accessibility. Identifying data has been removed from the files, so some columns and rows referenced in the files may be missing.

  3. FinDER

    • huggingface.co
    Updated Aug 3, 2024
    Cite
    LinqAlpha (2024). FinDER [Dataset]. https://huggingface.co/datasets/Linq-AI-Research/FinDER
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    LinqAlpha
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation

    FinDER is a benchmark dataset designed for evaluating Retrieval-Augmented Generation (RAG) in financial question answering. It consists of 5,703 expert-annotated query–evidence–answer triplets derived from real-world 10-K filings and ambiguous financial queries submitted by industry professionals. This dataset captures the domain-specific challenges of financial QA, including short… See the full description on the dataset page: https://huggingface.co/datasets/Linq-AI-Research/FinDER.

  4. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Available download formats: .json, .csv, .xls
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Belize, Colombia, Benin, Djibouti, Iceland, Russian Federation, Equatorial Guinea, Qatar, Antigua and Barbuda, Saudi Arabia
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensures that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  5. Chest Finder Dataset

    • universe.roboflow.com
    zip
    Updated Jun 5, 2025
    Cite
    Ai Dataset (2025). Chest Finder Dataset [Dataset]. https://universe.roboflow.com/ai-dataset-8dqwo/chest-finder
    Available download formats: zip
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    Ai Dataset
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Chests Bounding Boxes
    Description

    Chest Finder

    ## Overview
    
    Chest Finder is a dataset for object detection tasks - it contains Chests annotations for 245 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. U.S. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated May 19, 2025
    Cite
    Archive Market Research (2025). U.S. AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/us-ai-training-dataset-market-4957
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    United States
    Variables measured
    Market Size
    Description

    The U.S. AI Training Dataset Market size was valued at USD 590.4 million in 2023 and is projected to reach USD 1880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market deals with the generation, selection, and organization of datasets used in training artificial intelligence. These datasets contain the information that machine learning algorithms need to infer and learn from. Uses include the advancement and improvement of AI solutions in different fields of business such as transport, medical analysis, computing language, and financial measurements. Applications include training models for activities such as image classification, predictive modeling, and natural language interfaces. Other emerging trends are a shift toward more, better-quality, more varied, and annotated data to improve model efficiency, synthetic data generation to address data shortages, and attention to data confidentiality and ethical issues in dataset management. Furthermore, with advancing technologies in artificial intelligence and machine learning, there is noticeable development in how datasets are built and used.

    Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that will give the former real-time access to the latter's data and use Google AI to enhance Reddit's search capabilities. In February 2024, Microsoft announced around USD 2.1 billion investment in Mistral AI to expedite the growth and deployment of large language models. The U.S. giant is expected to underpin Mistral AI with Azure AI supercomputing infrastructure to provide top-notch scale and performance for AI training and inference workloads.
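
    The quoted growth rate can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The Python sketch below applies it to the two market sizes given in the description. The function names `cagr` and `years_for_rate` are hypothetical, and the period lengths are assumptions, because the description does not say which years the 18.0% rate spans.

```python
import math


def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


def years_for_rate(start_value: float, end_value: float, rate: float) -> float:
    """Number of years needed to grow from start to end at a fixed annual rate."""
    return math.log(end_value / start_value) / math.log(1 + rate)


start_musd = 590.4     # USD million, 2023 valuation quoted in the description
end_musd = 1880.70     # USD million, 2032 projection quoted in the description

# Assumption 1: the rate is measured from 2023 to 2032 (nine years).
rate_2023_2032 = cagr(start_musd, end_musd, 2032 - 2023)

# Assumption 2: the 18.0% figure is correct; how many years does it imply?
implied_years = years_for_rate(start_musd, end_musd, 0.18)

print(f"CAGR over 2023-2032: {rate_2023_2032:.1%}")
print(f"Years implied by an 18.0% CAGR: {implied_years:.1f}")
```

    Under these assumptions, growth measured over the nine years from 2023 to 2032 works out to a lower annual rate than 18.0%, while an 18.0% rate corresponds to about seven years of growth between the two quoted values, so the stated CAGR probably refers to a shorter forecast window than 2023 to 2032.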

  7. AI TOOLS - Open Dataset - 4000 tools / 50 categories

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    BUREAU, Olivier (2023). AI TOOLS - Open Dataset - 4000 tools / 50 categories [Dataset]. http://doi.org/10.7910/DVN/QLSXZG
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    BUREAU, Olivier
    Description

    Introducing a comprehensive and openly accessible dataset designed for researchers and data scientists in the field of artificial intelligence. This dataset encompasses a collection of over 4,000 AI tools, meticulously categorized into more than 50 distinct categories. This valuable resource has been generously shared by its owner, TasticAI, and is freely available for various purposes such as research, benchmarking, market surveys, and more.

    Dataset Overview: The dataset provides an extensive repository of AI tools, each accompanied by a wealth of information to facilitate your research endeavors. Here is a brief overview of the key components:

    • AI Tool Name: Each AI tool is listed with its name, providing an easy reference point for users to identify specific tools within the dataset.
    • Description: A concise one-line description is provided for each AI tool. This description offers a quick glimpse into the tool's purpose and functionality.
    • AI Tool Category: The dataset is thoughtfully organized into more than 50 distinct categories, ensuring that you can easily locate AI tools that align with your research interests or project needs. Whether you are working on natural language processing, computer vision, machine learning, or other AI subfields, you will find a dedicated category.
    • Images: Visual representation is crucial for understanding and identifying AI tools. To aid your exploration, the dataset includes images associated with each tool, allowing for quick recognition and visual association.
    • Website Links: Accessing more detailed information about a specific AI tool is effortless, as direct links to the tool's respective website or documentation are provided. This feature enables researchers and data scientists to delve deeper into the tools that pique their interest.

    Utilization and Benefits: This openly shared dataset serves as a valuable resource for various purposes:

    • Research: Researchers can use this dataset to identify AI tools relevant to their studies, facilitating faster literature reviews, comparative analyses, and the exploration of cutting-edge technologies.
    • Benchmarking: The extensive collection of AI tools allows for comprehensive benchmarking, enabling you to evaluate and compare tools within specific categories or across categories.
    • Market Surveys: Data scientists and market analysts can utilize this dataset to gain insights into the AI tool landscape, helping them identify emerging trends and opportunities within the AI market.
    • Educational Purposes: Educators and students can leverage this dataset for teaching and learning about AI tools, their applications, and the categorization of AI technologies.

    Conclusion: In summary, this openly shared dataset from TasticAI, featuring over 4,000 AI tools categorized into more than 50 categories, represents a valuable asset for researchers, data scientists, and anyone interested in the field of artificial intelligence. Its easy accessibility, detailed information, and versatile applications make it an indispensable resource for advancing AI research, benchmarking, market analysis, and more. Explore the dataset at https://tasticai.com and unlock the potential of this rich collection of AI tools for your projects and studies.

  8. Fund Finder Master Content (4-05-2019)

    • catalog.data.gov
    • data.wa.gov
    • +1 more
    Updated Dec 16, 2022
    Cite
    data.wa.gov (2022). Fund Finder Master Content (4-05-2019) [Dataset]. https://catalog.data.gov/dataset/fund-finder-master-content-4-05-2019
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    data.wa.gov
    Description

    Uploaded new content for Washington's Fund Finder tool (updated 4-05-2019).

  9. Artificial-intelligence-dataset-for-IR-systems

    • huggingface.co
    Updated Jul 25, 2023
    Cite
    Adel Mamoun Elwan (2023). Artificial-intelligence-dataset-for-IR-systems [Dataset]. https://huggingface.co/datasets/Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 25, 2023
    Authors
    Adel Mamoun Elwan
    Description

    Dataset Card for Dataset Name

      Dataset Summary
    

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Supported Tasks and Leaderboards
    

    information-retrieval, semantic-search

      Languages
    

    English

      Dataset Structure

      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More Information Needed]

      Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems.
    
  10. Dataset of smart contract code search

    • ieee-dataport.org
    Updated Jan 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    chaochen shi (2025). Dataset of smart contract code search [Dataset]. https://ieee-dataport.org/documents/dataset-smart-contract-code-search
    Dataset updated
    Jan 4, 2025
    Authors
    chaochen shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    docstring) pairs

  11. Data from: C2a Dataset

    • universe.roboflow.com
    zip
    Updated Aug 23, 2024
    Cite
    saint tour (2024). C2a Dataset [Dataset]. https://universe.roboflow.com/saint-tour/c2a-dataset
    Available download formats: zip
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    saint tour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Human Bounding Boxes
    Description

    For more details, please refer to our paper: Nihal, R. A., et al. "UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios." ICPR 2024 (Accepted), arXiv preprint arXiv (2024).

    and the github repo https://github.com/Ragib-Amin-Nihal/C2A

    We encourage users to cite this paper when using the dataset for their research or applications.

    The C2A (Combination to Application) Dataset is a resource designed to advance human detection in disaster scenarios using UAV imagery. This dataset addresses a critical gap in the field of computer vision and disaster response by providing a large-scale, diverse collection of synthetic images that combine real disaster scenes with human poses.

    Context: In the wake of natural disasters and emergencies, rapid and accurate human detection is crucial for effective search and rescue operations. UAVs (Unmanned Aerial Vehicles) have emerged as powerful tools in these scenarios, but their effectiveness is limited by the lack of specialized datasets for training AI models. The C2A dataset aims to bridge this gap, enabling the development of more robust and accurate human detection systems for disaster response.

    Sources: The C2A dataset is a synthetic combination of two primary sources:
    1. Disaster Backgrounds: Sourced from the AIDER (Aerial Image Dataset for Emergency Response Applications) dataset, providing authentic disaster scene imagery.
    2. Human Poses: Derived from the LSP/MPII-MPHB (Multiple Poses Human Body) dataset, offering a wide range of human body positions.

    Key Features:
    - 10,215 high-resolution images
    - Over 360,000 annotated human instances
    - 5 human pose categories: Bent, Kneeling, Lying, Sitting, and Upright
    - 4 disaster scenario types: Fire/Smoke, Flood, Collapsed Building/Rubble, and Traffic Accidents
    - Image resolutions ranging from 123x152 to 5184x3456 pixels
    - Bounding box annotations for each human instance

    Inspiration: This dataset was inspired by the pressing need to improve the capabilities of AI-assisted search and rescue operations. By providing a diverse and challenging set of images that closely mimic real-world disaster scenarios, we aim to:
    1. Enhance the accuracy of human detection algorithms in complex environments
    2. Improve the generalization of models across various disaster types and human poses
    3. Accelerate the development of AI systems that can assist first responders and save lives

    Applications: The C2A dataset is designed for researchers and practitioners in:
    - Computer Vision and Machine Learning
    - Disaster Response and Emergency Management
    - UAV/Drone Technology
    - Search and Rescue Operations
    - Humanitarian Aid and Crisis Response

    We hope this dataset will inspire innovative approaches to human detection in challenging environments and contribute to the development of technologies that can make a real difference in disaster response efforts.

  12. Direct Searches for Iowa Offices

    • datasets.ai
    • catalog.data.gov
    Updated Sep 15, 2024
    Cite
    State of Iowa (2024). Direct Searches for Iowa Offices [Dataset]. https://datasets.ai/datasets/direct-searches-for-iowa-offices
    Dataset updated
    Sep 15, 2024
    Dataset authored and provided by
    State of Iowa
    Area covered
    Iowa
    Description

    The number of times during the month someone searched the name of a State of Iowa Office with a Google My Business profile using Google Search or while on Google Maps.

  13. Google SERP Data, Web Search Data, Google Images Data | Real-Time API

    • datarade.ai
    .json, .csv
    Updated Jul 7, 2024
    Cite
    OpenWeb Ninja (2024). Google SERP Data, Web Search Data, Google Images Data | Real-Time API [Dataset]. https://datarade.ai/data-products/openweb-ninja-google-data-google-image-data-google-serp-d-openweb-ninja
    Available download formats: .json, .csv
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    South Georgia and the South Sandwich Islands, Burundi, Panama, Uganda, Barbados, Ireland, Tokelau, Virgin Islands (U.S.), Uruguay, Grenada
    Description

    OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.

    The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.

    OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:

    • Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.

    • AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.

    • Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.

    • Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.

    • Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.

    OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:

    • 100B+ Images: Access an extensive database of over 100 billion images.

    • Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.

    • Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.

    • Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.

  14. search-arena-v1-7k

    • huggingface.co
    Updated Apr 13, 2025
    Cite
    LMArena (2025). search-arena-v1-7k [Dataset]. https://huggingface.co/datasets/lmarena-ai/search-arena-v1-7k
    Dataset updated
    Apr 13, 2025
    Dataset authored and provided by
    LMArena
    Description

    Overview

    This dataset contains 7k leaderboard conversation votes collected from Search Arena between March 18, 2025 and April 13, 2025. All entries have been redacted for PII and sensitive user information to ensure privacy. Each data point includes:

    • Two model responses (messages_a and messages_b)
    • The human vote result
    • A timestamp
    • Full system metadata, LLM + web search trace, and post-processed metadata for controlled experiments (conv_meta)

    To reproduce the leaderboard results… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/search-arena-v1-7k.
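
    The description is truncated before the reproduction steps, so the following is only a minimal sketch of how pairwise votes of this kind are commonly tallied, not the dataset's own leaderboard procedure. The record keys `model_a`, `model_b`, and `winner`, the system names, and the sample records are hypothetical; only `messages_a` and `messages_b` are named in the description.

```python
from collections import Counter


def tally_votes(records):
    """Count wins per model from pairwise vote records.

    Each record is assumed (hypothetically) to carry the two model names
    under "model_a" / "model_b" and the vote under "winner", whose value is
    "model_a", "model_b", or anything else for a tie.
    """
    wins = Counter()
    for record in records:
        winner = record["winner"]
        if winner in ("model_a", "model_b"):
            wins[record[winner]] += 1
        else:
            wins["tie"] += 1
    return wins


# Invented records for illustration only.
votes = [
    {"model_a": "system-x", "model_b": "system-y", "winner": "model_a"},
    {"model_a": "system-y", "model_b": "system-x", "winner": "model_a"},
    {"model_a": "system-x", "model_b": "system-y", "winner": "tie"},
    {"model_a": "system-x", "model_b": "system-y", "winner": "model_a"},
]

wins = tally_votes(votes)
for name, count in wins.most_common():
    print(name, count)
```

    Raw win counts ignore how often each pair of systems was compared; a rating model fitted to the same pairwise records would account for that, but the description does not say which method the leaderboard uses.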

  15. People Data | Global |Reach - 900 Million Records for Comprehensive Consumer...

    • factori.ai
    Updated Dec 24, 2024
    Cite
    (2024). People Data | Global |Reach - 900 Million Records for Comprehensive Consumer Insights & Data Enrichment [Dataset]. https://www.factori.ai/datasets/people-data/
    Dataset updated
    Dec 24, 2024
    License

    https://www.factori.ai/privacy-policy

    Area covered
    Global
    Description

    Our proprietary People Data is a mobile user dataset that connects anonymous IDs to a wide range of attributes, including demographics, device ownership, audience segments, key locations, and more. This rich dataset allows our partner brands to gain a comprehensive view of consumers based on their personas, enabling them to derive actionable insights swiftly.

    People Data Graph

    • Record Count: 900 Million
    • Capturing Frequency: Once per Event
    • Delivering Frequency: Once per Month
    • Updated: Monthly

    People Data Reach

    Our extensive data reach covers a variety of categories, encompassing user demographics, Mobile Advertising IDs (MAID), device details, locations, affluence, interests, traveled countries, and more.

    Data Export Methodology

    We dynamically collect and provide the most updated data and insights through the best-suited method at appropriate intervals, whether daily, weekly, monthly, or quarterly.

    Business Needs

    Our People Data caters to various business needs, offering valuable insights for consumer analysis, data enrichment, sales forecasting, and retail analytics, empowering brands to make informed decisions and optimize their strategies.

  16. search-arena-24k

    • huggingface.co
    Updated Jun 6, 2025
    Cite
    LMArena (2025). search-arena-24k [Dataset]. https://huggingface.co/datasets/lmarena-ai/search-arena-24k
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    LMArena
    Description

    Overview

    This dataset contains all in-the-wild conversations crowdsourced from Search Arena between March 18, 2025 and May 8, 2025. It includes 24,069 multi-turn conversations with search-LLMs across diverse intents, languages, and topics, alongside 12,652 human preference votes. The dataset spans approximately 11,000 users across 136 countries, 13 publicly released models, around 90 languages (including 11% multilingual prompts), and over 5,000 multi-turn sessions. While user… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/search-arena-24k.

  17. Data from: KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS

    • s.cnmilf.com
    • datasets.ai
    • +3 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/keyword-search-in-text-cube-finding-top-k-relevant-cells
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS

    BOLIN DING, YINTAO YU, BO ZHAO, CINDY XIDE LIN, JIAWEI HAN, AND CHENGXIANG ZHAI

    Abstract. We study the problem of keyword search in a data cube with text-rich dimension(s) (so-called text cube). The text cube is built on a multidimensional text database, where each row is associated with some text data (e.g., a document) and other structural dimensions (attributes). A cell in the text cube aggregates a set of documents with matching attribute values in a subset of dimensions. A cell document is the concatenation of all documents in a cell. Given a keyword query, our goal is to find the top-k most relevant cells (ranked according to the relevance scores of cell documents w.r.t. the given query) in the text cube. We define a keyword-based query language and apply IR-style relevance model for scoring and ranking cell documents in the text cube. We propose two efficient approaches to find the top-k answers. The proposed approaches support a general class of IR-style relevance scoring formulas that satisfy certain basic and common properties. One of them uses more time for pre-processing and less time for answering online queries; and the other one is more efficient in pre-processing and consumes more time for online queries. Experimental studies on the ASRS dataset are conducted to verify the efficiency and effectiveness of the proposed approaches.
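
    To make the problem statement concrete, here is a naive brute-force sketch in Python. It is not either of the two efficient approaches mentioned in the abstract: it simply enumerates every cell, concatenates its documents, and ranks cells with a toy term-frequency score. The function names, the scoring formula, and the sample rows are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_cells(rows, dimensions):
    """Map each cell to its cell document.

    A cell is a tuple of (dimension, value) pairs over some subset of the
    dimensions; its document is the concatenation of all matching row texts.
    """
    cells = defaultdict(list)
    for attributes, text in rows:
        for size in range(len(dimensions) + 1):
            for subset in combinations(dimensions, size):
                key = tuple((dim, attributes[dim]) for dim in subset)
                cells[key].append(text)
    return {key: " ".join(texts) for key, texts in cells.items()}


def relevance(cell_document, query_terms):
    """Toy score: fraction of tokens in the cell document that are query terms."""
    tokens = cell_document.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in query_terms)
    return hits / len(tokens)


def top_k_cells(rows, dimensions, query, k):
    """Return the k highest-scoring (score, cell) pairs for a keyword query."""
    query_terms = set(query.lower().split())
    cells = build_cells(rows, dimensions)
    scored = ((relevance(doc, query_terms), key) for key, doc in cells.items())
    return heapq.nlargest(k, scored)


# Invented rows: (structural attributes, text). Not taken from the ASRS data.
rows = [
    ({"aircraft": "A", "phase": "landing"}, "engine failure during landing"),
    ({"aircraft": "A", "phase": "takeoff"}, "bird strike on takeoff"),
    ({"aircraft": "B", "phase": "landing"}, "smooth landing no issues"),
]

top = top_k_cells(rows, ["aircraft", "phase"], "engine failure", k=2)
for score, cell in top:
    print(f"{score:.2f}", cell)
```

    Enumerating every subset of dimensions grows exponentially with the number of dimensions, which is presumably why the abstract frames the contribution as a trade-off between pre-processing time and online query time.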

  18. Records Search Help Document

    • datasets.ai
    • catalog.data.gov
    • +2 more
    21
    Updated Aug 8, 2024
    Cite
    Lake County, Illinois (2024). Records Search Help Document [Dataset]. https://datasets.ai/datasets/records-search-help-document-891ee
    Available download formats: 21
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Lake County, Illinois
    Description

    Use the Records Search to do the following: Search for records, such as agreements with other government agencies, maps and other documents from the Public Works, Transportation, and Planning, Building and Development departments.

  19. DigitalGov Search API (formerly USASearch).

    • datadiscoverystudio.org
    • datasets.ai
    • +2 more
    Updated Feb 8, 2018
    Cite
    (2018). DigitalGov Search API (formerly USASearch). [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/cf8c58770ea543fdbd420c13157d3e47/html
    Dataset updated
    Feb 8, 2018
    Description

    Provides DigitalGov Search customers with their search results in JSON. Signing in with a .gov or .mil email is required.

  20. King County Parks Finder

    • datasets.ai
    • data.kingcounty.gov
    • +1 more
    21
    Updated Aug 26, 2024
    Cite
    King County, Washington (2024). King County Parks Finder [Dataset]. https://datasets.ai/datasets/king-county-parks-finder
    Available download formats: 21
    Dataset updated
    Aug 26, 2024
    Dataset authored and provided by
    King County, Washington
    Area covered
    King County
    Description

    Interactive map of King County parks
