100+ datasets found
  1. PRIVATE Patent Application Information Retrieval (PAIR)

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 15, 2022
    Cite
    Patents (2022). PRIVATE Patent Application Information Retrieval (PAIR) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/private-patent-application-information-retrieval-pair
    Explore at:
    Dataset updated
    Jul 15, 2022
    Dataset provided by
    Patents
    Description

    Offers exclusive access to patent application status information for unpublished patent applications, available only to the applicant/inventor or his/her representative(s). Private PAIR includes bibliographic data, patent term adjustments, continuity data, foreign priority, and address and attorney/agent information from the Patent Application Locating and Monitoring (PALM) System; PDF images of documents (including correspondence) and a transaction history from the Content Management System (CMS), formerly the Image File Wrapper (IFW) System; and fee information from the Fee Processing Next Generation (FPNG) System. Search is by application number (with or without the two-digit series code), control number, or Patent Cooperation Treaty (PCT) number. Private PAIR requires users to establish a USPTO.gov account, a customer number, and a password. For more information about establishing a USPTO.gov account and customer number, see https://www.uspto.gov/patents-application-process/applying-online/getting-started-new-users. Unavailable during database backups (Saturday, Tuesday, and Thursday from 04:30 - 04:45 AM U.S. Eastern Time, and Sunday 00:01 - 04:00 AM U.S. Eastern Time). Updated daily. https://ppair-my.uspto.gov/pair/PrivatePair

  2. DORIS-MAE-v1

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Oct 17, 2023
    Cite
    Jianyou Wang; Kaicheng Wang; Xiaoyue Wang; Prudhviraj Naidu; Leon Bergen; Ramamohan Paturi (2023). DORIS-MAE-v1 [Dataset]. http://doi.org/10.5281/zenodo.8299749
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jianyou Wang; Kaicheng Wang; Xiaoyue Wang; Prudhviraj Naidu; Leon Bergen; Ramamohan Paturi
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high costs and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research.

    Documentation for the DORIS-MAE dataset is publicly available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. This upload contains both DORIS-MAE dataset version 1 and ada-002 vector embeddings for all queries and related abstracts (used in candidate pool creation). DORIS-MAE dataset version 1 comprises four main sub-datasets, each serving a distinct purpose.

    The Query dataset contains 100 human-crafted complex queries spanning five categories: ML, NLP, CV, AI, and Composite, with 20 queries per category. Queries are broken down into aspects (3 to 9 per query) and sub-aspects (0 to 6 per aspect, with 0 signifying that no further breakdown is required). For each query, a corresponding candidate pool of 99 to 138 relevant paper abstracts is provided.

    The Corpus dataset is composed of 363,133 abstracts from computer science papers published between 2011 and 2021 and sourced from arXiv. Each entry includes the title, original abstract, URL, primary and secondary categories, and citation information retrieved from Semantic Scholar. A masked version of each abstract is also provided, facilitating the automated creation of queries.

    The Annotation dataset includes generated annotations for all 165,144 question pairs, each comprising an aspect/sub-aspect and a corresponding paper abstract from the query's candidate pool. It includes the original text generated by ChatGPT (version chatgpt-3.5-turbo-0301) explaining its decision-making process, along with a three-level relevance score (0, 1, or 2) representing ChatGPT's final decision.

    Finally, the Test Set dataset contains human annotations for a random selection of 250 question pairs used in hypothesis testing. It records each of the three human annotators' final decisions as a three-level relevance score (0, 1, or 2).

    The file "ada_embedding_for_DORIS-MAE_v1.pickle" contains text embeddings for the DORIS-MAE dataset, generated by OpenAI's ada-002 model. The structure of the file is as follows:

    ada_embedding_for_DORIS-MAE_v1.pickle
    ├── "Query"
    │   ├── query_id_1 (Embedding of query_1)
    │   ├── query_id_2 (Embedding of query_2)
    │   ├── query_id_3 (Embedding of query_3)
    │   └── ...
    └── "Corpus"
        ├── corpus_id_1 (Embedding of abstract_1)
        ├── corpus_id_2 (Embedding of abstract_2)
        ├── corpus_id_3 (Embedding of abstract_3)
        └── ...
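A minimal sketch of how these embeddings could be read back and used for ranking, assuming only the pickle structure shown above (the id key names and the cosine-similarity ranking are illustrative, not prescribed by the dataset):

```python
import pickle

def load_embeddings(path):
    """Load the pickle described above: a dict whose top-level keys are
    "Query" and "Corpus", each mapping ids to ada-002 vectors."""
    with open(path, "rb") as f:
        return pickle.load(f)

def cosine(u, v):
    """Cosine similarity, the usual scoring function for dense embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def rank_corpus(query_vec, corpus_vecs, k=10):
    """Return the k corpus ids most similar to one query embedding."""
    scored = sorted(corpus_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [cid for cid, _ in scored[:k]]
```

With the real file, `rank_corpus(load_embeddings(path)["Query"][qid], load_embeddings(path)["Corpus"])` would give a dense-retrieval baseline ranking for one query.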

  3. Finsights-Grey-RAG-Effective-Information-Retrieval-logs

    • huggingface.co
    Updated Mar 1, 2025
    Cite
    rajagopal (2025). Finsights-Grey-RAG-Effective-Information-Retrieval-logs [Dataset]. https://huggingface.co/datasets/rajapower1/Finsights-Grey-RAG-Effective-Information-Retrieval-logs
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 1, 2025
    Authors
    rajagopal
    Description

    The rajapower1/Finsights-Grey-RAG-Effective-Information-Retrieval-logs dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  4. Models and Data for Simple Applications of BERT for Ad Hoc Document Retrieval

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Yang, Wei (2020). Models and Data for Simple Applications of BERT for Ad Hoc Document Retrieval [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3241944
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Lin, Jimmy
    Zhang, Haotian
    Akkalyoncu Yilmaz, Zeynep
    Yang, Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This submission includes all pretrained models, test data, and prediction files for the arXiv paper "Simple Applications of BERT for Ad Hoc Document Retrieval". Please follow the instructions in the Birch repo to reproduce the results.

  5. Computer-Assisted Information Retrieval Service System for Music

    • rrid.site
    • dknet.org
    • +2more
    Updated Aug 25, 2025
    Cite
    (2025). Computer-Assisted Information Retrieval Service System for Music [Dataset]. http://identifiers.org/RRID:SCR_008177
    Explore at:
    Dataset updated
    Aug 25, 2025
    Description

    CAIRSS is a bibliographic database of older music research literature (prior to 1993) in music education, music psychology, music therapy, and music medicine. Citations have been taken from 1,354 different journal titles, 18 of which are primary journals, meaning that every article ever to appear in them is included. The primary journals are:

    * Arts in Psychotherapy
    * Bulletin of the Council for Research in Music Education
    * Bulletin of the National Association for Music Therapy
    * Contributions to Music Education
    * Hospital Music Newsletter
    * International Journal of Arts Medicine
    * Journal of the Association for Music and Imagery
    * Journal of Music Teacher Education
    * Journal of Music Therapy
    * Journal of Research in Music Education
    * Medical Problems of Performing Artists
    * Music Perception
    * Music Therapy
    * Music Therapy Perspectives
    * Psychology of Music
    * Psychomusicology
    * The Quarterly
    * Applications of Research to Music Education

  6. IR Benchmarks

    • anthology.aicmu.ac.cn
    • webis.de
    Updated 2023
    Cite
    Martin Potthast; Benno Stein; Matthias Hagen (2023). IR Benchmarks [Dataset]. https://anthology.aicmu.ac.cn/data/ir-benchmarks.html
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    The Web Technology & Information Systems Network
    Leipzig University
    Friedrich Schiller University Jena
    Bauhaus-Universität Weimar
    Authors
    Martin Potthast; Benno Stein; Matthias Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of information retrieval benchmarks covering 15 corpora (1.9 billion documents) on which 32 well-known shared tasks are based. We filled the leaderboards with Docker images of 50 standard retrieval approaches. Within this setup, we were able to automatically run and evaluate the 50 approaches on the 32 tasks (1,600 runs). All benchmarks are added as training datasets because their qrels are already publicly available. Please find a detailed tutorial on how to submit approaches on GitHub.

    View on TIRA: https://tira.io/task-overview/ir-benchmarks

  7. TREC 2022 NeuCLIR Dataset

    • gimi9.com
    • +2more
    Updated Sep 13, 2025
    Cite
    (2025). TREC 2022 NeuCLIR Dataset [Dataset]. https://gimi9.com/dataset/data-gov_2022-neuclir-dataset/
    Explore at:
    Dataset updated
    Sep 13, 2025
    Description

    Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluation forums for more than twenty years, but recent advances in the application of deep learning to information retrieval (IR) warrant a new, large-scale effort that will enable exploration of classical and modern IR techniques for this task.

  8. company-profiles

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    Ojas Srivastava (2023). company-profiles [Dataset]. https://www.kaggle.com/datasets/ojassrivastava18/company-information-information-retrieval-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ojas Srivastava
    Description

    This dataset contains information about different companies. There are 41 text files, each containing information about a different company, and each file has more than 500 words. The dataset can be used to test information retrieval models and other NLP-based models.
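A tiny retrieval baseline is enough to exercise a corpus of this shape. The sketch below (the directory name and the term-count scoring are illustrative assumptions, not part of the dataset) loads the text files and ranks them by raw query-term frequency:

```python
import re
from collections import Counter
from pathlib import Path

def load_profiles(directory):
    """Read every .txt file in `directory` into a {name: text} dict.
    The directory name is whatever the 41 files were unpacked into."""
    return {p.stem: p.read_text(encoding="utf-8")
            for p in Path(directory).glob("*.txt")}

def rank_by_overlap(query, profiles, k=5):
    """Rank company files by total occurrences of the query terms —
    a deliberately simple baseline for sanity-checking an IR setup."""
    q_terms = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for name, text in profiles.items():
        terms = Counter(re.findall(r"[a-z]+", text.lower()))
        scored.append((sum(terms[t] for t in q_terms), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

A more serious evaluation would swap the scoring function for TF-IDF or BM25 while keeping the same loading code.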

  9. vietnamese-retrieval

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    Thanh Dat Hoang (2024). vietnamese-retrieval [Dataset]. https://huggingface.co/datasets/thanhdath/vietnamese-retrieval
    Explore at:
    Dataset updated
    Feb 8, 2024
    Authors
    Thanh Dat Hoang
    Description

    Dataset Card for "vietnamese-retrieval"

    More Information needed

  10. SE-PQA: a Resource for Personalized Community Question Answering

    • zenodo.org
    csv, zip
    Updated Feb 5, 2024
    Cite
    Kasela Pranav; Pasi Gabriella; Perego Raffaele; Marco Braga (2024). SE-PQA: a Resource for Personalized Community Question Answering [Dataset]. http://doi.org/10.5281/zenodo.7940964
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kasela Pranav; Pasi Gabriella; Perego Raffaele; Marco Braga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Personalization in Information Retrieval has been studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets for conducting large-scale experiments and evaluating models for personalized search. This paper helps fill this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new resource for designing and evaluating personalized models for the two tasks of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with both questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments show that SE-PQA is well suited to training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits, in terms of robustness and generalization, of combining data from multiple communities for personalization purposes.

    Performance on all communities separately:

    Community     | Model (BM25 +) | P@1   | NDCG@3 | NDCG@10 | R@100 | MAP@100 | λ
    --------------|----------------|-------|--------|---------|-------|---------|-----------
    Academia      | MiniLM         | 0.438 | 0.382  | 0.395   | 0.489 | 0.344   | (.1,.9)
                  | MiniLM + TAG   | 0.453 | 0.392  | 0.403   | 0.489 | 0.352   | (.1,.8,.1)
    Anime         | MiniLM + TAG   | 0.650 | 0.682  | 0.714   | 0.856 | 0.683   | (.1,.9,.0)
    Apple         | MiniLM         | 0.327 | 0.351  | 0.381   | 0.514 | 0.349   | (.1,.9)
                  | MiniLM + TAG   | 0.335 | 0.361  | 0.389   | 0.514 | 0.357   | (.1,.8,.1)
    Bicycles      | MiniLM         | 0.405 | 0.380  | 0.421   | 0.600 | 0.365   | (.1,.9)
                  | MiniLM + TAG   | 0.436 | 0.405  | 0.441   | 0.600 | 0.386   | (.1,.8,.1)
    Boardgames    | MiniLM         | 0.681 | 0.694  | 0.728   | 0.866 | 0.692   | (.1,.9)
                  | MiniLM + TAG   | 0.696 | 0.702  | 0.736   | 0.866 | 0.699   | (.1,.8,.1)
    Buddhism      | MiniLM + TAG   | 0.490 | 0.387  | 0.397   | 0.544 | 0.334   | (.3,.7,.0)
    Christianity  | MiniLM         | 0.534 | 0.505  | 0.555   | 0.783 | 0.497   | (.2,.8)
                  | MiniLM + TAG   | 0.549 | 0.521  | 0.564   | 0.783 | 0.507   | (.1,.8,.1)
    Cooking       | MiniLM         | 0.600 | 0.567  | 0.600   | 0.719 | 0.553   | (.1,.9)
                  | MiniLM + TAG   | 0.619 | 0.583  | 0.614   | 0.719 | 0.568   | (.1,.8,.1)
    DIY           | MiniLM         | 0.323 | 0.313  | 0.346   | 0.501 | 0.302   | (.1,.9)
                  | MiniLM + TAG   | 0.335 | 0.324  | 0.356   | 0.501 | 0.312   | (.1,.8,.1)
    Expatriates   | MiniLM + TAG   | 0.596 | 0.653  | 0.682   | 0.832 | 0.645   | (.1,.9,.0)
    Fitness       | MiniLM + TAG   | 0.568 | 0.575  | 0.613   | 0.760 | 0.567   | (.2,.8,.0)
    Freelancing   | MiniLM + TAG   | 0.513 | 0.472  | 0.506   | 0.654 | 0.457   | (.1,.9,.0)
    Gaming        | MiniLM         | 0.510 | 0.534  | 0.562   | 0.686 | 0.532   | (.1,.9)
                  | MiniLM + TAG   | 0.519 | 0.547  | 0.571   | 0.686 | 0.541   | (.1,.8,.1)
    Gardening     | MiniLM         | 0.344 | 0.362  | 0.396   | 0.520 | 0.359   | (.1,.9)
                  | MiniLM + TAG   | 0.345 | 0.369  | 0.399   | 0.520 | 0.363   | (.1,.8,.1)
    Genealogy     | MiniLM + TAG   | 0.592 | 0.605  | 0.631   | 0.779 | 0.594   | (.3,.7,.0)
    Health        | MiniLM + TAG   | 0.718 | 0.765  | 0.797   | 0.934 | 0.765   | (.2,.8,.0)
    Hermeneutics  | MiniLM         | 0.589 | 0.538  | 0.593   | 0.828 | 0.526   | (.2,.8)
                  | MiniLM + TAG   | 0.632 | 0.570  | 0.617   | 0.828 | 0.552   | (.1,.8,.1)
    Hinduism      | MiniLM         | 0.388 | 0.415  | 0.459   | 0.686 | 0.416   | (.2,.8)
                  | MiniLM + TAG   | 0.382 | 0.410  | 0.457   | 0.686 | 0.412   | (.1,.8,.1)
    History       | MiniLM + TAG   | 0.740 | 0.735  | 0.764   | 0.862 | 0.730   | (.2,.8,.0)
    Hsm           | MiniLM + TAG   | 0.666 | 0.707  | 0.737   | 0.870 | 0.690   | (.2,.8,.0)
    Interpersonal | MiniLM + TAG   | 0.663 | 0.617  | 0.653   | 0.739 | 0.604   | (.2,.8,.0)
    Islam         | MiniLM         | 0.382 | 0.412  | 0.453   | 0.642 | 0.410   | (.1,.9)
                  | MiniLM + TAG   | 0.395 | 0.427  | 0.464   | 0.642 | 0.421   | (.1,.8,.1)
    Judaism       | MiniLM + TAG   | 0.363 | 0.387  | 0.432   | 0.649 | 0.388   | (.2,.8,.0)
    Law           | MiniLM         | 0.663 | 0.647  | 0.678   | 0.803 | 0.639   | (.2,.8)
                  | MiniLM + TAG   | 0.677 | 0.657  | 0.687   | 0.803 | 0.649   | (.1,.8,.1)
    Lifehacks     | MiniLM         | 0.714 | 0.601  | 0.617   | 0.703 | 0.553   | (.1,.9)
                  | MiniLM + TAG   | 0.714 | 0.621  | 0.631   | 0.703 | 0.568   | (.1,.8,.1)
    Linguistics   | MiniLM + TAG   | 0.584 | 0.588  | 0.630   | 0.794 | 0.587   | (.2,.8,.0)
    Literature    | MiniLM + TAG   | 0.871 | 0.878  | 0.889   | 0.934 | 0.876   | (.3,.7,.0)
    Martialarts   | MiniLM         | 0.630 | 0.599  | 0.645   | 0.796 | 0.596   | (.1,.9)
                  | MiniLM + TAG   | 0.640 | 0.628  | 0.660   | 0.796 | 0.612   | (.1,.8,.1)
    Money         | MiniLM         | 0.545 | 0.535  | 0.563   | 0.706 | 0.515   | (.2,.8)
                  | MiniLM + TAG   | 0.559 | 0.542  | 0.571   | 0.706 | 0.523   | (.1,.8,.1)
    Movies        | MiniLM         | 0.713 | 0.722  | 0.753   | 0.865 | 0.724   | (.1,.9)
                  | MiniLM + TAG   | 0.728 | 0.735  | 0.762   | 0.865 | 0.735   | (.1,.8,.1)
    Music         | MiniLM         | 0.508 | 0.447  | 0.476   | 0.602 | 0.418   | (.2,.8)
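The P@1 and NDCG@k figures reported for SE-PQA can be computed for any ranking with a few lines of code. This is a generic sketch of the standard metric definitions (log2 discount for NDCG), not the authors' evaluation script:

```python
import math

def precision_at_1(ranked_rels):
    """P@1: 1.0 if the top-ranked item is relevant (relevance > 0)."""
    return 1.0 if ranked_rels and ranked_rels[0] > 0 else 0.0

def ndcg_at_k(ranked_rels, k):
    """NDCG@k over graded relevance labels given in ranked order,
    normalized by the DCG of the ideal (sorted) ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores NDCG@k = 1.0; any misordering of graded labels scores strictly less.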
    
  11. Data from: EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

    • physionet.org
    Updated Jan 11, 2024
    Cite
    Konstantin Kotschenreuther (2024). EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems [Dataset]. http://doi.org/10.13026/25fx-f706
    Explore at:
    Dataset updated
    Jan 11, 2024
    Authors
    Konstantin Kotschenreuther
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion parameter Meta Llama 2 model, this dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summaries.

    This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. Additionally, accompanying the dataset is code facilitating question-and-answer pair generation from any medical and non-medical text.

    Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens, owing to hardware constraints. The large language model's nature in generating these question-and-answer pairs may introduce an underlying bias or a lack of diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the employment of more powerful large language models.

  12. STI BM25 Sequence Dataset

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Tingzhen Liu; Qianqian Xiong; Shengxi Zhang (2023). STI BM25 Sequence Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21321198.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tingzhen Liu; Qianqian Xiong; Shengxi Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Based on the Baidu STI dataset, we conducted a comparative study of the cost-performance of classical computational linguistics methods and large language models. This dataset discloses the relevant data from the study, including the original corpus and the BM25 sequences we calculated.
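BM25 sequences of the kind disclosed here are typically computed with the classic Okapi BM25 formula. The sketch below is a generic pure-Python implementation with the customary default parameters k1 = 1.5 and b = 0.75, not the exact configuration used in the study:

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` (a list of token lists) against
    `query_terms` using the Okapi BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N  # average document length
    df = Counter()                               # document frequencies
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF, kept non-negative by the +1 inside the log.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Sorting documents by these scores yields the ranked sequence for one query; repeating over all queries gives a BM25 run for the whole corpus.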

  13. Document Management and Retrieval System Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 3, 2025
    Cite
    Market Report Analytics (2025). Document Management and Retrieval System Report [Dataset]. https://www.marketreportanalytics.com/reports/document-management-and-retrieval-system-55257
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Document Management and Retrieval System (DMRS) market is experiencing robust growth, driven by the increasing need for efficient information management across diverse sectors. The rising volume of digital documents, coupled with stringent regulatory compliance requirements and the growing adoption of cloud-based solutions, are key factors fueling market expansion. Academic institutions, corporations, and the public sector are increasingly relying on DMRS to streamline workflows, enhance collaboration, and ensure data security.

    The market is segmented by application (Academic, Corporate, Public Sector) and type (Cloud-based, On-premises), with cloud-based solutions gaining significant traction due to their scalability, accessibility, and cost-effectiveness. Key players like Clarivate, Elsevier, and Digital Science are driving innovation through continuous product development and strategic partnerships. While the on-premises segment retains a presence, the shift towards cloud-based solutions is anticipated to continue, driven by the benefits of remote access and reduced infrastructure costs. Regional variations exist, with North America and Europe currently holding significant market shares, although Asia-Pacific is projected to witness substantial growth in the coming years, fueled by increasing digitalization and technological advancements. The competitive landscape is characterized by both established players and emerging companies offering specialized solutions, which makes for a dynamic market focused on continuous improvement and innovation.

    The forecast period (2025-2033) anticipates sustained growth, propelled by technological advancements like AI-powered search and retrieval capabilities, improved integration with other business applications, and the increasing demand for robust security features. The market is expected to consolidate somewhat, with larger players potentially acquiring smaller firms to expand their product portfolios and market reach. Despite the strong growth outlook, challenges remain, including data security concerns, integration complexities, and the need for user-friendly interfaces. Addressing these concerns through continuous innovation and user-centric design will be crucial for sustained market success. The market is expected to witness a gradual shift towards more sophisticated and integrated DMRS solutions, catering to the evolving needs of diverse user groups.

  14. AILA 2019 Precedent & Statute Retrieval Task

    • zenodo.org
    zip
    Updated Oct 3, 2020
    Cite
    Paheli Bhattacharya; Kripabandhu Ghosh; Saptarshi Ghosh; Arindam Pal; Parth Mehta; Arnab Bhattacharya; Prasenjit Majumder (2020). AILA 2019 Precedent & Statute Retrieval Task [Dataset]. http://doi.org/10.5281/zenodo.4063986
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 3, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paheli Bhattacharya; Kripabandhu Ghosh; Saptarshi Ghosh; Arindam Pal; Parth Mehta; Arnab Bhattacharya; Prasenjit Majumder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of the AILA (Artificial Intelligence for Legal Assistance) Track at FIRE 2019

    Track website : https://sites.google.com/view/fire-2019-aila/
    Conference website : http://fire.irsi.res.in/fire/2019/home

    In countries following the Common Law system (e.g., UK, USA, Canada, Australia, India), there are two primary sources of law – Statutes (established laws) and Precedents (prior cases). Statutes deal with applying legal principles to a situation (facts / scenario / circumstances which lead to filing the case). Precedents or prior cases help a lawyer understand how the Court has dealt with similar scenarios in the past, and prepare the legal reasoning accordingly.

    When a lawyer is presented with a situation (one that may lead to the filing of a case), an automatic system that identifies a set of related prior cases involving similar situations, as well as the statutes/acts best suited to that situation, would be very beneficial. Such a system would help not only lawyers but also ordinary citizens, giving them a preliminary understanding even before approaching a lawyer: where their legal problem fits, what legal actions they can pursue (through statutes), and what the outcomes of similar cases were (through precedents).

    Motivated by the above scenario, we propose two tasks here :

    • Task 1 : Identifying relevant prior cases for a given situation
    • Task 2 : Identifying most relevant statutes for a given situation

    Task Description:

    You will be given a set of 50 queries, each of which describes a situation.

    Task 1: Identifying relevant prior cases

    We provide ~3000 case documents of cases that were judged in the Supreme Court of India. For each query, the task is to retrieve the most similar / relevant case document with respect to the situation in the given query.

    Task 2: Identifying relevant statutes

    We have identified a set of 197 statutes (Sections of Acts) from Indian law that are relevant to some of the queries. We provide the title and description of these statutes. For each query, the task is to identify the most relevant statutes (from among the 197 statutes). Note that the task can be modelled either as an unsupervised retrieval task (searching for relevant statutes) or as a supervised classification task (e.g., predicting for each statute whether it is relevant). For the latter, the case documents provided for Task 1 can be utilised. However, if a team wishes to apply supervised models, it is their responsibility to create the necessary training data.
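
    Both tasks can be approached as unsupervised ranking problems. The sketch below is purely illustrative and not part of the track: it ranks documents against a query by TF-IDF cosine similarity over a tiny toy corpus. The corpus, tokenizer, and weighting scheme are all simplifying assumptions; real submissions would run over the ~3000 case documents (Task 1) or the 197 statute descriptions (Task 2).

```python
# Minimal TF-IDF cosine-similarity ranking sketch (illustrative only).
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    toks = [tokenize(d) for d in docs]
    vecs, idf = tfidf_vectors(toks)
    qtf = Counter(tokenize(query))
    qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}
    scores = [cosine(qvec, v) for v in vecs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy stand-ins for case documents / statute descriptions.
corpus = [
    "the court held the contract void for lack of consideration",
    "murder conviction upheld under section three hundred two",
    "property dispute over ancestral land and inheritance rights",
]
print(rank("appeal against murder conviction", corpus)[0])
```

    Stronger baselines for this kind of task typically use BM25 weighting or learned representations, but the retrieve-and-rank structure is the same.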

  15. Data underlying the master thesis: Exploring Copula-Based Models for the...

    • figshare.com
    • data.4tu.nl
    txt
    Updated Jun 1, 2023
    Cite
    Dimitris Theodorakopoulos (2023). Data underlying the master thesis: Exploring Copula-Based Models for the Stochastic Simulation of Information Retrieval Evaluation Data [Dataset]. http://doi.org/10.4121/21739355.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Dimitris Theodorakopoulos
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the results of the experiments that I ran for my master thesis. The full code (and more) can be found at https://github.com/dimitris93/msc-thesis

  16. RELISH-Aspire

    • figshare.com
    json
    Updated Mar 26, 2022
    Cite
    Sheshera Mysore (2022). RELISH-Aspire [Dataset]. http://doi.org/10.6084/m9.figshare.19425506.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 26, 2022
    Dataset provided by
    figshare
    Authors
    Sheshera Mysore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a copy of the RELISH dataset used in the paper "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity" by Sheshera Mysore, Arman Cohan, and Tom Hope. The RELISH dataset was first introduced in Brown et al. 2019. For further details of the paper, how this dataset was compiled, and how it was used, see: https://github.com/allenai/aspire

    The contents of the dataset are as follows:

    abstracts-relish.jsonl: jsonl file containing the paper-id, abstracts, and titles for the queries and candidates which are part of the dataset.
    relish-queries-release.csv: Metadata associated with every query.
    test-pid2anns-relish.json: JSON file with the query paper-id and candidate paper-ids for every query paper in the dataset. Use these files in conjunction with abstracts-relish.jsonl to generate files for use in model evaluation.
    relish-evaluation_splits.json: Paper-ids for the splits to use in reporting evaluation numbers.

    The script aspire/src/evaluation/ranking_eval.py, included in the github repo accompanying this dataset, implements the evaluation protocol and computes evaluation metrics. Please see the paper for a description of the recommended experimental protocol for reporting evaluation metrics.
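
    As one way to consume the release files together, the sketch below reads the jsonl abstracts file into a paper-id lookup and expands the per-query annotations into (query, candidate) pairs. The exact field names ("paper_id", "title", "abstract") and the flat pid-to-list shape of the annotations JSON are assumptions for illustration; check the actual files for the real keys, and note the sample files written here are tiny synthetic stand-ins.

```python
# Illustrative loader pairing an abstracts jsonl file with a
# query-to-candidates JSON file (field names are assumed, not official).
import json, os, tempfile

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def build_pairs(abstracts_path, anns_path):
    """Expand each query's candidate list into (query title, candidate title) pairs."""
    papers = {rec["paper_id"]: rec for rec in load_jsonl(abstracts_path)}
    with open(anns_path) as f:
        anns = json.load(f)
    pairs = []
    for qid, cand_ids in anns.items():
        for cid in cand_ids:
            pairs.append((papers[qid]["title"], papers[cid]["title"]))
    return pairs

# Tiny synthetic stand-ins for the real release files.
tmp = tempfile.mkdtemp()
abs_path = os.path.join(tmp, "abstracts-relish.jsonl")
ann_path = os.path.join(tmp, "test-pid2anns-relish.json")
with open(abs_path, "w") as f:
    for pid, title in [("p1", "Query paper"), ("p2", "Candidate A"), ("p3", "Candidate B")]:
        f.write(json.dumps({"paper_id": pid, "title": title, "abstract": "..."}) + "\n")
with open(ann_path, "w") as f:
    json.dump({"p1": ["p2", "p3"]}, f)

pairs = build_pairs(abs_path, ann_path)
print(len(pairs))  # → 2
```

    The resulting pairs would then be scored by a similarity model and evaluated with the splits file and the repo's ranking_eval.py.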

  17. perspective-information-retrieval-perspectrum

    • huggingface.co
    Updated Oct 14, 2024
    Cite
    Fengyu Cai (2024). perspective-information-retrieval-perspectrum [Dataset]. https://huggingface.co/datasets/trumancai/perspective-information-retrieval-perspectrum
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 14, 2024
    Authors
    Fengyu Cai
    Description

    The trumancai/perspective-information-retrieval-perspectrum dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  18. Tuning of the information retrieval parameters for the Question-Answering...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 30, 2013
    Cite
    Lovis, Christian; Teodoro, Douglas; Harbarth, Stephan; Huttner, Angela; Gobeill, Julien; Ruch, Patrick; Wipfli, Rolf; Pasche, Emilie (2013). Tuning of the information retrieval parameters for the Question-Answering task. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001740411
    Explore at:
    Dataset updated
    Apr 30, 2013
    Authors
    Lovis, Christian; Teodoro, Douglas; Harbarth, Stephan; Huttner, Angela; Gobeill, Julien; Ruch, Patrick; Wipfli, Rolf; Pasche, Emilie
    Description

    Tuning of the information retrieval parameters for the Question-Answering task.

  19. Citation Trends for "Protecting Data Privacy in Private Information...

    • shibatadb.com
    Updated Jun 15, 2000
    Cite
    Yubetsu (2000). Citation Trends for "Protecting Data Privacy in Private Information Retrieval Schemes" [Dataset]. https://www.shibatadb.com/article/gmVF6nhT
    Explore at:
    Dataset updated
    Jun 15, 2000
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2001 - 2025
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Protecting Data Privacy in Private Information Retrieval Schemes".

  20. Patterns of Scholarly Communication in Global Information Retrieval...

    • figshare.com
    7z
    Updated Oct 11, 2022
    Cite
    Shakil Ahmad (2022). Patterns of Scholarly Communication in Global Information Retrieval Research: A bibliometric analysis (1954-2021) [Dataset]. http://doi.org/10.6084/m9.figshare.21312366.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Oct 11, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shakil Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patterns of Scholarly Communication in Global Information Retrieval Research: A bibliometric analysis (1954-2021)
