100+ datasets found
  1. PRIVATE Patent Application Information Retrieval (PAIR)

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 15, 2022
    Cite
    Patents (2022). PRIVATE Patent Application Information Retrieval (PAIR) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/private-patent-application-information-retrieval-pair
    Explore at:
    Dataset updated
    Jul 15, 2022
    Dataset provided by
    Patents
    Description

    Offers exclusive access to patent application status information for unpublished patent applications, available only to the applicant/inventor or his/her representative(s). Private PAIR includes bibliographic data, patent term adjustments, continuity data, foreign priority, and address and attorney/agent information from the Patent Application Locating and Monitoring (PALM) System; PDF images of documents (including correspondence) and a transaction history from the Content Management System (CMS), formerly the Image File Wrapper (IFW) System; and fee information from the Fee Processing Next Generation (FPNG) System. Search is by application number (with or without the two-digit series code), control number, or Patent Cooperation Treaty (PCT) number. Private PAIR requires users to establish a USPTO.gov account, a customer number, and a password. For more information about establishing a USPTO.gov account and customer number, see https://www.uspto.gov/patents-application-process/applying-online/getting-started-new-users. Unavailable during database backups (Saturday, Tuesday, and Thursday from 04:30 - 04:45 AM U.S. Eastern Time, and Sunday 00:01 - 04:00 AM U.S. Eastern Time). Updated daily. https://ppair-my.uspto.gov/pair/PrivatePair

  2. DORIS-MAE-v1

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Oct 17, 2023
    Cite
    Jianyou Wang; Kaicheng Wang; Xiaoyue Wang; Prudhviraj Naidu; Leon Bergen; Ramamohan Paturi (2023). DORIS-MAE-v1 [Dataset]. http://doi.org/10.5281/zenodo.8299749
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jianyou Wang; Kaicheng Wang; Xiaoyue Wang; Prudhviraj Naidu; Leon Bergen; Ramamohan Paturi
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high costs and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research.

    Documentation for the DORIS-MAE dataset is publicly available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. This upload contains both DORIS-MAE dataset version 1 and ada-002 vector embeddings for all queries and related abstracts (used in candidate pool creation). DORIS-MAE dataset version 1 comprises four main sub-datasets, each serving a distinct purpose.

    The Query dataset contains 100 human-crafted complex queries spanning five categories: ML, NLP, CV, AI, and Composite, with 20 queries per category. Queries are broken down into aspects (3 to 9 per query) and sub-aspects (0 to 6 per aspect, with 0 signifying that no further breakdown is required). For each query, a corresponding candidate pool of 99 to 138 relevant paper abstracts is provided.

    The Corpus dataset is composed of 363,133 abstracts from computer science papers published between 2011 and 2021 and sourced from arXiv. Each entry includes the title, original abstract, URL, primary and secondary categories, and citation information retrieved from Semantic Scholar. A masked version of each abstract is also provided, facilitating the automated creation of queries.

    The Annotation dataset includes generated annotations for all 165,144 question pairs, each comprising an aspect/sub-aspect and a corresponding paper abstract from the query's candidate pool. It includes the original text generated by ChatGPT (version chatgpt-3.5-turbo-0301) explaining its decision-making process, along with a three-level relevance score (0, 1, or 2) representing ChatGPT's final decision.

    Finally, the Test Set dataset contains human annotations for a random selection of 250 question pairs used in hypothesis testing. It records each of the three human annotators' final decisions as a three-level relevance score (0, 1, or 2).

    The file "ada_embedding_for_DORIS-MAE_v1.pickle" contains text embeddings for the DORIS-MAE dataset, generated by OpenAI's ada-002 model. The structure of the file is as follows:

    ada_embedding_for_DORIS-MAE_v1.pickle
    ├── "Query"
    │   ├── query_id_1 (Embedding of query_1)
    │   ├── query_id_2 (Embedding of query_2)
    │   ├── query_id_3 (Embedding of query_3)
    │   └── ...
    └── "Corpus"
        ├── corpus_id_1 (Embedding of abstract_1)
        ├── corpus_id_2 (Embedding of abstract_2)
        ├── corpus_id_3 (Embedding of abstract_3)
        └── ...
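A minimal sketch of how these embeddings could be read back and used for ranking, assuming only the pickle structure shown above (the id key names and the cosine-similarity ranking are illustrative, not prescribed by the dataset):

```python
import pickle

def load_embeddings(path):
    """Load the pickle described above: a dict whose top-level keys are
    "Query" and "Corpus", each mapping ids to ada-002 vectors."""
    with open(path, "rb") as f:
        return pickle.load(f)

def cosine(u, v):
    """Cosine similarity, the usual scoring function for dense embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def rank_corpus(query_vec, corpus_vecs, k=10):
    """Return the k corpus ids most similar to one query embedding."""
    scored = sorted(corpus_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [cid for cid, _ in scored[:k]]
```

With the real file, `rank_corpus(load_embeddings(path)["Query"][qid], load_embeddings(path)["Corpus"])` would give a dense-retrieval baseline ranking for one query.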

  3. Finsights-Grey-RAG-Effective-Information-Retrieval-logs

    • huggingface.co
    Updated Mar 1, 2025
    Cite
    rajagopal (2025). Finsights-Grey-RAG-Effective-Information-Retrieval-logs [Dataset]. https://huggingface.co/datasets/rajapower1/Finsights-Grey-RAG-Effective-Information-Retrieval-logs
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 1, 2025
    Authors
    rajagopal
    Description

    The rajapower1/Finsights-Grey-RAG-Effective-Information-Retrieval-logs dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  4. Models and Data for Simple Applications of BERT for Ad Hoc Document Retrieval

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Yang, Wei (2020). Models and Data for Simple Applications of BERT for Ad Hoc Document Retrieval [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3241944
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Lin, Jimmy
    Zhang, Haotian
    Akkalyoncu Yilmaz, Zeynep
    Yang, Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This submission includes all pretrained models, test data, and prediction files for the arXiv paper "Simple Applications of BERT for Ad Hoc Document Retrieval". Please follow the instructions in the Birch repo to reproduce the results.

  5. Computer-Assisted Information Retrieval Service System for Music

    • rrid.site
    • dknet.org
    • +2more
    Updated Aug 25, 2025
    Cite
    (2025). Computer-Assisted Information Retrieval Service System for Music [Dataset]. http://identifiers.org/RRID:SCR_008177
    Explore at:
    Dataset updated
    Aug 25, 2025
    Description

    CAIRSS is a bibliographic database of older music research literature (prior to 1993) in music education, music psychology, music therapy, and music medicine. Citations have been taken from 1,354 different journal titles, 18 of which are primary journals, meaning that every article ever to appear in them is included. The primary journals are:

    * Arts in Psychotherapy
    * Bulletin of the Council for Research in Music Education
    * Bulletin of the National Association for Music Therapy
    * Contributions to Music Education
    * Hospital Music Newsletter
    * International Journal of Arts Medicine
    * Journal of the Association for Music and Imagery
    * Journal of Music Teacher Education
    * Journal of Music Therapy
    * Journal of Research in Music Education
    * Medical Problems of Performing Artists
    * Music Perception
    * Music Therapy
    * Music Therapy Perspectives
    * Psychology of Music
    * Psychomusicology
    * The Quarterly
    * Applications of Research to Music Education

  6. IR Benchmarks

    • anthology.aicmu.ac.cn
    • webis.de
    Updated 2023
    Cite
    Martin Potthast; Benno Stein; Matthias Hagen (2023). IR Benchmarks [Dataset]. https://anthology.aicmu.ac.cn/data/ir-benchmarks.html
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    The Web Technology & Information Systems Network
    Leipzig University
    Friedrich Schiller University Jena
    Bauhaus-Universität Weimar
    Authors
    Martin Potthast; Benno Stein; Matthias Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of information retrieval benchmarks covering 15 corpora (1.9 billion documents) on which 32 well-known shared tasks are based. We filled the leaderboards with Docker images of 50 standard retrieval approaches. Within this setup, we were able to automatically run and evaluate the 50 approaches on the 32 tasks (1,600 runs). All benchmarks are added as training datasets because their qrels are already publicly available. Please find a detailed tutorial on how to submit approaches on GitHub.

    View on TIRA: https://tira.io/task-overview/ir-benchmarks

  7. TREC 2022 NeuCLIR Dataset

    • gimi9.com
    • +2more
    Updated Sep 13, 2025
    Cite
    (2025). TREC 2022 NeuCLIR Dataset [Dataset]. https://gimi9.com/dataset/data-gov_2022-neuclir-dataset/
    Explore at:
    Dataset updated
    Sep 13, 2025
    Description

    Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluation forums for more than twenty years, but recent advances in the application of deep learning to information retrieval (IR) warrant a new, large-scale effort that will enable exploration of classical and modern IR techniques for this task.

  8. company-profiles

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    Ojas Srivastava (2023). company-profiles [Dataset]. https://www.kaggle.com/datasets/ojassrivastava18/company-information-information-retrieval-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ojas Srivastava
    Description

    This dataset contains information about different companies. There are 41 text files, each containing information about a different company, and each file has more than 500 words. The dataset can be used to test information retrieval models and other NLP-based models.
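A tiny retrieval baseline is enough to exercise a corpus of this shape. The sketch below (the directory name and the term-count scoring are illustrative assumptions, not part of the dataset) loads the text files and ranks them by raw query-term frequency:

```python
import re
from collections import Counter
from pathlib import Path

def load_profiles(directory):
    """Read every .txt file in `directory` into a {name: text} dict.
    The directory name is whatever the 41 files were unpacked into."""
    return {p.stem: p.read_text(encoding="utf-8")
            for p in Path(directory).glob("*.txt")}

def rank_by_overlap(query, profiles, k=5):
    """Rank company files by total occurrences of the query terms —
    a deliberately simple baseline for sanity-checking an IR setup."""
    q_terms = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for name, text in profiles.items():
        terms = Counter(re.findall(r"[a-z]+", text.lower()))
        scored.append((sum(terms[t] for t in q_terms), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

A more serious evaluation would swap the scoring function for TF-IDF or BM25 while keeping the same loading code.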

  9. vietnamese-retrieval

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    Thanh Dat Hoang (2024). vietnamese-retrieval [Dataset]. https://huggingface.co/datasets/thanhdath/vietnamese-retrieval
    Explore at:
    Dataset updated
    Feb 8, 2024
    Authors
    Thanh Dat Hoang
    Description

    Dataset Card for "vietnamese-retrieval"

    More Information needed

  10. SE-PQA: a Resource for Personalized Community Question Answering

    • zenodo.org
    csv, zip
    Updated Feb 5, 2024
    Cite
    Kasela Pranav; Pasi Gabriella; Perego Raffaele; Marco Braga (2024). SE-PQA: a Resource for Personalized Community Question Answering [Dataset]. http://doi.org/10.5281/zenodo.7940964
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kasela Pranav; Pasi Gabriella; Perego Raffaele; Marco Braga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Personalization in Information Retrieval has been studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets for conducting large-scale experiments and evaluating models for personalized search. This paper helps fill this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new resource for designing and evaluating personalized models for the two tasks of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with both questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments show that SE-PQA is well suited to training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits, in terms of robustness and generalization, of combining data from multiple communities for personalization purposes.

    Performance on all communities separately:

    Community     | Model (BM25 +) | P@1   | NDCG@3 | NDCG@10 | R@100 | MAP@100 | λ
    --------------|----------------|-------|--------|---------|-------|---------|-----------
    Academia      | MiniLM         | 0.438 | 0.382  | 0.395   | 0.489 | 0.344   | (.1,.9)
                  | MiniLM + TAG   | 0.453 | 0.392  | 0.403   | 0.489 | 0.352   | (.1,.8,.1)
    Anime         | MiniLM + TAG   | 0.650 | 0.682  | 0.714   | 0.856 | 0.683   | (.1,.9,.0)
    Apple         | MiniLM         | 0.327 | 0.351  | 0.381   | 0.514 | 0.349   | (.1,.9)
                  | MiniLM + TAG   | 0.335 | 0.361  | 0.389   | 0.514 | 0.357   | (.1,.8,.1)
    Bicycles      | MiniLM         | 0.405 | 0.380  | 0.421   | 0.600 | 0.365   | (.1,.9)
                  | MiniLM + TAG   | 0.436 | 0.405  | 0.441   | 0.600 | 0.386   | (.1,.8,.1)
    Boardgames    | MiniLM         | 0.681 | 0.694  | 0.728   | 0.866 | 0.692   | (.1,.9)
                  | MiniLM + TAG   | 0.696 | 0.702  | 0.736   | 0.866 | 0.699   | (.1,.8,.1)
    Buddhism      | MiniLM + TAG   | 0.490 | 0.387  | 0.397   | 0.544 | 0.334   | (.3,.7,.0)
    Christianity  | MiniLM         | 0.534 | 0.505  | 0.555   | 0.783 | 0.497   | (.2,.8)
                  | MiniLM + TAG   | 0.549 | 0.521  | 0.564   | 0.783 | 0.507   | (.1,.8,.1)
    Cooking       | MiniLM         | 0.600 | 0.567  | 0.600   | 0.719 | 0.553   | (.1,.9)
                  | MiniLM + TAG   | 0.619 | 0.583  | 0.614   | 0.719 | 0.568   | (.1,.8,.1)
    DIY           | MiniLM         | 0.323 | 0.313  | 0.346   | 0.501 | 0.302   | (.1,.9)
                  | MiniLM + TAG   | 0.335 | 0.324  | 0.356   | 0.501 | 0.312   | (.1,.8,.1)
    Expatriates   | MiniLM + TAG   | 0.596 | 0.653  | 0.682   | 0.832 | 0.645   | (.1,.9,.0)
    Fitness       | MiniLM + TAG   | 0.568 | 0.575  | 0.613   | 0.760 | 0.567   | (.2,.8,.0)
    Freelancing   | MiniLM + TAG   | 0.513 | 0.472  | 0.506   | 0.654 | 0.457   | (.1,.9,.0)
    Gaming        | MiniLM         | 0.510 | 0.534  | 0.562   | 0.686 | 0.532   | (.1,.9)
                  | MiniLM + TAG   | 0.519 | 0.547  | 0.571   | 0.686 | 0.541   | (.1,.8,.1)
    Gardening     | MiniLM         | 0.344 | 0.362  | 0.396   | 0.520 | 0.359   | (.1,.9)
                  | MiniLM + TAG   | 0.345 | 0.369  | 0.399   | 0.520 | 0.363   | (.1,.8,.1)
    Genealogy     | MiniLM + TAG   | 0.592 | 0.605  | 0.631   | 0.779 | 0.594   | (.3,.7,.0)
    Health        | MiniLM + TAG   | 0.718 | 0.765  | 0.797   | 0.934 | 0.765   | (.2,.8,.0)
    Hermeneutics  | MiniLM         | 0.589 | 0.538  | 0.593   | 0.828 | 0.526   | (.2,.8)
                  | MiniLM + TAG   | 0.632 | 0.570  | 0.617   | 0.828 | 0.552   | (.1,.8,.1)
    Hinduism      | MiniLM         | 0.388 | 0.415  | 0.459   | 0.686 | 0.416   | (.2,.8)
                  | MiniLM + TAG   | 0.382 | 0.410  | 0.457   | 0.686 | 0.412   | (.1,.8,.1)
    History       | MiniLM + TAG   | 0.740 | 0.735  | 0.764   | 0.862 | 0.730   | (.2,.8,.0)
    Hsm           | MiniLM + TAG   | 0.666 | 0.707  | 0.737   | 0.870 | 0.690   | (.2,.8,.0)
    Interpersonal | MiniLM + TAG   | 0.663 | 0.617  | 0.653   | 0.739 | 0.604   | (.2,.8,.0)
    Islam         | MiniLM         | 0.382 | 0.412  | 0.453   | 0.642 | 0.410   | (.1,.9)
                  | MiniLM + TAG   | 0.395 | 0.427  | 0.464   | 0.642 | 0.421   | (.1,.8,.1)
    Judaism       | MiniLM + TAG   | 0.363 | 0.387  | 0.432   | 0.649 | 0.388   | (.2,.8,.0)
    Law           | MiniLM         | 0.663 | 0.647  | 0.678   | 0.803 | 0.639   | (.2,.8)
                  | MiniLM + TAG   | 0.677 | 0.657  | 0.687   | 0.803 | 0.649   | (.1,.8,.1)
    Lifehacks     | MiniLM         | 0.714 | 0.601  | 0.617   | 0.703 | 0.553   | (.1,.9)
                  | MiniLM + TAG   | 0.714 | 0.621  | 0.631   | 0.703 | 0.568   | (.1,.8,.1)
    Linguistics   | MiniLM + TAG   | 0.584 | 0.588  | 0.630   | 0.794 | 0.587   | (.2,.8,.0)
    Literature    | MiniLM + TAG   | 0.871 | 0.878  | 0.889   | 0.934 | 0.876   | (.3,.7,.0)
    Martialarts   | MiniLM         | 0.630 | 0.599  | 0.645   | 0.796 | 0.596   | (.1,.9)
                  | MiniLM + TAG   | 0.640 | 0.628  | 0.660   | 0.796 | 0.612   | (.1,.8,.1)
    Money         | MiniLM         | 0.545 | 0.535  | 0.563   | 0.706 | 0.515   | (.2,.8)
                  | MiniLM + TAG   | 0.559 | 0.542  | 0.571   | 0.706 | 0.523   | (.1,.8,.1)
    Movies        | MiniLM         | 0.713 | 0.722  | 0.753   | 0.865 | 0.724   | (.1,.9)
                  | MiniLM + TAG   | 0.728 | 0.735  | 0.762   | 0.865 | 0.735   | (.1,.8,.1)
    Music         | MiniLM         | 0.508 | 0.447  | 0.476   | 0.602 | 0.418   | (.2,.8)
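The P@1 and NDCG@k figures reported for SE-PQA can be computed for any ranking with a few lines of code. This is a generic sketch of the standard metric definitions (log2 discount for NDCG), not the authors' evaluation script:

```python
import math

def precision_at_1(ranked_rels):
    """P@1: 1.0 if the top-ranked item is relevant (relevance > 0)."""
    return 1.0 if ranked_rels and ranked_rels[0] > 0 else 0.0

def ndcg_at_k(ranked_rels, k):
    """NDCG@k over graded relevance labels given in ranked order,
    normalized by the DCG of the ideal (sorted) ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores NDCG@k = 1.0; any misordering of graded labels scores strictly less.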
    
  11. Data from: EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

    • physionet.org
    Updated Jan 11, 2024
    Cite
    Konstantin Kotschenreuther (2024). EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems [Dataset]. http://doi.org/10.13026/25fx-f706
    Explore at:
    Dataset updated
    Jan 11, 2024
    Authors
    Konstantin Kotschenreuther
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion parameter Meta Llama 2 model, this dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summaries.

    This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. Additionally, accompanying the dataset is code facilitating question-and-answer pair generation from any medical and non-medical text.

    Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens, owing to hardware constraints. The large language model's nature in generating these question-and-answer pairs may introduce an underlying bias or a lack of diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the employment of more powerful large language models.

  12. STI BM25 Sequence Dataset

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Tingzhen Liu; Qianqian Xiong; Shengxi Zhang (2023). STI BM25 Sequence Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21321198.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tingzhen Liu; Qianqian Xiong; Shengxi Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Based on the Baidu STI dataset, we conducted a comparative study of the cost-performance of classical computational linguistics methods and large language models. This dataset discloses the relevant data from the study, including the original corpus and the BM25 sequences we calculated.
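BM25 sequences of the kind disclosed here are typically computed with the classic Okapi BM25 formula. The sketch below is a generic pure-Python implementation with the customary default parameters k1 = 1.5 and b = 0.75, not the exact configuration used in the study:

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` (a list of token lists) against
    `query_terms` using the Okapi BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N  # average document length
    df = Counter()                               # document frequencies
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF, kept non-negative by the +1 inside the log.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Sorting documents by these scores yields the ranked sequence for one query; repeating over all queries gives a BM25 run for the whole corpus.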

  13. Document Management and Retrieval System Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 3, 2025
    Cite
    Market Report Analytics (2025). Document Management and Retrieval System Report [Dataset]. https://www.marketreportanalytics.com/reports/document-management-and-retrieval-system-55257
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Document Management and Retrieval System (DMRS) market is experiencing robust growth, driven by the increasing need for efficient information management across diverse sectors. The rising volume of digital documents, coupled with stringent regulatory compliance requirements and the growing adoption of cloud-based solutions, are key factors fueling market expansion. Academic institutions, corporations, and the public sector are increasingly relying on DMRS to streamline workflows, enhance collaboration, and ensure data security.

    The market is segmented by application (Academic, Corporate, Public Sector) and type (Cloud-based, On-premises), with cloud-based solutions gaining significant traction due to their scalability, accessibility, and cost-effectiveness. Key players like Clarivate, Elsevier, and Digital Science are driving innovation through continuous product development and strategic partnerships. While the on-premises segment retains a presence, the shift towards cloud-based solutions is anticipated to continue, driven by the benefits of remote access and reduced infrastructure costs. Regional variations exist, with North America and Europe currently holding significant market shares, although Asia-Pacific is projected to witness substantial growth in the coming years, fueled by increasing digitalization and technological advancements. The competitive landscape is characterized by both established players and emerging companies offering specialized solutions, which makes for a dynamic market focused on continuous improvement and innovation.

    The forecast period (2025-2033) anticipates sustained growth, propelled by technological advancements like AI-powered search and retrieval capabilities, improved integration with other business applications, and the increasing demand for robust security features. The market is expected to consolidate somewhat, with larger players potentially acquiring smaller firms to expand their product portfolios and market reach. Despite the strong growth outlook, challenges remain, including data security concerns, integration complexities, and the need for user-friendly interfaces. Addressing these concerns through continuous innovation and user-centric design will be crucial for sustained market success. The market is expected to witness a gradual shift towards more sophisticated and integrated DMRS solutions, catering to the evolving needs of diverse user groups.

  14. AILA 2019 Precedent & Statute Retrieval Task

    • zenodo.org
    zip
    Updated Oct 3, 2020
    Cite
    Paheli Bhattacharya; Kripabandhu Ghosh; Saptarshi Ghosh; Arindam Pal; Parth Mehta; Arnab Bhattacharya; Prasenjit Majumder (2020). AILA 2019 Precedent & Statute Retrieval Task [Dataset]. http://doi.org/10.5281/zenodo.4063986
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 3, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paheli Bhattacharya; Kripabandhu Ghosh; Saptarshi Ghosh; Arindam Pal; Parth Mehta; Arnab Bhattacharya; Prasenjit Majumder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of the AILA (Artificial Intelligence for Legal Assistance) Track at FIRE 2019

    Track website : https://sites.google.com/view/fire-2019-aila/
    Conference website : http://fire.irsi.res.in/fire/2019/home

    In countries following the Common Law system (e.g., UK, USA, Canada, Australia, India), there are two primary sources of law – Statutes (established laws) and Precedents (prior cases). Statutes deal with applying legal principles to a situation (facts / scenario / circumstances which lead to filing the case). Precedents or prior cases help a lawyer understand how the Court has dealt with similar scenarios in the past, and prepare the legal reasoning accordingly.

    When a lawyer is presented with a situation (one that may lead to the filing of a case), an automatic system that identifies a set of related prior cases involving similar situations, as well as the statutes/acts best suited to that situation, would be very beneficial. Such a system would help not only lawyers but also ordinary citizens, giving them a preliminary understanding even before approaching a lawyer: where their legal problem fits, what legal actions they can pursue (through statutes), and what the outcomes of similar cases were (through precedents).

    Motivated by the above scenario, we propose two tasks here :

    • Task 1 : Identifying relevant prior cases for a given situation
    • Task 2 : Identifying most relevant statutes for a given situation

    Task Description:

    You will be given a set of 50 queries, each of which describes a situation.

    Task 1: Identifying relevant prior cases

    We provide ~3000 case documents of cases that were judged in the Supreme Court of India. For each query, the task is to retrieve the most similar / relevant case document with respect to the situation in the given query.

    Task 2: Identifying relevant statutes

    We have identified a set of 197 statutes (Sections of Acts) from Indian law that are relevant to some of the queries. We provide the title and description of these statutes. For each query, the task is to identify the most relevant statutes (from among the 197 statutes). Note that the task can be modelled either as an unsupervised retrieval task (searching for relevant statutes) or as a supervised classification task (e.g., predicting for each statute whether it is relevant). For the latter, the case documents provided for Task 1 can be utilised. However, if a team wishes to apply supervised models, it is their responsibility to create the necessary training data.
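
    Both tasks can be approached as unsupervised ranking problems. The sketch below is purely illustrative and not part of the track: it ranks documents against a query by TF-IDF cosine similarity over a tiny toy corpus. The corpus, tokenizer, and weighting scheme are all simplifying assumptions; real submissions would run over the ~3000 case documents (Task 1) or the 197 statute descriptions (Task 2).

```python
# Minimal TF-IDF cosine-similarity ranking sketch (illustrative only).
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    toks = [tokenize(d) for d in docs]
    vecs, idf = tfidf_vectors(toks)
    qtf = Counter(tokenize(query))
    qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}
    scores = [cosine(qvec, v) for v in vecs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy stand-ins for case documents / statute descriptions.
corpus = [
    "the court held the contract void for lack of consideration",
    "murder conviction upheld under section three hundred two",
    "property dispute over ancestral land and inheritance rights",
]
print(rank("appeal against murder conviction", corpus)[0])
```

    Stronger baselines for this kind of task typically use BM25 weighting or learned representations, but the retrieve-and-rank structure is the same.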

  15. Data underlying the master thesis: Exploring Copula-Based Models for the...

    • figshare.com
    • data.4tu.nl
    txt
    Updated Jun 1, 2023
    Cite
    Dimitris Theodorakopoulos (2023). Data underlying the master thesis: Exploring Copula-Based Models for the Stochastic Simulation of Information Retrieval Evaluation Data [Dataset]. http://doi.org/10.4121/21739355.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Dimitris Theodorakopoulos
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the results of the experiments that I ran for my master thesis. The full code (and more) can be found at https://github.com/dimitris93/msc-thesis

  16. RELISH-Aspire

    • figshare.com
    json
    Updated Mar 26, 2022
    Cite
    Sheshera Mysore (2022). RELISH-Aspire [Dataset]. http://doi.org/10.6084/m9.figshare.19425506.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 26, 2022
    Dataset provided by
    figshare
    Authors
    Sheshera Mysore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a copy of the RELISH dataset used in the paper "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity" by Sheshera Mysore, Arman Cohan, and Tom Hope. The RELISH dataset was first introduced in Brown et al. 2019. For further details of the paper, how this dataset was compiled, and how it was used, see: https://github.com/allenai/aspire

    The contents of the dataset are as follows:

    abstracts-relish.jsonl: jsonl file containing the paper-id, abstracts, and titles for the queries and candidates which are part of the dataset.
    relish-queries-release.csv: Metadata associated with every query.
    test-pid2anns-relish.json: JSON file with the query paper-id and candidate paper-ids for every query paper in the dataset. Use these files in conjunction with abstracts-relish.jsonl to generate files for use in model evaluation.
    relish-evaluation_splits.json: Paper-ids for the splits to use in reporting evaluation numbers.

    The script aspire/src/evaluation/ranking_eval.py, included in the github repo accompanying this dataset, implements the evaluation protocol and computes evaluation metrics. Please see the paper for a description of the recommended experimental protocol for reporting evaluation metrics.
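
    As one way to consume the release files together, the sketch below reads the jsonl abstracts file into a paper-id lookup and expands the per-query annotations into (query, candidate) pairs. The exact field names ("paper_id", "title", "abstract") and the flat pid-to-list shape of the annotations JSON are assumptions for illustration; check the actual files for the real keys, and note the sample files written here are tiny synthetic stand-ins.

```python
# Illustrative loader pairing an abstracts jsonl file with a
# query-to-candidates JSON file (field names are assumed, not official).
import json, os, tempfile

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def build_pairs(abstracts_path, anns_path):
    """Expand each query's candidate list into (query title, candidate title) pairs."""
    papers = {rec["paper_id"]: rec for rec in load_jsonl(abstracts_path)}
    with open(anns_path) as f:
        anns = json.load(f)
    pairs = []
    for qid, cand_ids in anns.items():
        for cid in cand_ids:
            pairs.append((papers[qid]["title"], papers[cid]["title"]))
    return pairs

# Tiny synthetic stand-ins for the real release files.
tmp = tempfile.mkdtemp()
abs_path = os.path.join(tmp, "abstracts-relish.jsonl")
ann_path = os.path.join(tmp, "test-pid2anns-relish.json")
with open(abs_path, "w") as f:
    for pid, title in [("p1", "Query paper"), ("p2", "Candidate A"), ("p3", "Candidate B")]:
        f.write(json.dumps({"paper_id": pid, "title": title, "abstract": "..."}) + "\n")
with open(ann_path, "w") as f:
    json.dump({"p1": ["p2", "p3"]}, f)

pairs = build_pairs(abs_path, ann_path)
print(len(pairs))  # → 2
```

    The resulting pairs would then be scored by a similarity model and evaluated with the splits file and the repo's ranking_eval.py.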

  17. perspective-information-retrieval-perspectrum

    • huggingface.co
    Updated Oct 14, 2024
    Cite
    Fengyu Cai (2024). perspective-information-retrieval-perspectrum [Dataset]. https://huggingface.co/datasets/trumancai/perspective-information-retrieval-perspectrum
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 14, 2024
    Authors
    Fengyu Cai
    Description

    The trumancai/perspective-information-retrieval-perspectrum dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  18. Tuning of the information retrieval parameters for the Question-Answering...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 30, 2013
    Cite
    Lovis, Christian; Teodoro, Douglas; Harbarth, Stephan; Huttner, Angela; Gobeill, Julien; Ruch, Patrick; Wipfli, Rolf; Pasche, Emilie (2013). Tuning of the information retrieval parameters for the Question-Answering task. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001740411
    Explore at:
    Dataset updated
    Apr 30, 2013
    Authors
    Lovis, Christian; Teodoro, Douglas; Harbarth, Stephan; Huttner, Angela; Gobeill, Julien; Ruch, Patrick; Wipfli, Rolf; Pasche, Emilie
    Description

    Tuning of the information retrieval parameters for the Question-Answering task.

  19. Citation Trends for "Protecting Data Privacy in Private Information...

    • shibatadb.com
    Updated Jun 15, 2000
    Cite
    Yubetsu (2000). Citation Trends for "Protecting Data Privacy in Private Information Retrieval Schemes" [Dataset]. https://www.shibatadb.com/article/gmVF6nhT
    Explore at:
    Dataset updated
    Jun 15, 2000
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2001 - 2025
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Protecting Data Privacy in Private Information Retrieval Schemes".

  20. Patterns of Scholarly Communication in Global Information Retrieval...

    • figshare.com
    7z
    Updated Oct 11, 2022
    Cite
    Shakil Ahmad (2022). Patterns of Scholarly Communication in Global Information Retrieval Research: A bibliometric analysis (1954-2021) [Dataset]. http://doi.org/10.6084/m9.figshare.21312366.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Oct 11, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shakil Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patterns of Scholarly Communication in Global Information Retrieval Research: A bibliometric analysis (1954-2021)
