Offers exclusive access to patent application status information for unpublished patent applications, available only to the applicant/inventor or his/her representative(s). Private PAIR includes bibliographic data, patent term adjustments, continuity data, foreign priority, and address and attorney/agent information from the Patent Application Locating and Monitoring (PALM) System; PDF images of documents (including correspondence) and a transaction history from the Content Management System (CMS), formerly the Image File Wrapper (IFW) System; and fee information from the Fee Processing Next Generation (FPNG) System. Search is by application number (with or without the two-digit series code), control number, or Patent Cooperation Treaty (PCT) number. Private PAIR requires users to establish a USPTO.gov account, a customer number, and a password. For more information about establishing a USPTO.gov account and customer number, see https://www.uspto.gov/patents-application-process/applying-online/getting-started-new-users. Unavailable during database backups (Saturday, Tuesday, and Thursday from 04:30 - 04:45 AM U.S. Eastern Time, and Sunday from 00:01 - 04:00 AM U.S. Eastern Time). Updated daily. https://ppair-my.uspto.gov/pair/PrivatePair
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high costs and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research.
Documentation for the DORIS-MAE dataset is publicly available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. This upload contains both DORIS-MAE dataset version 1 and ada-002 vector embeddings for all queries and related abstracts (used in candidate pool creation). DORIS-MAE dataset version 1 comprises four main sub-datasets, each serving a distinct purpose.
The Query dataset contains 100 human-crafted complex queries spanning five categories: ML, NLP, CV, AI, and Composite. Each category has 20 associated queries. Queries are broken down into aspects (ranging from 3 to 9 per query) and sub-aspects (from 0 to 6 per aspect, with 0 signifying no further breakdown required). For each query, a corresponding candidate pool of relevant paper abstracts, ranging from 99 to 138, is provided.
The Corpus dataset is composed of 363,133 abstracts from computer science papers published between 2011 and 2021, sourced from arXiv. Each entry includes the title, original abstract, URL, and primary and secondary categories, as well as citation information retrieved from Semantic Scholar. A masked version of each abstract is also provided, facilitating the automated creation of queries.
The Annotation dataset includes generated annotations for all 165,144 question pairs, each comprising an aspect/sub-aspect and a corresponding paper abstract from the query's candidate pool. It includes the original text generated by ChatGPT (version chatgpt-3.5-turbo-0301) explaining its decision-making process, along with a three-level relevance score (0, 1, or 2) representing ChatGPT's final decision.
Finally, the Test Set dataset contains human annotations for a random selection of 250 question pairs used in hypothesis testing. It includes each of the three human annotators' final decisions, recorded as a three-level relevance score (0, 1, or 2).
The file "ada_embedding_for_DORIS-MAE_v1.pickle" contains text embeddings for the DORIS-MAE dataset, generated by OpenAI's ada-002 model. The structure of the file is as follows:
├── ada_embedding_for_DORIS-MAE_v1.pickle
├── "Query"
│ ├── query_id_1 (Embedding of query_1)
│ ├── query_id_2 (Embedding of query_2)
│ └── query_id_3 (Embedding of query_3)
│ .
│ .
│ .
└── "Corpus"
├── corpus_id_1 (Embedding of abstract_1)
├── corpus_id_2 (Embedding of abstract_2)
└── corpus_id_3 (Embedding of abstract_3)
.
.
.
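For orientation, here is a minimal sketch of loading the embedding file, assuming the top level is a dictionary keyed as in the tree above; the exact id formats are assumptions, so check the repository documentation.

```python
import pickle

# Load the ada-002 embedding file; top-level keys follow the tree above.
with open("ada_embedding_for_DORIS-MAE_v1.pickle", "rb") as f:
    embeddings = pickle.load(f)

query_embeddings = embeddings["Query"]    # maps query_id -> embedding vector
corpus_embeddings = embeddings["Corpus"]  # maps corpus_id -> embedding vector

# Example: inspect one entry (the id format is an assumption, see the repo docs).
some_id, vector = next(iter(query_embeddings.items()))
print(some_id, len(vector))  # ada-002 embeddings are 1536-dimensional
```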
rajapower1/Finsights-Grey-RAG-Effective-Information-Retrieval-logs dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission includes all pretrained models, test data and prediction files for the arXiv paper "Simple Applications of BERT for Ad Hoc Document Retrieval". Please follow the instructions at the Birch repo to reproduce the results.
CAIRSS is a bibliographic database of older music research literature (prior to 1993) in music education, music psychology, music therapy, and music medicine. Citations have been taken from 1,354 different journal titles, 18 of which are primary journals, meaning that every article they have ever published is included. The primary journals are:
* Arts in Psychotherapy
* Bulletin of the Council for Research in Music Education
* Bulletin of the National Association for Music Therapy
* Contributions to Music Education
* Hospital Music Newsletter
* International Journal of Arts Medicine
* Journal of the Association for Music and Imagery
* Journal of Music Teacher Education
* Journal of Music Therapy
* Journal of Research in Music Education
* Medical Problems of Performing Artists
* Music Perception
* Music Therapy
* Music Therapy Perspectives
* Psychology of Music
* Psychomusicology
* The Quarterly
* Applications of Research to Music Education
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of information retrieval benchmarks covering 15 corpora (1.9 billion documents) on which 32 well-known shared tasks are based. We filled the leaderboards with Docker images of 50 standard retrieval approaches. Within this setup, we were able to automatically run and evaluate the 50 approaches on the 32 tasks (1,600 runs). All benchmarks are added as training datasets because their qrels are already publicly available. A detailed tutorial on how to submit approaches is available on GitHub.
View on TIRA: https://tira.io/task-overview/ir-benchmarks
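Since the qrels for these benchmarks are public, they can be inspected directly. As an illustration only (not part of TIRA itself), here is a sketch using the community ir_datasets package; the dataset id shown is an example and not necessarily one of the 32 tasks.

```python
import ir_datasets

# Load a public benchmark by id; "antique/test" is just an example id.
dataset = ir_datasets.load("antique/test")

# Iterate the publicly available relevance judgments (qrels).
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break  # print only the first judgment
```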
Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluation forums for more than twenty years, but recent advances in the application of deep learning to information retrieval (IR) warrant a new, large-scale effort that will enable exploration of classical and modern IR techniques for this task.
This dataset contains information about different companies. There are 41 text files, each containing information about a different company, and each file has more than 500 words. The dataset can be used to test information retrieval models and other NLP-based models.
Dataset Card for "vietnamese-retrieval"
More Information needed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personalization in Information Retrieval is a long-studied topic. Nevertheless, there is still a lack of high-quality, real-world datasets for conducting large-scale experiments and evaluating models for personalized search. This paper contributes to filling this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new resource for designing and evaluating personalized models related to the two tasks of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with both questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments show that SE-PQA is well suited to training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the robustness and generalization benefits of combining data from multiple communities for personalization purposes.
Performance on all communities separately:
| Community | Model (BM25 +) | P@1 | NDCG@3 | NDCG@10 | R@100 | MAP@100 | λ |
|---|---|---|---|---|---|---|---|
| Academia | MiniLM | 0.438 | 0.382 | 0.395 | 0.489 | 0.344 | (.1,.9) |
|  | MiniLM + TAG | 0.453 | 0.392 | 0.403 | 0.489 | 0.352 | (.1,.8,.1) |
| Anime | MiniLM + TAG | 0.650 | 0.682 | 0.714 | 0.856 | 0.683 | (.1,.9,.0) |
| Apple | MiniLM | 0.327 | 0.351 | 0.381 | 0.514 | 0.349 | (.1,.9) |
|  | MiniLM + TAG | 0.335 | 0.361 | 0.389 | 0.514 | 0.357 | (.1,.8,.1) |
| Bicycles | MiniLM | 0.405 | 0.380 | 0.421 | 0.600 | 0.365 | (.1,.9) |
|  | MiniLM + TAG | 0.436 | 0.405 | 0.441 | 0.600 | 0.386 | (.1,.8,.1) |
| Boardgames | MiniLM | 0.681 | 0.694 | 0.728 | 0.866 | 0.692 | (.1,.9) |
|  | MiniLM + TAG | 0.696 | 0.702 | 0.736 | 0.866 | 0.699 | (.1,.8,.1) |
| Buddhism | MiniLM + TAG | 0.490 | 0.387 | 0.397 | 0.544 | 0.334 | (.3,.7,.0) |
| Christianity | MiniLM | 0.534 | 0.505 | 0.555 | 0.783 | 0.497 | (.2,.8) |
|  | MiniLM + TAG | 0.549 | 0.521 | 0.564 | 0.783 | 0.507 | (.1,.8,.1) |
| Cooking | MiniLM | 0.600 | 0.567 | 0.600 | 0.719 | 0.553 | (.1,.9) |
|  | MiniLM + TAG | 0.619 | 0.583 | 0.614 | 0.719 | 0.568 | (.1,.8,.1) |
| DIY | MiniLM | 0.323 | 0.313 | 0.346 | 0.501 | 0.302 | (.1,.9) |
|  | MiniLM + TAG | 0.335 | 0.324 | 0.356 | 0.501 | 0.312 | (.1,.8,.1) |
| Expatriates | MiniLM + TAG | 0.596 | 0.653 | 0.682 | 0.832 | 0.645 | (.1,.9,.0) |
| Fitness | MiniLM + TAG | 0.568 | 0.575 | 0.613 | 0.760 | 0.567 | (.2,.8,.0) |
| Freelancing | MiniLM + TAG | 0.513 | 0.472 | 0.506 | 0.654 | 0.457 | (.1,.9,.0) |
| Gaming | MiniLM | 0.510 | 0.534 | 0.562 | 0.686 | 0.532 | (.1,.9) |
|  | MiniLM + TAG | 0.519 | 0.547 | 0.571 | 0.686 | 0.541 | (.1,.8,.1) |
| Gardening | MiniLM | 0.344 | 0.362 | 0.396 | 0.520 | 0.359 | (.1,.9) |
|  | MiniLM + TAG | 0.345 | 0.369 | 0.399 | 0.520 | 0.363 | (.1,.8,.1) |
| Genealogy | MiniLM + TAG | 0.592 | 0.605 | 0.631 | 0.779 | 0.594 | (.3,.7,.0) |
| Health | MiniLM + TAG | 0.718 | 0.765 | 0.797 | 0.934 | 0.765 | (.2,.8,.0) |
| Hermeneutics | MiniLM | 0.589 | 0.538 | 0.593 | 0.828 | 0.526 | (.2,.8) |
|  | MiniLM + TAG | 0.632 | 0.570 | 0.617 | 0.828 | 0.552 | (.1,.8,.1) |
| Hinduism | MiniLM | 0.388 | 0.415 | 0.459 | 0.686 | 0.416 | (.2,.8) |
|  | MiniLM + TAG | 0.382 | 0.410 | 0.457 | 0.686 | 0.412 | (.1,.8,.1) |
| History | MiniLM + TAG | 0.740 | 0.735 | 0.764 | 0.862 | 0.730 | (.2,.8,.0) |
| Hsm | MiniLM + TAG | 0.666 | 0.707 | 0.737 | 0.870 | 0.690 | (.2,.8,.0) |
| Interpersonal | MiniLM + TAG | 0.663 | 0.617 | 0.653 | 0.739 | 0.604 | (.2,.8,.0) |
| Islam | MiniLM | 0.382 | 0.412 | 0.453 | 0.642 | 0.410 | (.1,.9) |
|  | MiniLM + TAG | 0.395 | 0.427 | 0.464 | 0.642 | 0.421 | (.1,.8,.1) |
| Judaism | MiniLM + TAG | 0.363 | 0.387 | 0.432 | 0.649 | 0.388 | (.2,.8,.0) |
| Law | MiniLM | 0.663 | 0.647 | 0.678 | 0.803 | 0.639 | (.2,.8) |
|  | MiniLM + TAG | 0.677 | 0.657 | 0.687 | 0.803 | 0.649 | (.1,.8,.1) |
| Lifehacks | MiniLM | 0.714 | 0.601 | 0.617 | 0.703 | 0.553 | (.1,.9) |
|  | MiniLM + TAG | 0.714 | 0.621 | 0.631 | 0.703 | 0.568 | (.1,.8,.1) |
| Linguistics | MiniLM + TAG | 0.584 | 0.588 | 0.630 | 0.794 | 0.587 | (.2,.8,.0) |
| Literature | MiniLM + TAG | 0.871 | 0.878 | 0.889 | 0.934 | 0.876 | (.3,.7,.0) |
| Martialarts | MiniLM | 0.630 | 0.599 | 0.645 | 0.796 | 0.596 | (.1,.9) |
|  | MiniLM + TAG | 0.640 | 0.628 | 0.660 | 0.796 | 0.612 | (.1,.8,.1) |
| Money | MiniLM | 0.545 | 0.535 | 0.563 | 0.706 | 0.515 | (.2,.8) |
|  | MiniLM + TAG | 0.559 | 0.542 | 0.571 | 0.706 | 0.523 | (.1,.8,.1) |
| Movies | MiniLM | 0.713 | 0.722 | 0.753 | 0.865 | 0.724 | (.1,.9) |
|  | MiniLM + TAG | 0.728 | 0.735 | 0.762 | 0.865 | 0.735 | (.1,.8,.1) |
| Music | MiniLM | 0.508 | 0.447 | 0.476 | 0.602 | 0.418 | (.2,.8) |
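The λ column lists interpolation weights: two for BM25 + MiniLM runs, three when the TAG personalization score is added. A minimal sketch of that convex score combination follows; the roles of the weights are an inference from the table, so see the SE-PQA baseline code for the exact fusion.

```python
def fuse_scores(bm25, minilm, tag=None, lambdas=(0.1, 0.8, 0.1)):
    """Convex combination of per-answer scores, as suggested by the lambda
    column above. Component scores should be normalized to a common scale
    before mixing."""
    if tag is None:  # two-component runs, e.g. (.1, .9)
        l1, l2 = lambdas[:2]
        return l1 * bm25 + l2 * minilm
    l1, l2, l3 = lambdas
    return l1 * bm25 + l2 * minilm + l3 * tag

# Example: re-score one candidate answer with illustrative (normalized) scores.
print(fuse_scores(bm25=0.62, minilm=0.87, tag=0.40, lambdas=(0.1, 0.8, 0.1)))
```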
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, the dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summaries.
This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. Additionally, the dataset is accompanied by code facilitating question-and-answer pair generation from any medical or non-medical text.
Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens, owing to hardware constraints. The nature of large language model generation may introduce underlying bias or a lack of diversity and complexity in the question-and-answer pairs. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures, as well as the employment of more powerful large language models.
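For illustration only, here is a minimal sketch of the prompting pattern described above; the prompt wording, token-budget approximation, and function names are assumptions, not the released code.

```python
# Hypothetical sketch of the described pipeline: build a prompt that asks an
# instruction-tuned LLM for Q&A pairs grounded in a discharge summary.
PROMPT_TEMPLATE = (
    "Below is a hospital discharge summary. Write question-and-answer pairs "
    "that can be answered only from the information in the summary.\n\n"
    "Summary:\n{summary}\n\nQ&A pairs:"
)

def make_qa_prompt(summary: str, max_tokens: int = 6000) -> str:
    # The authors capped input at 6000 tokens; we approximate with a word
    # count here, whereas the real pipeline would use the model tokenizer.
    truncated = " ".join(summary.split()[:max_tokens])
    return PROMPT_TEMPLATE.format(summary=truncated)

print(make_qa_prompt("Patient admitted with chest pain. Discharged on aspirin."))
```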
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Based on the Baidu STI dataset, we conducted a comparative study of the cost-effectiveness of classical computational linguistics methods versus large language models. This dataset discloses the data underlying the study, including the original corpus and the BM25 sequence we calculated.
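For context, a minimal sketch of computing BM25 rankings like the released sequence, using the rank_bm25 package; the corpus and tokenization here are placeholders, not the Baidu STI data or the study's exact configuration.

```python
from rank_bm25 import BM25Okapi

# Placeholder corpus; the released data is computed over the Baidu STI corpus.
corpus = ["deep learning for text ranking",
          "classical bm25 term weighting",
          "large language models for retrieval"]
tokenized = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized)
query = "bm25 ranking".split()
scores = bm25.get_scores(query)  # one BM25 score per document
print(scores)  # sort document indices by score to obtain the ranking
```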
https://www.marketreportanalytics.com/privacy-policy
The global Document Management and Retrieval System (DMRS) market is experiencing robust growth, driven by the increasing need for efficient information management across diverse sectors. The rising volume of digital documents, coupled with stringent regulatory compliance requirements and the growing adoption of cloud-based solutions, are key factors fueling market expansion. Academic institutions, corporations, and the public sector are increasingly relying on DMRS to streamline workflows, enhance collaboration, and ensure data security. The market is segmented by application (Academic, Corporate, Public Sector) and type (Cloud-based, On-premises), with cloud-based solutions gaining significant traction due to their scalability, accessibility, and cost-effectiveness. Key players like Clarivate, Elsevier, and Digital Science are driving innovation through continuous product development and strategic partnerships. While the on-premises segment retains a presence, the shift towards cloud-based solutions is anticipated to continue, driven by the benefits of remote access and reduced infrastructure costs. Regional variations exist, with North America and Europe currently holding significant market shares, although Asia-Pacific is projected to witness substantial growth in the coming years, fueled by increasing digitalization and technological advancements. The competitive landscape is characterized by both established players and emerging companies offering specialized solutions, making for a dynamic market with a focus on continuous improvement and innovation.
The forecast period (2025-2033) anticipates sustained growth, propelled by technological advancements like AI-powered search and retrieval capabilities, improved integration with other business applications, and the increasing demand for robust security features. The market is expected to consolidate somewhat, with larger players potentially acquiring smaller firms to expand their product portfolios and market reach. Despite the strong growth outlook, challenges remain, including data security concerns, integration complexities, and the need for user-friendly interfaces. Addressing these concerns through continuous innovation and user-centric design will be crucial for sustained market success. The market is expected to witness a gradual shift towards more sophisticated and integrated DMRS solutions, catering to the evolving needs of diverse user groups.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of the AILA (Artificial Intelligence for Legal Assistance) Track at FIRE 2019
Track website : https://sites.google.com/view/fire-2019-aila/
Conference website : http://fire.irsi.res.in/fire/2019/home
In countries following the Common Law system (e.g., UK, USA, Canada, Australia, India), there are two primary sources of law – Statutes (established laws) and Precedents (prior cases). Statutes deal with applying legal principles to a situation (facts / scenario / circumstances which lead to filing the case). Precedents or prior cases help a lawyer understand how the Court has dealt with similar scenarios in the past, and prepare the legal reasoning accordingly.
When a lawyer is presented with a situation (one that will potentially lead to the filing of a case), it would be very beneficial to have an automatic system that identifies a set of related prior cases involving similar situations, as well as the statutes/acts best suited to the purpose in the given situation. Such a system would not only help a lawyer but also benefit a layperson by providing a preliminary understanding even before he/she approaches a lawyer. It would assist him/her in identifying where the legal problem fits, what legal actions can be pursued (through statutes), and what the outcomes of similar cases were (through precedents).
Motivated by the above scenario, we propose two tasks here:
Task Description:
You will be given a set of 50 queries, each of which describes a situation.
Task 1: Identifying relevant prior cases
We provide ~3000 documents of cases that were judged in the Supreme Court of India. For each query, the task is to retrieve the most similar/relevant case document with respect to the situation described in the query.
Task 2: Identifying relevant statutes
We have identified a set of 197 statutes (Sections of Acts) from Indian law that are relevant to some of the queries. We provide the title and description of these statutes. For each query, the task is to identify the most relevant statutes (from among the 197 statutes). Note that the task can be modelled either as an unsupervised retrieval task (where you search for relevant statutes) or as a supervised classification task (e.g., predicting for each statute whether it is relevant); a sketch of the retrieval formulation follows. For the latter, the case documents provided for Task 1 can be utilised. However, if a team wishes to apply supervised models, it is their responsibility to create the necessary training data.
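As an illustration of the unsupervised retrieval formulation (not an official baseline of the track), here is a minimal TF-IDF sketch that ranks statute descriptions against a query situation; the texts shown are toy placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy statutes and query; in the track there are 197 statute descriptions
# and 50 query situations.
statutes = ["Punishment for theft of movable property.",
            "Right to equality before the law.",
            "Abetment of an offence by instigation."]
queries = ["The accused took the victim's property without consent."]

vectorizer = TfidfVectorizer(stop_words="english")
statute_vecs = vectorizer.fit_transform(statutes)
query_vecs = vectorizer.transform(queries)

# Rank statutes for each query by cosine similarity, highest first.
scores = cosine_similarity(query_vecs, statute_vecs)
ranking = scores.argsort(axis=1)[:, ::-1]
print(ranking[0])  # statute indices, most relevant first
```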
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the results of the experiments that I ran for my master's thesis. The full code (and more) can be found at https://github.com/dimitris93/msc-thesis
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a copy of the RELISH dataset used in the paper "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity" by Sheshera Mysore, Arman Cohan, and Tom Hope. The RELISH dataset was first introduced in Brown et al. 2019. See further details of the paper, how this dataset was compiled, and how it was used at https://github.com/allenai/aspire. The contents of the dataset are as follows:
abstracts-relish.jsonl: jsonl file containing the paper-id, abstract, and title for the queries and candidates which are part of the dataset.
relish-queries-release.csv: metadata associated with every query.
test-pid2anns-relish.json: JSON file with the query paper-id and candidate paper-ids for every query paper in the dataset. Use this file in conjunction with abstracts-relish.jsonl to generate files for use in model evaluation.
relish-evaluation_splits.json: paper-ids for the splits to use in reporting evaluation numbers.
aspire/src/evaluation/ranking_eval.py, included in the GitHub repo accompanying this dataset, implements the evaluation protocol and computes evaluation metrics. Please see the paper for a description of the experimental protocol we recommend for reporting evaluation metrics.
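A minimal sketch of joining the two main files above; the exact field names inside the jsonl records are assumptions, so verify them against the repository documentation.

```python
import json

# Read abstracts (jsonl: one paper per line with paper-id, title, abstract).
papers = {}
with open("abstracts-relish.jsonl") as f:
    for line in f:
        record = json.loads(line)
        papers[record["paper_id"]] = record  # field name is an assumption

# Map each query paper-id to its candidate paper-ids.
with open("test-pid2anns-relish.json") as f:
    pid2anns = json.load(f)

for query_pid, candidates in pid2anns.items():
    print(query_pid, len(candidates))  # one query and its candidate pool size
    break
```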
trumancai/perspective-information-retrieval-perspectrum dataset hosted on Hugging Face and contributed by the HF Datasets community
Tuning of the information retrieval parameters for the Question-Answering task.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Protecting Data Privacy in Private Information Retrieval Schemes".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Patterns of Scholarly Communication in Global Information Retrieval Research: A bibliometric analysis (1954-2021)