The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment.
RAFT is a few-shot classification benchmark that tests language models:
- across multiple domains (literature reviews, medical data, tweets, customer interaction, etc.)
- on economically valuable classification tasks (someone inherently cares about the task)
- with evaluation that mirrors deployment (50 labeled examples per task, information retrieval allowed, hidden test set)
Description from: https://raft.elicit.org/
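For readers who want to inspect the data, below is a minimal sketch of loading one RAFT task with the Hugging Face `datasets` library. It assumes the benchmark is hosted on the Hub under `ought/raft` and uses `tweet_eval_hate` as an example task configuration; the 50-example train split carries gold labels, while the test split is the hidden evaluation set.

```python
# Minimal sketch: loading one RAFT task via the Hugging Face Hub.
# Assumes the benchmark lives under the "ought/raft" repository and
# that "tweet_eval_hate" is one of its task configurations.
from datasets import load_dataset

# Each RAFT task ships with exactly 50 labeled training examples.
train = load_dataset("ought/raft", "tweet_eval_hate", split="train")
print(len(train))  # expected: 50

# The test split mirrors deployment: inputs are provided but gold
# labels are withheld (label fields hold a placeholder value), so
# predictions must be scored against the private test set.
test = load_dataset("ought/raft", "tweet_eval_hate", split="test")
print(test[0])
```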