Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/scifact.
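As a quick orientation, the SciFact split of BEIR can be pulled straight from the Hub; a minimal sketch, assuming the usual BeIR config names `corpus` and `queries`:

```python
from datasets import load_dataset

# Corpus (scientific paper abstracts) and queries (claims) for BEIR's SciFact split;
# the "corpus"/"queries" config names follow the standard BeIR layout.
corpus = load_dataset("BeIR/scifact", "corpus")
queries = load_dataset("BeIR/scifact", "queries")

print(corpus)
print(queries)
```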
https://choosealicense.com/licenses/unknown/
SciFact: an MTEB dataset (Massive Text Embedding Benchmark)
SciFact verifies scientific claims against evidence from the research literature, drawn from a corpus of scientific paper abstracts.
Task category: t2t
Domains: Academic, Medical, Written
Reference: https://github.com/allenai/scifact
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

```python
import mteb

tasks = mteb.get_tasks(tasks=["SciFact"])
evaluator = mteb.MTEB(tasks=tasks)
```

… See the full description on the dataset page: https://huggingface.co/datasets/mteb/scifact.
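Since the card's snippet is cut off, here is a hedged end-to-end sketch; the model name is only an illustrative choice, not part of the card:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any model mteb supports works here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

tasks = mteb.get_tasks(tasks=["SciFact"])
evaluator = mteb.MTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results/scifact")
```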
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
This config connects the claims to the evidence and document ids.
pa-shk/scifact dataset hosted on Hugging Face and contributed by the HF Datasets community
Data Description
Homepage: https://github.com/KID-22/Cocktail
Repository: https://github.com/KID-22/Cocktail
Paper: [Needs More Information]
Dataset Summary
All 16 benchmarked datasets in Cocktail are listed in the following table.

| Dataset | Raw Website | Cocktail Website | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | #Queries | #Corpus |
|---|---|---|---|---|---|---|---|---|
| MS MARCO | Homepage | Homepage | msmarco | 985926f3e906fadf0dc6249f23ed850f | Misc. | Binary | 6,979 | 542,203 |
| DL19 | … |

See the full description on the dataset page: https://huggingface.co/datasets/IR-Cocktail/scifact.
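Since the table publishes an md5 checksum for each processed dump, a small verification helper may be useful; a minimal sketch, with a hypothetical local file name:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large dumps need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical download path; compare against the checksum from the table.
assert md5sum("msmarco.zip") == "985926f3e906fadf0dc6249f23ed850f"
```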
scifact.pisa
Description
A PISA index for the SciFact dataset
Usage
```python
import pyterrier as pt

index = pt.Artifact.from_hf('pyterrier/scifact.pisa')
index.bm25()  # returns a BM25 retriever
```
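A short usage sketch on top of that retriever; the query string is illustrative, and `.search()` is the standard PyTerrier transformer helper:

```python
import pyterrier as pt

index = pt.Artifact.from_hf('pyterrier/scifact.pisa')
bm25 = index.bm25()

# Returns a ranked DataFrame of docnos with BM25 scores for one query.
results = bm25.search("vitamin D deficiency and bone health")
print(results.head())
```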
Benchmarks
| name | nDCG@10 | R@1000 |
|------|---------|--------|
| bm25 | 0.6776 | 0.9733 |
| dph | 0.6735 | 0.97 |
Reproduction
```python
import pyterrier as pt
from tqdm import tqdm
import ir_datasets
from pyterrier_pisa import PisaIndex

index = PisaIndex("scifact.pisa"…
```

See the full description on the dataset page: https://huggingface.co/datasets/pyterrier/scifact.pisa.
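The reproduction snippet above is truncated on the card; a hedged reconstruction of the usual pyterrier_pisa indexing flow follows (the docno/text field mapping is an assumption, not the card's verbatim code):

```python
import ir_datasets
from pyterrier_pisa import PisaIndex

index = PisaIndex("scifact.pisa")
dataset = ir_datasets.load("beir/scifact")

# Map ir_datasets records into the docno/text dicts PisaIndex expects;
# concatenating title and abstract is a common (assumed) choice.
index.index(
    {"docno": d.doc_id, "text": f"{d.title} {d.text}"}
    for d in dataset.docs_iter()
)
```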
gurnoor-ctx/dummy-scifact dataset hosted on Hugging Face and contributed by the HF Datasets community
Data Stats
206 claims
500k distractors
Data Structure
Test
claim
evidence: GT evidence
evidence_id: GT evidence id
label: GT label
evidences: list of all evidences
evidence_ids: list of all evidence ids
labels: list of all labels
Distractors
evidence
evidence_id
Process Code
```python
import pandas as pd
from datasets import Dataset

claims = pd.read_csv("./scifact_open_retriever_test.csv")
claims.head()

docs = …
```

See the full description on the dataset page: https://huggingface.co/datasets/umbc-scify/scifact-open.
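The processing code is cut off; a hedged sketch of how the pieces plausibly fit together, where the distractor file name and its columns are hypothetical (mirroring the Distractors structure above):

```python
import pandas as pd
from datasets import Dataset

claims = pd.read_csv("./scifact_open_retriever_test.csv")

# Hypothetical distractor dump with `evidence` / `evidence_id` columns.
docs = pd.read_csv("./scifact_open_distractors.csv")

test_ds = Dataset.from_pandas(claims)
distractor_ds = Dataset.from_pandas(docs)
print(test_ds)
print(distractor_ds)
```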
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

```python
import mteb

tasks = mteb.get_tasks(tasks=["Scifact-VN"])
evaluator = mteb.MTEB(tasks=tasks)

model = mteb.get_model(YOUR_MODEL)
evaluator.run(model)
```
To learn more about how to run models on MTEB tasks, check out the GitHub repository.
Citation
If you use this dataset, please cite the dataset as well as mteb, as this dataset likely includes additional processing as a part of… See the full description on the dataset page: https://huggingface.co/datasets/GreenNode/scifact-vn.
MCINext/scifact-fa-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
trmteb/scifact-tr dataset hosted on Hugging Face and contributed by the HF Datasets community
franciellevargas/SciFact dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
kaengreg/rus-scifact dataset hosted on Hugging Face and contributed by the HF Datasets community
jasper-xian/splade-scifact-train-retrievals dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for beir/scifact
The beir/scifact dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
docs (documents, i.e., the corpus); count=5,183
queries (i.e., topics); count=1,109
This dataset is used by: beir_scifact_test, beir_scifact_train
Usage
```python
from datasets import load_dataset

docs = load_dataset('irds/beir_scifact', 'docs')
for record in docs:
    record # …
```

See the full description on the dataset page: https://huggingface.co/datasets/irds/beir_scifact.
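A hedged sketch of inspecting both configs; the field names follow the usual ir-datasets BEIR schema and are an assumption here:

```python
from datasets import load_dataset

docs = load_dataset('irds/beir_scifact', 'docs')
queries = load_dataset('irds/beir_scifact', 'queries')

# Peek at one record from each config (doc_id/title and query_id/text are assumed fields).
for record in docs:
    print(record['doc_id'], record['title'][:60])
    break
for record in queries:
    print(record['query_id'], record['text'][:60])
    break
```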
nthakur/gpl-scifact dataset hosted on Hugging Face and contributed by the HF Datasets community
scifact.splade-v3.cache
Description
An indexer cache (lz4pickle format, per the metadata below) for SPLADE-v3 on the SciFact dataset.
Usage
```python
import pyterrier_alpha as pta

artifact = pta.Artifact.from_hf('pyterrier/scifact.splade-v3.cache')
```
Benchmarks
TODO: Provide benchmarks for the artifact.
Reproduction
Metadata
{ "type": "indexer_cache", "format": "lz4pickle", "package_hint":โฆ See the full description on the dataset page: https://huggingface.co/datasets/pyterrier/scifact.splade-v3.cache.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR-NL Benchmark
Dataset Summary
BEIR-NL is a Dutch-translated version of the BEIR benchmark, a diverse and heterogeneous collection of datasets covering various domains from biomedical and financial texts to general web content. Our benchmark is integrated into the Massive Multilingual Text Embedding Benchmark (MMTEB). BEIR-NL contains the following tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
… See the full description on the dataset page: https://huggingface.co/datasets/clips/beir-nl-scifact.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SciFact: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in SciFact in the BEIR benchmark (corpus.jsonl)
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
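For intuition, a hedged sketch of how top-20 queries are typically sampled from this docT5query checkpoint; the passage and generation parameters are illustrative, not the exact settings of the linked script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

passage = "Vitamin D supplementation reduces fracture risk in older adults."  # illustrative
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)

# Sample 20 synthetic queries per passage, matching "Questions generated: 20".
with torch.no_grad():
    outputs = model.generate(
        **inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=20
    )
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```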
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/scifact-top-20-gen-queries.
vaibhavad/sheared-llama-scifact-results-new dataset hosted on Hugging Face and contributed by the HF Datasets community