There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection, Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conference) programme, cover a range of subprogrammes. These focus on different tasks, each requiring its own evaluation design.
Within TIDES, and among other researchers interested in document understanding, a group grew up which has been focusing on summarization and the evaluation of summarization systems. Part of the initial evaluation for TIDES called for a workshop to be held in the fall of 2000 to explore different ways of summarizing a common set of documents. Additionally, a roadmapping effort was started in March 2000 to lay plans for a long-term evaluation effort in summarization.
Out of the initial workshop and the roadmapping effort has grown a continuing evaluation in the area of text summarization called the Document Understanding Conferences (DUC). Sponsored by the Advanced Research and Development Activity (ARDA), the conference series is run by the National Institute of Standards and Technology (NIST) to further progress in summarization and enable researchers to participate in large-scale experiments.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec). This dataset includes the document embeddings of the full DUC 2002 corpus in the following configurations:

* Sentence embeddings
* Document embeddings
* Document-set embeddings

It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity". To obtain the original DUC 2002 dataset, please consult the official site.
OpenAsp Dataset

OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.
Dataset Access

To generate OpenAsp, you need access to the DUC dataset from which OpenAsp is derived.
Steps:
* Grant access to the DUC dataset by following the NIST instructions here.
  * You should receive two user-password pairs (for DUC01-02 and DUC06-07).
  * You should receive a file named fwdrequestingducdata.zip.
* Clone this repository by running the following command:

  ```bash
  git clone https://github.com/liatschiff/OpenAsp.git
  ```

* Optionally create a conda or virtualenv environment:
```bash
conda create -n openasp 'python>3.10,<3.11'
conda activate openasp
```
Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy):
```bash
pip install -r requirements.txt
```
Copy fwdrequestingducdata.zip into the OpenAsp repo directory.

Run the prepare script command:
```bash
python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
```
Load the dataset using Hugging Face datasets:
```python
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset

openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

data_files = {
    os.path.basename(fname).split('.')[0]: fname
    for fname in glob(openasp_files)
}

# decompress each .jsonl.gz file next to the original
for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]

# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)

# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]

    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n{summary}')
    print('\ninput documents:\n')
    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')
    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n')
```
Troubleshooting

* Dataset fails to load with load_dataset() - you may want to delete the Hugging Face datasets cache folder.
* 401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
* Dataset is created but prints a warning about content verification - you may be using a different version of NLTK or the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
* IndexError: list index out of range - similar to the previous item; try reinstalling the requirements with the exact package versions.
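Two of the issues above come down to installed packages drifting from the pins in requirements.txt. The sketch below is a small hypothetical helper (not part of the OpenAsp repo) for spotting such drift; it only handles plain `package==version` pins and skips any other requirement syntax.

```python
# Hypothetical helper: compare installed package versions against
# the exact `package==version` pins in a requirements file.
from importlib import metadata


def check_pins(requirements_path="requirements.txt"):
    """Return a list of (name, pinned_version, installed_version_or_None)
    for every pinned package that does not match its pin."""
    mismatches = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            # skip blanks, comments, and non-exact specifiers
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, want = line.split("==", 1)
            name, want = name.strip(), want.strip()
            try:
                have = metadata.version(name)
            except metadata.PackageNotFoundError:
                have = None  # pinned package is not installed at all
            if have != want:
                mismatches.append((name, want, have))
    return mismatches
```

Running `check_pins()` inside the OpenAsp directory and reinstalling any reported packages should resolve the version-related warnings.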
Under The Hood

The prepare_openasp_dataset.py script downloads the DUC and Multi-News source files, uses the sacrerouge package to prepare the datasets, and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
License

This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the Apache license.

The OpenAsp summaries and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses: the Multi-News license and the DUC license.