There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection, Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conference) programme, cover a range of subprogrammes. These focus on different tasks, each requiring its own evaluation design.
Within TIDES, and among other researchers interested in document understanding, a group grew up which has been focusing on summarization and the evaluation of summarization systems. Part of the initial evaluation for TIDES called for a workshop to be held in the fall of 2000 to explore different ways of summarizing a common set of documents. Additionally, a roadmapping effort was started in March 2000 to lay plans for a long-term evaluation effort in summarization.
Out of the initial workshop and the roadmapping effort has grown a continuing evaluation in the area of text summarization called the Document Understanding Conferences (DUC). Sponsored by the Advanced Research and Development Activity (ARDA), the conference series is run by the National Institute of Standards and Technology (NIST) to further progress in summarization and enable researchers to participate in large-scale experiments.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec). This dataset includes the document embeddings of the full DUC 2002 corpus in the following configurations:

* Sentence embeddings
* Document embeddings
* Document-set embeddings

It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity". To obtain the original DUC 2002 dataset, please consult the official site.
OpenAsp Dataset

OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.
Dataset Access

To generate OpenAsp, you need access to the DUC dataset from which OpenAsp is derived.
Steps:
* Grant access to the DUC dataset by following the NIST instructions here.
  * You should receive two user-password pairs (for DUC01-02 and DUC06-07).
  * You should receive a file named fwdrequestingducdata.zip.
* Clone this repository by running the following command:

  ```bash
  git clone https://github.com/liatschiff/OpenAsp.git
  ```

* Optionally create a conda or virtualenv environment:
```bash
conda create -n openasp 'python>3.10,<3.11'
conda activate openasp
```
Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy):
```bash
pip install -r requirements.txt
```
Copy fwdrequestingducdata.zip into the OpenAsp repo directory.

Run the prepare script command:
```bash
python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
```
Load the dataset using Hugging Face datasets:
```python
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset

openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

data_files = {
    os.path.basename(fname).split('.')[0]: fname
    for fname in glob(openasp_files)
}

# decompress each .jsonl.gz file next to the original
for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]

# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)

# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]

    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n{summary}')
    print('\ninput documents:\n')
    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')
    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n')
```
Troubleshooting

* Dataset fails to load with load_dataset() - you may want to delete the Hugging Face datasets cache folder.
* 401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
* Dataset is created but prints a warning about content verification - you may be using a different version of NLTK or the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
* IndexError: list index out of range - similar to the previous item; try reinstalling the requirements with the exact package versions.
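Two of the issues above come down to installed packages drifting from the pins in requirements.txt. The sketch below is a small hypothetical helper (not part of the OpenAsp repo) for spotting such drift; it only handles plain `package==version` pins and skips any other requirement syntax.

```python
# Hypothetical helper: compare installed package versions against
# the exact `package==version` pins in a requirements file.
from importlib import metadata


def check_pins(requirements_path="requirements.txt"):
    """Return a list of (name, pinned_version, installed_version_or_None)
    for every pinned package that does not match its pin."""
    mismatches = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            # skip blanks, comments, and non-exact specifiers
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, want = line.split("==", 1)
            name, want = name.strip(), want.strip()
            try:
                have = metadata.version(name)
            except metadata.PackageNotFoundError:
                have = None  # pinned package is not installed at all
            if have != want:
                mismatches.append((name, want, have))
    return mismatches
```

Running `check_pins()` inside the OpenAsp directory and reinstalling any reported packages should resolve the version-related warnings.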
Under The Hood

The prepare_openasp_dataset.py script downloads the DUC and Multi-News source files, uses the sacrerouge package to prepare the datasets, and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
License

This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the Apache license.

The OpenAsp summaries and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses: the Multi-News license and the DUC license.