3 datasets found
  1. DUC 2007 Dataset

    Document Understanding Conferences

    • paperswithcode.com
    Updated Jul 30, 2021
    Cite
    (2021). DUC 2007 Dataset [Dataset]. https://paperswithcode.com/dataset/duc-2007
    Description

    There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA, and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection, Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program, and NIST's TREC (Text REtrieval Conference) programme, cover a range of subprogrammes, each focusing on different tasks that require their own evaluation designs.

    Within TIDES, and among other researchers interested in document understanding, a group formed that has been focusing on summarization and the evaluation of summarization systems. Part of the initial evaluation for TIDES called for a workshop in the fall of 2000 to explore different ways of summarizing a common set of documents. Additionally, a roadmapping effort was started in March 2000 to lay plans for a long-term evaluation effort in summarization.

    Out of the initial workshop and the roadmapping effort has grown a continuing evaluation in the area of text summarization called the Document Understanding Conferences (DUC). Sponsored by the Advanced Research and Development Activity (ARDA), the conference series is run by the National Institute of Standards and Technology (NIST) to further progress in summarization and enable researchers to participate in large-scale experiments.

  2. Sentence embeddings for document sets in DUC 2002 summarization task

    • ieee-dataport.org
    Updated Jun 17, 2025
    Cite
    Hiram Calvo (2025). Sentence embeddings for document sets in DUC 2002 summarization task [Dataset]. https://ieee-dataport.org/documents/sentence-embeddings-document-sets-duc-2002-summarization-task
    Authors
    Hiram Calvo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec). This dataset includes the embeddings of the full DUC 2002 collection in the following configurations:

    • Sentence embeddings
    • Document embeddings
    • Document set embeddings

    It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity". To obtain the original DUC 2002 dataset, please consult the official site.
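    The referenced paper selects sentences by their similarity to a central embedding. As a rough illustration only (not the paper's exact method), centroid-based sentence selection over embeddings like those in this dataset might look like the following sketch, with toy vectors standing in for the real doc2vec output:

```python
import numpy as np

# Toy sentence embeddings for one document set (rows = sentences);
# the real dataset provides doc2vec vectors instead.
sentence_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
])

# The "central embedding" of the set: the mean of its sentence vectors.
centroid = sentence_embeddings.mean(axis=0)

# Cosine similarity of each sentence to the centroid.
norms = np.linalg.norm(sentence_embeddings, axis=1) * np.linalg.norm(centroid)
sims = sentence_embeddings @ centroid / norms

# The sentence most similar to the centroid is the extractive summary candidate.
best = int(np.argmax(sims))
print(best)  # index of the most central sentence
```

    With real data, each document set's sentence embeddings would be loaded from this dataset rather than hard-coded.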

  3. OpenAsp Dataset

    • paperswithcode.com
    Updated Dec 6, 2023
    Cite
    Shmuel Amar; Liat Schiff; Ori Ernst; Asi Shefer; Ori Shapira; Ido Dagan (2023). OpenAsp Dataset [Dataset]. https://paperswithcode.com/dataset/openasp
    Authors
    Shmuel Amar; Liat Schiff; Ori Ernst; Asi Shefer; Ori Shapira; Ido Dagan
    Description

    OpenAsp Dataset OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from DUC and MultiNews summarization datasets.

    Dataset Access To generate OpenAsp, you require access to the DUC dataset which OpenAsp is derived from.

    Steps:

    1. Grant access to the DUC dataset by following the NIST instructions here. You should receive two user-password pairs (for DUC01-02 and DUC06-07) and a file named fwdrequestingducdata.zip.

    2. Clone this repository by running the following command: git clone https://github.com/liatschiff/OpenAsp.git

    3. Optionally create a conda or virtualenv environment:

        conda create -n openasp 'python>3.10,<3.11'
        conda activate openasp

    4. Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy):

        pip install -r requirements.txt

    5. Copy fwdrequestingducdata.zip into the OpenAsp repo directory.

    6. Run the prepare script command:

        python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'

    7. Load the dataset using Hugging Face datasets:

    from glob import glob
    import os
    import gzip
    import shutil
    from datasets import load_dataset

    openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

    data_files = {
        os.path.basename(fname).split('.')[0]: fname
        for fname in glob(openasp_files)
    }

    # decompress the gzipped JSONL files
    for ftype, fname in data_files.copy().items():
        with gzip.open(fname, 'rb') as gz_file:
            with open(fname[:-3], 'wb') as output_file:
                shutil.copyfileobj(gz_file, output_file)
        data_files[ftype] = fname[:-3]

    # load OpenAsp as a Hugging Face dataset
    openasp = load_dataset('json', data_files=data_files)

    # print title, aspect_label, summary and documents
    # for the first sample from every split
    separator = '* ' * 60 + '*'
    for split in ['train', 'valid', 'test']:
        sample = openasp[split][0]
        title = sample['title']
        aspect_label = sample['aspect_label']
        summary = '\n'.join(sample['summary_text'])
        input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

        print(separator)
        print(f'Sample from {split} split\ntitle={title}\naspect_label={aspect_label}')
        print(f'\naspect-based summary:\n {summary}')
        print('\ninput documents:\n')
        for i, doc_txt in enumerate(input_docs_text):
            print(f'---- doc #{i} ----')
            print(doc_txt[:256] + '...')
        print(separator + '\n\n\n')

    Troubleshooting

    1. Dataset failed loading with load_dataset() - you may want to delete the Hugging Face datasets cache folder.
    2. 401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
    3. Dataset created but prints a warning about content verification - you may be using a different version of NLTK or the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
    4. IndexError: list index out of range - similar to (3); try reinstalling the requirements with the exact package versions.
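    If load_dataset() fails because of a stale cache, the datasets library stores its cache under ~/.cache/huggingface/datasets by default, overridable via the HF_DATASETS_CACHE environment variable. Clearing it might look like this sketch, assuming a default installation:

```shell
# Resolve the cache directory (HF_DATASETS_CACHE overrides the default).
CACHE_DIR="${HF_DATASETS_CACHE:-$HOME/.cache/huggingface/datasets}"

# Remove the cached datasets so load_dataset() rebuilds them from scratch.
rm -rf "$CACHE_DIR"
echo "cleared $CACHE_DIR"
```

    Note that this removes all cached datasets, not just OpenAsp.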

    Under The Hood The prepare_openasp_dataset.py script downloads the DUC and Multi-News source files, uses the sacrerouge package to prepare the datasets, and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.

    License This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the Apache license.

    The OpenAsp dataset summaries and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses: the Multi-News license and the DUC license.

  4. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
