2 datasets found
  1. Complex Document Information Processing (CDIP) dataset | gimi9.com

    • gimi9.com
    Updated Jan 1, 1996
    Cite
    (1996). Complex Document Information Processing (CDIP) dataset | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_complex-document-information-processing-cdip-dataset/
    Description

    This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing", and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents carry handwriting, stains, and similar defects. TREC used an automatic text conversion of this dataset in the TREC Legal Track, and we also have the original TIFF scans of the documents. The dataset consists of around 7 million documents, preprocessed with 90s-era OCR, together with the original page scans in TIFF format. See contact information in this record for access to this dataset.

  2. Complex Document Information Processing (CDIP) dataset

    • data.nist.gov
    • catalog.data.gov
    Updated Feb 4, 2022
    Cite
    Ian Soboroff (2022). Complex Document Information Processing (CDIP) dataset [Dataset]. http://doi.org/10.18434/mds2-2531
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Ian Soboroff
    License

    https://www.nist.gov/open/license

    Description

    This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing", and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents carry handwriting, stains, and similar defects. TREC used an automatic text conversion of this dataset in the TREC Legal Track, and we also have the original TIFF scans of the documents. The dataset consists of around 7 million documents, preprocessed with 90s-era OCR, together with the original page scans in TIFF format. See contact information in this record for access to this dataset.
