2 datasets found

Complex Document Information Processing (CDIP) dataset | gimi9.com
gimi9.com
Updated Jan 1, 1996
(1996). Complex Document Information Processing (CDIP) dataset | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_complex-document-information-processing-cdip-dataset/
Description
This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing" and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents contain handwriting, stains, and similar defects. TREC used an automatic text conversion of this dataset in the TREC Legal Track, and we also have the original TIFF scans of the documents. The dataset consists of around 7 million documents, preprocessed with 1990s-era OCR, together with the original page scans in TIFF format. See contact information in this record for access to this dataset.
Complex Document Information Processing (CDIP) dataset
data.nist.gov
catalog.data.gov
Updated Feb 4, 2022
Ian Soboroff (2022). Complex Document Information Processing (CDIP) dataset [Dataset]. http://doi.org/10.18434/mds2-2531
Unique identifier
https://doi.org/10.18434/mds2-2531, https://identifiers.org/ark:/88434/mds2-2531
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Authors
Ian Soboroff
License
https://www.nist.gov/open/license
Description
This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing" and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents contain handwriting, stains, and similar defects. TREC used an automatic text conversion of this dataset in the TREC Legal Track, and we also have the original TIFF scans of the documents. The dataset consists of around 7 million documents, preprocessed with 1990s-era OCR, together with the original page scans in TIFF format. See contact information in this record for access to this dataset.