3 datasets found

R
Table Extraction Pdf Dataset
universe.roboflow.com
zip
Updated Nov 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohamed Traore (2022). Table Extraction Pdf Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/table-extraction-pdf/model/6
Explore at:
zipAvailable download formats
Dataset updated
Nov 4, 2022
Dataset authored and provided by
Mohamed Traore
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Data Table Bounding Boxes
Description
The dataset comes from Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure - creators of CascadeTabNet.

Depending on the dataset version downloaded, the images will include annotations for 'borderless' tables, 'bordered' tables', and 'cells'. Borderless tables are those in which every cell in the table does not have a border. Bordered tables are those in which every cell in the table has a border, and the table is bordered. Cells are the individual data points within the table.

A subset of the full dataset, the ICDAR Table Cells Dataset, was extracted and imported to Roboflow to create this hosted version of the Cascade TabNet project. All the additional dataset components used in the full project are available here: All Files.

Versions:

Version 1, raw-images : 342 raw images of tables. No augmentations, preprocessing step of auto-orient was all that was added.

Version 2, tableBordersOnly-rawImages : 342 raw images of tables. This dataset version contains the same images as version 1, but with the caveat of Modify Classes being applied to omit the 'cell' class from all images (rendering these images to be apt for creating a model to detect 'borderless' tables and 'bordered' tables.

For the versions below: Preprocessing step of Resize (416by416 Fit within-white edges) was added along with more augmentations to increase the size of the training set and to make our images more uniform. Preprocessing applies to all images whereas augmentations only apply to training set images. 3. Version 3, augmented-FAST-model : 818 raw images of tables. Trained from Scratch (no transfer learning) with the "Fast" model from Roboflow Train. 3X augmentation (generated images). 4. Version 4, augmented-ACCURATE-model : 818 raw images of tables. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation. 5. Version 5, tableBordersOnly-augmented-FAST-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Fast" model from Roboflow Train. 3X augmentation. 6. Version 6, tableBordersOnly-augmented-ACCURATE-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation.

Example Image from the Datasethttps://i.imgur.com/ruizSQN.png" alt="Example Image from the Dataset">

Cascade TabNet in Actionhttps://i.imgur.com/nyn98Ue.png" alt="Cascade TabNet in Action"> CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank dataset. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset.

From the Original Authors:

If you find this work useful for your research, please cite our paper: @misc{ cascadetabnet2020, title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents}, author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure}, year={2020}, eprint={2004.12629}, archivePrefix={arXiv}, primaryClass={cs.CV} }
Paper Parts Fsod Rmrg Dataset
universe.roboflow.com
zip
Updated Jun 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roboflow 20-VL FSOD (2025). Paper Parts Fsod Rmrg Dataset [Dataset]. https://universe.roboflow.com/rf20-vl-fsod/paper-parts-fsod-rmrg/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Jun 4, 2025
Dataset provided by
Roboflow
Authors
Roboflow 20-VL FSOD
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Paper Parts Fsod Rmrg Rmrg Bounding Boxes
Description
Overview

Introduction

Object Classes

Author

Chapter

Equation

Equation Number

Figure

Figure Caption

Footnote

List of Content Heading

List of Content Text

Page Number

Paragraph

Reference Text

Section

Subsection

Subsubsection

Table

Table Caption

Table of Contents Text

Title

Introduction

This dataset is designed to annotate the structural elements of academic papers. It aims to train models to recognize different parts of a paper. Each class corresponds to a text or graphical element commonly found in papers.

Author: The name(s) of the person(s) who wrote the document.

Chapter: The major divisions within the paper, usually denoted by a number and a title.

Equation: Mathematical formulas or expressions.

Equation Number: The numeral identifiers for equations.

Figure: Visual representations like graphs or charts.

Figure Caption: Text descriptions associated with figures.

Footnote: Additional information at the bottom of the page.

List of Content Heading: The titles of content sections in a list.

List of Content Text: Descriptions or details within a list of content.

Page Number: The numeral indicating the page's position.

Paragraph: Blocks of text conveying an idea or point.

Reference Text: Citations or bibliographic information.

Section: Main headings within a chapter.

Subsection: Subheadings under a section.

Subsubsection: Further subdivisions under a subsection.

Table: Data or information arranged in rows and columns.

Table Caption: Text descriptions associated with tables.

Table of Contents Text: Entries listing sections and page numbers.

Title: The main heading or name of the paper.

Object Classes

Author

Description

Text indicating the name(s) of the author(s), typically found near the beginning of a document.

Instructions

Identify the text block containing the author names. It usually follows the title and may include affiliations. Do not include titles, affiliations or titles of sections adjacent to author names.

Chapter

Description

Indicates a major division of the document, often labeled with a number and title.

Instructions

Locate text labeled with "Chapter" followed by a number and title. Capture the entire heading, ensuring no unrelated text is included.

Equation

Description

Symbols and numbers arranged to represent a mathematical concept.

Instructions

Draw boxes around all mathematical expressions, excluding any accompanying text or numbers identifying the equations.

Equation Number

Description

Numerals used to uniquely identify equations.

Instructions

Identify numbers in parentheses next to equations. Do not include equation text or variables.

Figure

Description

Visual content such as graphs, diagrams, code or images.

Instructions

Outline the entire graphical representation. Do not include captions or any surrounding text.

Figure Caption

Description

Text providing a description or explanation above or below a figure.

Instructions

Identify the text directly associated with a figure. Ensure no unrelated figures or text are included.

Footnote

Description

Clarifications or additional details located at the bottom of a page.

Instructions

Locate text at the page's bottom that refers back to a mark or reference in the main text. Exclude any unrelated content.

List of Content Heading

Description

Headings at the list of context text, identifying its purpose or content. This may also be called a list of figures.

Instructions

Identify and label only the heading for lists in content sections. Do not include subsequent list items.

List of Content Text

Description

The detailed entries or points in a list. These often summarize all figures in the paper.

Instructions

Identify each item in a content list. Exclude list headings and any non-list content.

Page Number

Description

Numerical indication of the current page.

Instructions

Locate numbers typically positioned at the top or bottom margins. Do not include text or symbols beside the numbers.

Paragraph

Description

Blocks of text separated by spacing or indentation.

Instructions

Enclose individual text blocks that form coherent sections. Ensure each paragraph is distinguished separately.

Reference Text

Description

Bibliographic information found typically in a reference sect
R
Paper Parts Fsod Okht Dataset
universe.roboflow.com
zip
Updated Feb 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
roboflow20temp (2025). Paper Parts Fsod Okht Dataset [Dataset]. https://universe.roboflow.com/roboflow20temp/paper-parts-fsod-okht
Explore at:
zipAvailable download formats
Dataset updated
Feb 25, 2025
Dataset authored and provided by
roboflow20temp
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Paper Parts Fsod Okht Okht Bounding Boxes
Description
Overview

Introduction

Object Classes

Author

Chapter

Equation

Equation Number

Figure

Figure Caption

Footnote

List of Content Heading

List of Content Text

Page Number

Paragraph

Reference Text

Section

Subsection

Subsubsection

Table

Table Caption

Table of Contents Text

Title

Introduction

This dataset is designed to annotate the structural elements of academic papers. It aims to train models to recognize different parts of a paper. Each class corresponds to a text or graphical element commonly found in papers.

Author: The name(s) of the person(s) who wrote the document.

Chapter: The major divisions within the paper, usually denoted by a number and a title.

Equation: Mathematical formulas or expressions.

Equation Number: The numeral identifiers for equations.

Figure: Visual representations like graphs or charts.

Figure Caption: Text descriptions associated with figures.

Footnote: Additional information at the bottom of the page.

List of Content Heading: The titles of content sections in a list.

List of Content Text: Descriptions or details within a list of content.

Page Number: The numeral indicating the page's position.

Paragraph: Blocks of text conveying an idea or point.

Reference Text: Citations or bibliographic information.

Section: Main headings within a chapter.

Subsection: Subheadings under a section.

Subsubsection: Further subdivisions under a subsection.

Table: Data or information arranged in rows and columns.

Table Caption: Text descriptions associated with tables.

Table of Contents Text: Entries listing sections and page numbers.

Title: The main heading or name of the paper.

Object Classes

Author

Description

Text indicating the name(s) of the author(s), typically found near the beginning of a document.

Instructions

Identify the text block containing the author names. It usually follows the title and may include affiliations. Do not include titles or titles of sections adjacent to author names.

Chapter

Description

Indicates a major division of the document, often labeled with a number and title.

Instructions

Locate text labeled with "Chapter" followed by a number and title. Capture the entire heading, ensuring no unrelated text is included.

Equation

Description

Symbols and numbers arranged to represent a mathematical concept.

Instructions

Draw boxes around all mathematical expressions, excluding any accompanying text or numbers identifying the equations.

Equation Number

Description

Numerals used to uniquely identify equations.

Instructions

Identify numbers in parentheses next to equations. Do not include equation text or variables.

Figure

Description

Visual content such as graphs, diagrams, or images.

Instructions

Outline the entire graphical representation. Do not include captions or any surrounding text.

Figure Caption

Description

Text providing a description or explanation of a figure.

Instructions

Identify the text directly associated with a figure below it. Ensure no unrelated figures or text are included.

Footnote

Description

Clarifications or additional details located at the bottom of a page.

Instructions

Locate text at the page's bottom that refers back to a mark or reference in the main text. Exclude any unrelated content.

List of Content Heading

Description

Headings at the start of a list, identifying its purpose or content.

Instructions

Identify and label only the heading for lists in content sections. Do not include subsequent list items.

List of Content Text

Description

The detailed entries or points in a list.

Instructions

Identify each item in a content list. Exclude list headings and any non-list content.

Page Number

Description

Numerical indication of the current page.

Instructions

Locate numbers typically positioned at the top or bottom margins. Do not include text or symbols beside the numbers.

Paragraph

Description

Blocks of text separated by spacing or indentation.

Instructions

Enclose individual text blocks that form coherent sections. Ensure each paragraph is distinguished separately.

Reference Text

Description

Bibliographic information found typically in a reference section.

Instructions

Identify the full reference entries. Ensure each citation is clearly distinguished without over
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mohamed Traore (2022). Table Extraction Pdf Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/table-extraction-pdf/model/6

Table Extraction Pdf Dataset

table-extraction-pdf

table-extraction-pdf-dataset

Explore at:

zipAvailable download formats

Dataset updated

Nov 4, 2022

Dataset authored and provided by

Mohamed Traore

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Variables measured

Data Table Bounding Boxes

Description

The dataset comes from Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure - creators of CascadeTabNet.

Depending on the dataset version downloaded, the images will include annotations for 'borderless' tables, 'bordered' tables', and 'cells'. Borderless tables are those in which every cell in the table does not have a border. Bordered tables are those in which every cell in the table has a border, and the table is bordered. Cells are the individual data points within the table.

A subset of the full dataset, the ICDAR Table Cells Dataset, was extracted and imported to Roboflow to create this hosted version of the Cascade TabNet project. All the additional dataset components used in the full project are available here: All Files.

Versions:

Version 1, raw-images : 342 raw images of tables. No augmentations, preprocessing step of auto-orient was all that was added.
Version 2, tableBordersOnly-rawImages : 342 raw images of tables. This dataset version contains the same images as version 1, but with the caveat of Modify Classes being applied to omit the 'cell' class from all images (rendering these images to be apt for creating a model to detect 'borderless' tables and 'bordered' tables.

For the versions below: Preprocessing step of Resize (416by416 Fit within-white edges) was added along with more augmentations to increase the size of the training set and to make our images more uniform. Preprocessing applies to all images whereas augmentations only apply to training set images. 3. Version 3, augmented-FAST-model : 818 raw images of tables. Trained from Scratch (no transfer learning) with the "Fast" model from Roboflow Train. 3X augmentation (generated images). 4. Version 4, augmented-ACCURATE-model : 818 raw images of tables. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation. 5. Version 5, tableBordersOnly-augmented-FAST-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Fast" model from Roboflow Train. 3X augmentation. 6. Version 6, tableBordersOnly-augmented-ACCURATE-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation.

Example Image from the Datasethttps://i.imgur.com/ruizSQN.png" alt="Example Image from the Dataset">

Cascade TabNet in Actionhttps://i.imgur.com/nyn98Ue.png" alt="Cascade TabNet in Action"> CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank dataset. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset.

From the Original Authors:

If you find this work useful for your research, please cite our paper: @misc{ cascadetabnet2020, title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents}, author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure}, year={2020}, eprint={2004.12629}, archivePrefix={arXiv}, primaryClass={cs.CV} }

Clear search

Close search

Google apps

Main menu

Table Extraction Pdf Dataset

Versions:

From the Original Authors:

Paper Parts Fsod Rmrg Dataset

Overview

Introduction

Object Classes

Author

Description

Instructions

Chapter

Description

Instructions

Equation

Description

Instructions

Equation Number

Description

Instructions

Figure

Description

Instructions

Figure Caption

Description

Instructions

Footnote

Description

Instructions

List of Content Heading

Description

Instructions

List of Content Text

Description

Instructions

Page Number

Description

Instructions

Paragraph

Description

Instructions

Reference Text

Description

Paper Parts Fsod Okht Dataset

Overview

Introduction

Object Classes

Author

Description

Instructions

Chapter

Description

Instructions

Equation

Description

Instructions

Equation Number

Description

Instructions

Figure

Description

Instructions

Figure Caption

Description

Instructions

Footnote

Description

Instructions

List of Content Heading

Description

Instructions

List of Content Text

Description

Instructions

Page Number

Description

Instructions

Paragraph

Description

Instructions

Reference Text