Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The IAM Handwriting Dataset is a collection of handwritten passages by several writers. It is generally used to classify writers according to their writing styles. A traditional way of solving this problem is to extract features such as the spacing between letters, curvatures, etc. and feed them into Support Vector Machines. I wanted to solve it with deep learning instead, using Keras and TensorFlow. For that purpose we don't need the full IAM Handwriting Dataset, only a suitable subset that can be used for training, such as the images from the 50 writers who contributed the most to the dataset.
This dataset contains images of each handwritten sentence with the dash-separated filename format. The first field represents the test code, second the writer id, third passage id, and fourth the sentence id.
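As a rough illustration of the filename convention above, the following sketch splits a sentence image name into its four fields and counts images per writer to pick the most prolific ones. The sample filenames and the helper names (`parse_sentence_filename`, `top_writers`) are made up for illustration and are not part of the dataset; only the field order comes from the description.

```python
from collections import Counter
from pathlib import PurePath
from typing import Iterable, List, NamedTuple


class SentenceName(NamedTuple):
    test_code: str
    writer_id: str
    passage_id: str
    sentence_id: str


def parse_sentence_filename(filename: str) -> SentenceName:
    """Split '<test>-<writer>-<passage>-<sentence>.<ext>' into its fields."""
    stem = PurePath(filename).stem  # drop directory and extension
    fields = stem.split("-")
    if len(fields) != 4:
        raise ValueError(f"expected 4 dash-separated fields in {filename!r}")
    return SentenceName(*fields)


def top_writers(filenames: Iterable[str], n: int = 50) -> List[str]:
    """Return the ids of the n writers with the most images."""
    counts = Counter(parse_sentence_filename(f).writer_id for f in filenames)
    return [writer for writer, _ in counts.most_common(n)]


if __name__ == "__main__":
    # Hypothetical filenames, used only to exercise the helpers.
    names = ["a01-000-s01-00.png", "a01-000-s01-01.png", "a02-007-s00-00.png"]
    print(parse_sentence_filename(names[0]))
    print(top_writers(names, n=1))
```

Counting with `Counter.most_common` keeps the selection of the top 50 writers to a single pass over the file list.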
This dataset would not exist without the help of FKI Computer Vision and Artificial Intelligence, as I came across the IAM Handwriting dataset on their website.
I would like to see people use this data for more insights, exploratory notebooks, and much more, because handwriting recognition is not an easy task to tackle alone. I need you Kagglers to have a look at it.
words.tgz: Contains words (example: a01/a01-122/a01-122-s01-02.png)
xml.tgz: Contains the meta-information in XML format (example: a01-122.xml)
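The example path suggests a nested layout in which the first directory is the first dash-separated field of the image id and the second directory is the first two fields joined. The sketch below rebuilds the relative path from an id under that assumption; the layout is inferred from the single example given, and `word_image_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def word_image_path(image_id: str) -> PurePosixPath:
    """Build '<f1>/<f1>-<f2>/<image_id>.png' from a dash-separated image id.

    Assumes the directory layout implied by the example
    a01/a01-122/a01-122-s01-02.png.
    """
    fields = image_id.split("-")
    if len(fields) < 3:
        raise ValueError(f"unexpected image id: {image_id!r}")
    form_id = "-".join(fields[:2])  # e.g. 'a01-122'
    return PurePosixPath(fields[0]) / form_id / f"{image_id}.png"


if __name__ == "__main__":
    print(word_image_path("a01-122-s01-02"))
```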
The IAM Handwriting Database is publicly accessible and freely available for non-commercial research purposes. If you are using data from the IAM Handwriting Database, we request you to register, so we are aware of who is using our data. If you are publishing scientific work based on the IAM Handwriting Database, we request you to include a reference to the paper.
https://fki.tic.heia-fr.ch/databases/download-the-iam-handwriting-database
This dataset was generated employing a technique of randomly crossing out words from the IAM database, utilizing several types of strokes. The ratio of cross-out words to regular words in handwritten documents can vary greatly depending on the document and context. However, typically, the number of cross-out words is small compared with regular words. To ensure a realistic ratio of regular to cross-out words in our synthetic database, 30% of samples from the IAM training set were selected. First, the bounding box of each word in a line was detected. The bounding box covers the core area of the word. Then, at random, a word is crossed out within the core area. Each line contains a randomly struck-out word at a different position. The annotation of these struck-out words was replaced with the symbol #.
The folder has:
s-s0 images
Syn-trainset
Syn-validset
Syn_IAM_testset
The transcription files are in the format of
Filename, threshold label of handwritten line
s-s0-0,157 A # to stop Mr. Gaitskell from
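A line in this format can be split into its three parts as sketched below. The split rule (first comma, then first space) is an assumption based on the one sample line shown; `parse_transcription_line` is a hypothetical helper, not code shipped with the dataset.

```python
from typing import Tuple


def parse_transcription_line(line: str) -> Tuple[str, int, str]:
    """Split '<filename>,<threshold> <label>' into (filename, threshold, label)."""
    name, comma, rest = line.rstrip("\n").partition(",")
    if not comma:
        raise ValueError(f"no comma found in line: {line!r}")
    # Only the first space separates the threshold from the label,
    # so spaces and commas inside the label are preserved.
    threshold, _, label = rest.strip().partition(" ")
    return name.strip(), int(threshold), label


if __name__ == "__main__":
    print(parse_transcription_line("s-s0-0,157 A # to stop Mr. Gaitskell from"))
```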
Cite the below work if you have used this dataset:
"A deep learning approach to handwritten text recognition in the presence of struck-out text"
https://ieeexplore.ieee.org/document/8961024
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Dataset Information
Keywords
Hebrew, handwritten, letters
Description
HDD_v0 consists of images of isolated Hebrew characters together with training and test sets subdivision. The images were collected from hand-filled forms. For more details, please refer to [1]. When using this dataset in research work, please cite [1]. [1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting… See the full description on the dataset page: https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the line-level version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).
Within the central repository, there are subfolders of each document converted into lines. All images are in .png format. In the main folder there are three .txt files.
1) SMHD.txt contains all the line-level transcriptions in the form of
image name, threshold value, label
0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.
In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
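Because crossed-out content is marked with '#', lines with and without cross-outs can be separated with a plain substring check. The sketch below assumes '#' never occurs as an ordinary character in a label, which the description implies but does not state outright; the helper names and the second sample label are illustrative only.

```python
from typing import Dict, List, Tuple

CROSS_OUT = "#"  # marker used in the transcription files for crossed-out content


def count_cross_outs(label: str) -> int:
    """Number of cross-out markers in one transcription label."""
    return label.count(CROSS_OUT)


def split_by_cross_out(labels: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (names without cross-outs, names with cross-outs)."""
    clean: List[str] = []
    crossed: List[str] = []
    for name, label in labels.items():
        (crossed if CROSS_OUT in label else clean).append(name)
    return clean, crossed


if __name__ == "__main__":
    sample = {
        "0001-000": "Bombay Phenotype :-",
        "example-001": "a # made-up label",  # hypothetical entry
    }
    print(split_by_cross_out(sample))
```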
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset is collected from 250 High school students. We gave them 30 minutes to think about the topic and write for this task.
In the third exercise, we select participants from different subjects and ask them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.
In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Egyptian Handwriting Dataset
A dataset of 11k+ handwritten Arabic words from Egyptian writers, extracted and tightly cropped from scanned paper forms. This dataset offers diverse handwriting samples ranging from children to elderly contributors, making it ideal for training robust Arabic handwriting recognition models.
Each form contains 6 unique words, resulting in 24 handwritten word images per form. Each word is written four times by the same writer to capture… See the full description on the dataset page: https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The data were collected from Professor Tom Gedeon and from the complete handwriting pages of the CEDAR handwriting dataset. The CHoiCE dataset contains cursive handwritten characters in 62 classes ("0-9, a-z, A-Z"); each class has at least 40 images in both the original data and the binarized data. Each image is a 28x28 ".png" file. The 62 classes correspond to the folders "0" to "61", in the order given in "label.txt". The dataset is divided into two parts: the unprocessed original images are stored in folders "0" to "61" under "V0.3/data", and the binarized images are stored in folders "0" to "61" under "V0.3/data-bin".
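A folder name can be turned into its character with a simple lookup. The sketch below assumes the class order is exactly 0-9, then a-z, then A-Z, as the description suggests; the authoritative order is whatever "label.txt" contains, so treat `CLASSES` and `folder_to_char` as illustrative stand-ins.

```python
import string

# Assumed class order: digits, lowercase letters, uppercase letters (62 total).
CLASSES = string.digits + string.ascii_lowercase + string.ascii_uppercase


def folder_to_char(folder: str) -> str:
    """Map a folder name '0'..'61' to the character it is assumed to hold."""
    index = int(folder)
    if not 0 <= index < len(CLASSES):
        raise ValueError(f"folder index out of range: {folder!r}")
    return CLASSES[index]


if __name__ == "__main__":
    for folder in ("0", "10", "36", "61"):
        print(folder, "->", folder_to_char(folder))
```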
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/WYRTKS
Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported. This dataset is available for research and academic use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises the images for the personalization described in the paper Personalizing Handwriting Recognition Systems with Limited User-Specific Samples.
Dataset Statistics (v.1.0)
More details can be found on the Github Repository: Github GoBo
Model gobo_Baselinemodel.hdf5
This dataset was created by Gwachat Kozah
The IAM-database: an English sentence database for offline handwriting recognition.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.
Dataset Description: We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.
In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset is collected from 250 High school students. We gave them 30 minutes to think about the topic and write for this task.
In the third exercise, we select participants from different subjects and ask them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.
In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Thai Handwriting Dataset
This dataset combines two major Thai handwriting datasets:
BEST 2019 Thai Handwriting Recognition dataset (train-0000.parquet)
Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)
Maintainer
kobkrit@iapp.co.th
Dataset Description
BEST 2019 Dataset
Contains handwritten Thai text images along with their ground truth transcriptions. The images have been processed and standardized for machine learning tasks.… See the full description on the dataset page: https://huggingface.co/datasets/iapp/thai_handwriting_dataset.
This data was donated to AI Dept. RuG by IBM in 2007.
It contains *.STK files (ASCII) containing on-line
handwriting pen-tip coordinates (x,y). The format
can be converted to unipen.
For internal use at RuG only.
Lambert Schomaker
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Thai Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.
Dataset Content & Diversity: Containing more than 2000 images, this Thai OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, individual item names, quantities, comments, etc. on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow limited (less than three) unique images in a single handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of space contains visible Thai text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and images were captured by native Thai people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We created a character dataset by collecting samples from 12 writers. Each writer contributed with letters (lower and uppercase), digits, and words from a pangram that we have not employed in our experiments, but they are included in "extra" folder for each writer in this database. Up to 4 samples have been collected for each pair writer/character, and the total number of samples in this database version is 2812.
Database structure:
scanner.py - character scanning program, used for dataset collection.
convert2mnist.py - a program for converting a dataset into a mnist-like form. It is intended for an example with the test.
example_using.py - example of a primitive grid for character recognition. It is intended only to demonstrate the consistency of the dataset. When using the dataset, of course, the user can and will use their own, more advanced approaches.
data - folder with the dataset.
w_n_m - folder with a writer's attempt (37 folders in total).
[char] - the main file of the symbol track, a text file with a list of coordinates of the form "x1","y1","x2","y2",...,"xN","yN".
[char]_times - a file with additional information on the track: a list of times in ms between receiving the coordinates of points.
[char].png - an auxiliary file, a picture of the symbol as it was visible to the writer. The file is for understanding only.
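A track file in the coordinate format described above can be read into a list of (x, y) points roughly as follows. This is a sketch that assumes the values are quoted, comma-separated numbers exactly as in the description; `parse_track` is a hypothetical helper and is not taken from the dataset's own scripts.

```python
import csv
import io
from typing import List, Tuple


def parse_track(text: str) -> List[Tuple[float, float]]:
    """Parse '"x1","y1","x2","y2",...' into a list of (x, y) points."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Flatten all rows and drop empty fields (e.g. from a trailing comma).
    values = [field for row in reader for field in row if field.strip()]
    if len(values) % 2 != 0:
        raise ValueError("odd number of coordinates in track")
    numbers = [float(v) for v in values]
    # Pair up even-indexed x values with odd-indexed y values.
    return list(zip(numbers[0::2], numbers[1::2]))


if __name__ == "__main__":
    # Made-up coordinates, used only to exercise the parser.
    print(parse_track('"10","20","11","22","13","25"'))
```

The same parser could be applied to a `[char]_times` file by reading the values without pairing them, if that file uses the same quoting.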
Class distribution in example_using.py, which you can find in github repository provided below:
[A] = { "а" , "А" } [Б] = { "б" , "Б" } [В] = { "в" , "В" } [Г] = { "г" , "Г" } [Д] = { "д" , "Д" } [Е] = { "е" , "Е" } [Ё] = { "ё" , "Ё" } [Ж] = { "ж" , "Ж" } [З] = { "з" , "З" } [И] = { "и" , "И" } [Й] = { "й" , "Й" } [К] = { "к" , "К" } [Л] = { "л" , "Л" } [М] = { "м" , "М" } [Н] = { "н" , "Н" } [О] = { "о" , "О", "0" } [П] = { "п" , "П" } [Р] = { "р" , "Р" } [С] = { "с" , "С" } [Т] = { "т" , "Т" } [У] = { "у" , "У" } [Ф] = { "ф" , "Ф" } [Х] = { "х" , "Х" } [Ц] = { "ц" , "Ц" } [Ч] = { "ч" , "Ч" } [Ш] = { "ш" , "Ш" } [Щ] = { "щ" , "Щ" } [Ъ] = { "ъ" , "Ъ" } [Ы] = { "ы" , "Ы" } [Ь] = { "ь" , "Ь" } [Э] = { "э" , "Э" } [Ю] = { "ю" , "Ю" } [Я] = { "я" , "Я" } [1] = { "1" } [2] = { "2" } [3] = { "3" } [4] = { "4" } [5] = { "5" } [6] = { "6" } [7] = { "7" } [8] = { "8" } [9] = { "9" }
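The grouping listed above merges the lowercase and uppercase form of each letter into one class and additionally places the digit "0" in the same class as the letter "О". A minimal sketch of that mapping is given below; `class_of` is a hypothetical helper, not the code from example_using.py, and it uses the Cyrillic capital letter as the class name throughout.

```python
LOWERCASE = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33 letters of the Russian alphabet
DIGITS = "0123456789"


def class_of(symbol: str) -> str:
    """Return the class label for a single character of the dataset."""
    if symbol == "0":
        return "О"  # Cyrillic capital O: the digit zero shares the letter's class
    # Digits 1-9 map to themselves; letters map to their uppercase form.
    return symbol.upper()


if __name__ == "__main__":
    classes = {class_of(ch) for ch in LOWERCASE + LOWERCASE.upper() + DIGITS}
    print(len(classes), "classes")
```

Merging "0" with "О" avoids asking a classifier to separate two shapes that are nearly identical in handwriting.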
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Handwritten Letters is a dataset for classification tasks - it contains Letters annotations for 3,410 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English composition, poetry, prose, news, stories, etc. The data is annotated with line-level quadrilateral bounding boxes and transcriptions of the text. The dataset can be used for tasks such as English handwriting OCR.
IAM Handwriting Database
The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments.
The database was first published in [1] at the ICDAR 1999. Using this database an HMM based recognition system for handwritten sentences was developed and published in [2] at the ICPR 2000. The segmentation scheme used in the second version of the database is documented in [3] and has been published in the ICPR 2002. The IAM-database as of October 2002 is described in [4]. We use the database extensively in our own research, see publications for further details.
The database contains forms of unconstrained handwritten text, which were scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels. The figure below provides samples of a complete form, a text line and some extracted words.
Characteristics
The IAM Handwriting Database 3.0 is structured as follows:
657 writers contributed samples of their handwriting
1'539 pages of scanned text
5'685 isolated and labeled sentences
13'353 isolated and labeled text lines
115'320 isolated and labeled words
The words have been extracted from pages of scanned text using an automatic segmentation scheme and were verified manually. The segmentation scheme has been developed at our institute [3].
All form, line and word images are provided as PNG files and the corresponding form label files, including segmentation information and variety of estimated parameters (from the preprocessing steps described in [2]), are included in the image files as meta-information in XML format which is described in XML file and XML file format (DTD).
References
[1] U. Marti and H. Bunke. A full English sentence database for off-line handwriting recognition. In Proc. of the 5th Int. Conf. on Document Analysis and Recognition, pages 705 - 708, 1999.
[2] U. Marti and H. Bunke. Handwritten Sentence Recognition. In Proc. of the 15th Int. Conf. on Pattern Recognition, Volume 3, pages 467 - 470, 2000.
[3] M. Zimmermann and H. Bunke. Automatic Segmentation of the IAM Off-line Database for Handwritten English Text. In Proc. of the 16th Int. Conf. on Pattern Recognition, Volume 4, pages 35 - 39, 2002.
[4] U. Marti and H. Bunke. The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. Int. Journal on Document Analysis and Recognition, Volume 5, pages 39 - 46, 2002.
[5] S. Johansson, G.N. Leech and H. Goodluck. Manual of Information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital Computers. Department of English, University of Oslo, Norway, 1978.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Handwriting is a dataset for object detection tasks - it contains Digits annotations for 1,866 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).