MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
IAM - line level
Dataset Summary
The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments. Note that all images are resized to a fixed height of 128 pixels.
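The fixed-height normalization mentioned above can be sketched as a size calculation. This is a minimal illustration only: it assumes the aspect ratio is preserved when resizing, which the dataset card does not state, and the function name `size_for_fixed_height` is hypothetical.

```python
def size_for_fixed_height(width: int, height: int, target_height: int = 128) -> tuple[int, int]:
    """Return (new_width, new_height) for an image scaled to a fixed height.

    Assumption: the width is scaled by the same factor as the height,
    so the aspect ratio of the text line is preserved.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    new_width = max(1, round(width * scale))
    return new_width, target_height


# A 2000x160 line image would become 1600x128 under this assumption.
print(size_for_fixed_height(2000, 160))
```

The resulting size could then be passed to whatever image library is used for the actual resampling.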
Languages
All the documents in the dataset are written in English.
Dataset Structure
Data Instances
{ 'image':… See the full description on the dataset page: https://huggingface.co/datasets/Teklia/IAM-line.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Thai Handwriting Dataset
This dataset combines two major Thai handwriting datasets:
BEST 2019 Thai Handwriting Recognition dataset (train-0000.parquet)
Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)
Maintainer
kobkrit@iapp.co.th
Dataset Description
BEST 2019 Dataset
Contains handwritten Thai text images along with their ground truth transcriptions. The images have been processed and standardized for machine learning tasks.… See the full description on the dataset page: https://huggingface.co/datasets/iapp/thai_handwriting_dataset.
This dataset was generated by randomly crossing out words from the IAM database using several types of strokes. The ratio of crossed-out words to regular words in handwritten documents varies greatly with the document and context, but crossed-out words are typically few compared with regular words. To keep a realistic ratio of regular to crossed-out words in our synthetic database, 30% of the samples from the IAM training set were selected. First, the bounding box of each word in a line was detected; the bounding box covers the core area of the word. Then a randomly chosen word was crossed out within its core area, so each line contains a struck-out word at a different position. In the annotations, these struck-out words were replaced with the symbol #.
The folder has:
s-s0 images
Syn-trainset
Syn-validset
Syn_IAM_testset
The transcription files are in the format of
Filename, threshold label of handwritten line
s-s0-0,157 A # to stop Mr. Gaitskell from
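A transcription line in this format can be split into its three parts as sketched below. The sketch assumes the layout suggested by the example line (filename, a comma, an integer threshold, a space, then the label); the names `LineRecord` and `parse_transcription_line` are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class LineRecord(NamedTuple):
    filename: str
    threshold: int
    label: str


def parse_transcription_line(line: str) -> LineRecord:
    """Split 'filename,threshold label' into its three fields."""
    # The filename ends at the first comma.
    filename, rest = line.rstrip("\n").split(",", 1)
    # The threshold ends at the first space; the remainder is the label.
    threshold, _, label = rest.strip().partition(" ")
    return LineRecord(filename.strip(), int(threshold), label.strip())


record = parse_transcription_line("s-s0-0,157 A # to stop Mr. Gaitskell from")
print(record)
# Each '#' token in the label marks one struck-out word.
print(record.label.split().count("#"))
```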
Cite the below work if you have used this dataset:
"A deep learning approach to handwritten text recognition in the presence of struck-out text"
https://ieeexplore.ieee.org/document/8961024
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset "A-Z Handwritten Alphabets in .csv format" [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).
In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:
Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a "text line" style on a black background, providing YOLO-compatible .txt annotations that specify bounding boxes for each letter.
Bounding boxes: (x, y, width, height) in YOLO format.
Class IDs: 0 = Normal, 1 = Reversal, and 2 = Corrected.
If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:
[1] P. J. Grother, "NIST Special Database 19," NIST, 2016. [Online]. Available: https://www.nist.gov/srd/nist-special-database-19
[2] S. Patel, "A-Z Handwritten Alphabets in .csv format," Kaggle, 2017. [Online]. Available: https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format
Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.
The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.
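The YOLO-compatible .txt annotations described above can be read with a few lines of code. The sketch below assumes the common YOLO convention of one box per line, written as `class x_center y_center width height` with coordinates normalized to the image size; the dataset description does not spell this out, and the sample label text and image size are made up for illustration.

```python
# Class IDs as listed in the dataset description.
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}


def parse_yolo_label(text: str, image_width: int, image_height: int):
    """Convert YOLO label lines to (class_name, left, top, width, height) in pixels."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_id = int(parts[0])
        x_center, y_center, width, height = map(float, parts[1:])
        # Assumed convention: normalized center coordinates and box size.
        left = (x_center - width / 2) * image_width
        top = (y_center - height / 2) * image_height
        boxes.append((CLASS_NAMES[class_id], left, top,
                      width * image_width, height * image_height))
    return boxes


# Hypothetical label file content for a 640x64 "text line" image.
example = "0 0.25 0.5 0.125 0.5\n1 0.75 0.5 0.125 0.5\n"
boxes = parse_yolo_label(example, 640, 64)
for box in boxes:
    print(box)
```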
The process for collecting this dataset was documented in the paper "Development of Extensive Polish Handwritten Characters Database for Text Recognition Research" (https://doi.org/10.12913/22998624/122567) by Mikhail Tokovarov, dr Monika Kaczorowska and dr Marek Miłosz. Link to download the original dataset: https://cs.pollub.pl/phcd/. The source fileset also contains a dataset of raw images of whole sentences written in Polish.
PHCD (Polish Handwritten Characters Database) is a collection of handwritten texts in Polish. It was created by researchers at Lublin University of Technology for the purpose of offline handwritten text recognition. The database contains more than 530,000 images of handwritten characters. Each image is a 32x32 pixel grayscale image representing one of 89 classes (10 digits, 26 lowercase Latin letters, 26 uppercase Latin letters, 9 lowercase Polish letters, 9 uppercase Polish letters and 9 special characters), with around 6,000 examples per class.
This notebook contains a PyTorch example of how to load the dataset from .npz files and train a CNN model. You can also use the dataset with other frameworks, such as TensorFlow, Keras, etc.
For .npz files, use the numpy.load method.
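As a rough illustration of loading with numpy.load, the sketch below builds a tiny in-memory stand-in for a PHCD archive and reads it back. The array names `images` and `labels`, and the label range 0 to 88, are assumptions made for this example; inspect `archive.files` on the real .npz files to see the actual names.

```python
import io

import numpy as np

# Build a small in-memory stand-in for a PHCD .npz file.
# Assumption: array names "images" and "labels" are hypothetical.
buffer = io.BytesIO()
np.savez(
    buffer,
    images=np.zeros((4, 32, 32), dtype=np.uint8),  # four blank 32x32 grayscale images
    labels=np.array([0, 1, 2, 88]),                # assumed class indices in 0..88
)
buffer.seek(0)

# numpy.load returns a lazy archive; list its arrays, then read them.
with np.load(buffer) as archive:
    print(archive.files)
    images = archive["images"]
    labels = archive["labels"]

# Scale pixel values to [0, 1] before feeding a CNN.
images = images.astype(np.float32) / 255.0
print(images.shape, labels.shape)
```

With the real files, the `buffer` argument would be replaced by the path to the downloaded .npz file.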
The dataset contains the following:
I want to express my gratitude to the following people: Dr. Edyta Łukasik for introducing me to this dataset and to authors of this dataset - Mikhail Tokovarov, dr. Monika Kaczorowska and dr. Marek Miłosz from Lublin University of Technology in Poland.
You can use this data the same way you used MNIST, KMNIST or Fashion MNIST: refine your image classification skills, and use GPUs and TPUs to implement CNN architectures for models that perform such multiclass classification.
This dataset was created by DR. IZA SAZANITA ISA
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
This is the line-level version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).
Within the central repository, there are subfolders of each document converted into lines. All images are in .png format. In the main folder there are three .txt files.
1) SMHD.txt contains all the line-level transcriptions in the form of
image name, threshold value, label
0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.
In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
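Because crossed-out content is marked with '#', transcription lines can be separated into those with and without cross-outs. The sketch below assumes the `image name,threshold label` layout shown in the SMHD.txt example; the function name `split_by_cross_out` is hypothetical, and the second sample line is invented for illustration.

```python
def split_by_cross_out(lines):
    """Return (crossed, clean): lines whose label does or does not contain '#'."""
    crossed, clean = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Drop the image name (before the comma) and the threshold (before the first space).
        _, _, rest = line.partition(",")
        _, _, label = rest.strip().partition(" ")
        (crossed if "#" in label else clean).append(line)
    return crossed, clean


sample_lines = [
    "0001-000,178 Bombay Phenotype :-",  # example line from the dataset description
    "0001-001,165 the # blood group",    # hypothetical line with one cross-out
]
crossed, clean = split_by_cross_out(sample_lines)
print(len(crossed), len(clean))
```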
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
In the second exercise, we asked participants to write an essay on a topic from a given list or on any topic of their choice. We called it the essay-based dataset. This dataset was collected from 250 high school students. We gave them 30 minutes to think about the topic and write for this task.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current studies. We called it the subject-based dataset. For this study, we recruited undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After the lesson finished, in almost 10 minutes, we asked students to recheck their notes and compare them with those of other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in the class notes than in the summary-based and academic-based collections.
In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset comprises the images for the personalization described in the paper Personalizing Handwriting Recognition Systems with Limited User-Specific Samples.
Dataset Statistics (v.1.0)
* Handwritten word-level images
* English
* 40 Participants
* 5 sets from different sources for personalization
* 2 sets from 2 domains (same domains as 2 personalization sets) for testing
* 926 words/writer, 37k words in total
More details can be found on the Github Repository:
Github GoBo
Model
gobo_Baselinemodel.hdf5
Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
License information was derived automatically
Dataset Information
Keywords
Hebrew, handwritten, letters
Description
HHD_v0 consists of images of isolated Hebrew characters together with training and test sets subdivision. The images were collected from hand-filled forms. For more details, please refer to [1]. When using this dataset in research work, please cite [1]. [1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting… See the full description on the dataset page: https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Arabic Handwritten Digits Dataset
Abstract
In recent years, handwritten digit recognition has been an important area due to its applications in several fields. This work focuses on the recognition part of handwritten Arabic digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits. LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database (Arabic handwritten digit images), which contains 60,000 training and 10,000 testing images. A comparison is made among the results, and it shows that the use of a CNN led to significant improvements over different machine-learning classification algorithms. Moreover, the CNN gives an average recognition accuracy of 99.15%.
Context
The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten digit recognition. In recent years, Arabic handwritten digit recognition has also had to deal with different handwriting styles, making it important to find and work on a new and advanced solution for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions.
Content
MADBase is a modified Arabic handwritten digits database containing 60,000 training images and 10,000 test images. MADBase was written by 700 writers. Each writer wrote each digit (from 0 to 9) ten times. To ensure that different writing styles are included, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution.
MADBase is available for free and can be downloaded from http://datacenter.aucegypt.edu/shazeem/.
Acknowledgements
CNN for Handwritten Arabic Digits Recognition Based on LeNet-5
http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54
Ahmed El-Sawy, Hazem El-Bakry, Mohamed Loey
Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016
Volume 533 of the series Advances in Intelligent Systems and Computing, pp 566-575
Inspiration
Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, and the number and position of dots. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in isolated position.
Arabic Handwritten Characters Dataset
https://www.kaggle.com/mloey1/ahcd1
Benha University
http://bu.edu.eg/staff/mloey
https://mloey.github.io/
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Thai Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.
Dataset Content & Diversity: Containing more than 2000 images, this Thai OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, and comments. The images in this dataset showcase distinct handwriting styles, letter sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Thai text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and images were captured by native Thai people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Here are a few use cases for this project:
Healthcare Automation: The model can be used to digitize handwritten medical prescriptions thus reducing manual transcription errors and streamlining the process in pharmacies and hospitals.
Historical Document Digitization: This model could be utilized for transcribing old handwritten medical documents for research purposes.
Handwriting Analysis Tool: The model can be used for general handwriting analysis purposes, for example in educational institutions to improve handwriting recognition or in forensic analysis.
OCR Software Improvement: This model can be integrated with OCR (Optical Character Recognition) software to enhance its performance in recognizing and interpreting handwritten text, capitalizing on the diverse range of characters available.
Medical Informatics Studies: Researchers using digital health records for epidemiological studies can utilize this model to extract data from handwritten prescriptions or doctor's notes.
The IAM-database: an English sentence database for offline handwriting recognition.
This data was donated to AI Dept. RuG by IBM in 2007
It contains *.STK files (ASCII) containing on-line
handwriting pen-tip coordinates (x,y). The format
can be converted to unipen.
For internal use at RuG only.
Lambert Schomaker
## Overview
Handwriting is a dataset for object detection tasks - it contains 1 annotations for 1,000 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Handwritten Letters is a dataset for classification tasks - it contains Letters annotations for 3,410 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The Urdu Handwritten Text Dataset contains high-quality images of handwritten Urdu text collected from native speakers across diverse demographics, including people with disabilities. The dataset covers the full Urdu character set, ligatures, diacritics, and dots, making it ideal for OCR, handwriting authentication, forensic analysis, and multilingual handwriting recognition research.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The DARWIN dataset includes handwriting data from 174 participants. The classification task consists in distinguishing Alzheimer's disease patients from healthy people.
Creator: Francesco Fontanella
Source: https://archive.ics.uci.edu/dataset/732/darwin
The DARWIN dataset was created to allow researchers to improve the existing machine-learning methodologies for the prediction of Alzheimer's disease via handwriting analysis.
Citation Requests/Acknowledgements
N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471. https://doi.org/10.1016/j.procs.2018.10.141
N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer's disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence, Vol. 111 (2022) 104822. https://doi.org/10.1016/j.engappai.2022.104822
The writers are Europeans who regularly write in Italian. The images were captured with a scanner at an eye-level collection angle. The dataset content includes addresses, company names, and personal names. The dataset can be used for tasks such as training Italian OCR models and handwritten text recognition systems.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Egyptian Handwriting Dataset
A dataset of 11k+ handwritten Arabic words from Egyptian writers, extracted and tightly cropped from scanned paper forms. This dataset offers diverse handwriting samples ranging from children to elderly contributors, making it ideal for training robust Arabic handwriting recognition models.
Each form contains 6 unique words, resulting in 24 handwritten word images per form. Each word is written four times by the same writer to capture… See the full description on the dataset page: https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset.