100+ datasets found

IAM-line
huggingface.co
Updated Jun 18, 2024
Cite
Teklia (2024). IAM-line [Dataset]. https://huggingface.co/datasets/Teklia/IAM-line
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 18, 2024
Dataset authored and provided by
Teklia
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
IAM - line level

Dataset Summary

The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments. Note that all images are resized to a fixed height of 128 pixels.

Languages

All the documents in the dataset are written in English.

Dataset Structure

Data Instances

{ 'image':… See the full description on the dataset page: https://huggingface.co/datasets/Teklia/IAM-line.
thai_handwriting_dataset
huggingface.co
Updated Nov 17, 2024
Cite
iApp Technology (2024). thai_handwriting_dataset [Dataset]. https://huggingface.co/datasets/iapp/thai_handwriting_dataset
Explore at:
Croissant
Dataset updated
Nov 17, 2024
Dataset authored and provided by
iApp Technology
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Thai Handwriting Dataset

This dataset combines two major Thai handwriting datasets:

BEST 2019 Thai Handwriting Recognition dataset (train-0000.parquet)
Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)

Maintainer

kobkrit@iapp.co.th

Dataset Description

BEST 2019 Dataset

Contains handwritten Thai text images along with their ground truth transcriptions. The images have been processed and standardized for machine learning tasks.… See the full description on the dataset page: https://huggingface.co/datasets/iapp/thai_handwriting_dataset.
Handwritten synthetic dataset from the IAM
researchdata.edu.au
research-repository.rmit.edu.au
Updated Nov 20, 2023
Cite
Hiqmat Nisa (2023). Handwritten synthetic dataset from the IAM [Dataset]. http://doi.org/10.25439/RMT.24309730.V1
Explore at:
Unique identifier
https://doi.org/10.25439/RMT.24309730.V1
Dataset updated
Nov 20, 2023
Dataset provided by
RMIT University, Australia
Authors
Hiqmat Nisa
Description
This dataset was generated by randomly crossing out words from the IAM database using several types of strokes. The ratio of crossed-out words to regular words in handwritten documents can vary greatly depending on the document and context. Typically, however, the number of crossed-out words is small compared with regular words. To ensure a realistic ratio of regular to crossed-out words in our synthetic database, 30% of samples from the IAM training set were selected. First, the bounding box of each word in a line was detected. The bounding box covers the core area of the word. Then a randomly chosen word was crossed out within its core area. Each line contains a randomly struck-out word at a different position. The annotation of these struck-out words was replaced with the symbol #.

The folder has:
s-s0 images
Syn-trainset
Syn-validset
Syn_IAM_testset
The transcription files are in the format of
Filename, threshold label of handwritten line
s-s0-0,157 A # to stop Mr. Gaitskell from
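
As a rough illustration of the transcription format above, the following Python sketch splits one line into filename, threshold, and label. It assumes the filename ends at the first comma and the threshold ends at the first space after it, which is inferred only from the single example shown; the names `TranscriptionLine` and `parse_transcription_line` are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class TranscriptionLine(NamedTuple):
    filename: str
    threshold: int
    label: str


def parse_transcription_line(line: str) -> TranscriptionLine:
    """Split 'filename,threshold label' into its three parts.

    Assumption: the filename contains no comma, and the threshold is the
    integer between the first comma and the first space that follows it.
    """
    filename, rest = line.rstrip("\n").split(",", 1)
    threshold, label = rest.split(" ", 1)
    return TranscriptionLine(filename, int(threshold), label)


# The example line quoted in the description.
example = parse_transcription_line("s-s0-0,157 A # to stop Mr. Gaitskell from")

# A struck-out word is annotated with the '#' symbol.
has_struck_out_word = "#" in example.label.split()
```

Splitting only once on the comma and once on the space keeps any later commas or spaces inside the label intact.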

Cite the below work if you have used this dataset:
"A deep learning approach to handwritten text recognition in the presence of struck-out text"
https://ieeexplore.ieee.org/document/8961024
Synthetic Dyslexia Handwriting Dataset (YOLO-Format)
zenodo.org
zip
Updated Feb 11, 2025
Cite
Nora Fink (2025). Synthetic Dyslexia Handwriting Dataset (YOLO-Format) [Dataset]. http://doi.org/10.5281/zenodo.14852659
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14852659
Dataset updated
Feb 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nora Fink
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).

In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:

78,275 images labeled as Normal

52,196 images labeled as Reversal

8,029 images labeled as Corrected

Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify bounding boxes for each letter.

Key Points of the Synthetic Generation Process

Letter-Level Source Data
Individual characters were sampled from the original image sets.

Randomized Layout
Letters are randomly assembled into words and lines, ensuring a wide variety of visual arrangements.

Bounding Box Labels
Each character is assigned a bounding box with (x, y, width, height) in YOLO format.

Class Annotations
Classes include 0 = Normal, 1 = Reversal, and 2 = Corrected.

Preservation of Visual Characteristics
Letters retain their key dyslexia-relevant features (e.g., reversals).
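
To make the annotation scheme above concrete, here is a minimal Python sketch that converts one YOLO label line into a pixel-space box. It assumes the common YOLO convention of `class x_center y_center width height` with the four box values normalised to the image size; the description only says "(x, y, width, height) in YOLO format", so that layout is an assumption. The function name and the sample label line are hypothetical.

```python
# Class ids as listed in the dataset description.
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}


def yolo_to_pixel_box(line: str, image_width: int, image_height: int):
    """Convert one YOLO label line to (class_name, left, top, right, bottom).

    Assumes 'class x_center y_center width height', with the box values
    normalised to [0, 1] relative to the image dimensions.
    """
    class_id, x_c, y_c, w, h = line.split()
    x_c, w = float(x_c) * image_width, float(w) * image_width
    y_c, h = float(y_c) * image_height, float(h) * image_height
    left, top = x_c - w / 2, y_c - h / 2
    return CLASS_NAMES[int(class_id)], left, top, left + w, top + h


# Made-up label line for a 640x64 "text line" image, for illustration only.
box = yolo_to_pixel_box("1 0.5 0.5 0.125 0.5", 640, 64)
```

Here `box` holds the class name followed by the pixel coordinates of the box corners, which is the form most drawing and cropping utilities expect.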

Historical References & Credits

If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:

M. S. A. B. Rosli, I. S. Isa, S. A. Ramlan, S. N. Sulaiman and M. I. F. Maruzuki, "Development of CNN Transfer Learning for Dyslexia Handwriting Recognition," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 194–199, doi: 10.1109/ICCSCE52189.2021.9530971.

N. S. L. Seman, I. S. Isa, S. A. Ramlan, W. Li-Chih and M. I. F. Maruzuki, "Notice of Removal: Classification of Handwriting Impairment Using CNN for Potential Dyslexia Symptom," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 188–193, doi: 10.1109/ICCSCE52189.2021.9530989.

Isa, Iza Sazanita. CNN Comparisons Models On Dyslexia Handwriting Classification / Iza Sazanita Isa … [et Al.]. Universiti Teknologi MARA Cawangan Pulau Pinang, 2021.

Isa, I. S., Rahimi, W. N. S., Ramlan, S. A., & Sulaiman, S. N. (2019). Automated detection of dyslexia symptom based on handwriting image for primary school children. Procedia Computer Science, 163, 440–449.

References to Original Data Sources

[1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available: https://www.nist.gov/srd/nist-special-database-19

[2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available: https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format

Usage & Citation

Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.

Password Note (Original Data)

The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.
PHCD - Polish Handwritten Characters Database
kaggle.com
Updated Dec 30, 2023
Cite
Wiktor Flis (2023). PHCD - Polish Handwritten Characters Database [Dataset]. https://www.kaggle.com/datasets/westedcrean/phcd-polish-handwritten-characters-database
Explore at:
Croissant
Dataset updated
Dec 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Wiktor Flis
Description

The process for collecting this dataset was documented in the paper "Development of Extensive Polish Handwritten Characters Database for Text Recognition Research" (https://doi.org/10.12913/22998624/122567) by Mikhail Tokovarov, dr Monika Kaczorowska and dr Marek Miłosz. Link to download the original dataset: https://cs.pollub.pl/phcd/. The source fileset also contains a dataset of raw images of whole sentences written in Polish.

Context

PHCD (Polish Handwritten Characters Database) is a collection of handwritten texts in Polish. It was created by researchers at Lublin University of Technology for the purpose of offline handwritten text recognition. The database contains more than 530 000 images of handwritten characters. Each image is a 32x32 pixel grayscale image representing one of 89 classes (10 digits, 26 lowercase Latin letters, 26 uppercase Latin letters, 9 lowercase Polish letters, 9 uppercase Polish letters and 9 special characters), with around 6 000 examples per class.

How to use

This notebook contains a PyTorch example of how to load the dataset from .npz files and train a CNN model. You can also use the dataset with other frameworks, such as TensorFlow, Keras, etc.

For .npz files, use the numpy.load method.

Contents

The dataset contains the following:

dataset.npz - a file with two compressed numpy arrays:

"signs" - with all the images, sized 32 x 32 (grayscale)

"labels" - with all the labels (0-88) for examples from signs

label_mapping.csv - a csv file with columns label and char, mapping from ids to characters from dataset

images - folder with original 530 000 png images, sized 32 x 32, to use with other loading techniques
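
The file layout above can be sketched in Python as follows. The array keys "signs" and "labels" come from the description; everything else is an assumption for illustration. In particular, the helper name `load_phcd` is hypothetical, and the small stand-in file written to a temporary directory only mimics the keys so that the sketch runs without the real download (the real array shapes and dtypes may differ).

```python
import os
import tempfile

import numpy as np


def load_phcd(npz_path):
    """Load the image and label arrays stored under 'signs' and 'labels'."""
    with np.load(npz_path) as archive:
        return archive["signs"], archive["labels"]


# Stand-in file with the same keys, so the sketch runs without the real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "dataset.npz")
np.savez_compressed(
    demo_path,
    signs=np.zeros((4, 32, 32), dtype=np.uint8),  # four blank 32 x 32 images
    labels=np.array([0, 1, 88, 42]),              # ids in the 0-88 range
)

signs, labels = load_phcd(demo_path)
```

With the real download, `demo_path` would simply be replaced by the path to dataset.npz, and label_mapping.csv can then be used to translate the integer ids into characters.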

Acknowledgements

I want to express my gratitude to the following people: Dr. Edyta Łukasik for introducing me to this dataset, and the authors of this dataset, Mikhail Tokovarov, dr. Monika Kaczorowska and dr. Marek Miłosz from Lublin University of Technology in Poland.

Inspiration

You can use this data the same way you used MNIST, KMNIST or Fashion MNIST: refine your image classification skills, use GPU & TPU to implement CNN architectures for models to perform such multiclass classifications.
Dyslexia Handwriting Dataset
kaggle.com
Updated Feb 8, 2022
Cite
DR. IZA SAZANITA ISA (2022). Dyslexia Handwriting Dataset [Dataset]. https://www.kaggle.com/datasets/drizasazanitaisa/dyslexia-handwriting-dataset
Explore at:
Croissant
Dataset updated
Feb 8, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
DR. IZA SAZANITA ISA
Description
Dataset

This dataset was created by DR. IZA SAZANITA ISA

Contents
A Messy Handwriting Dataset with Student Crossouts and Corrections...
researchdata.edu.au
research-repository.rmit.edu.au
Updated Nov 20, 2023
Cite
Hiqmat Nisa (2023). A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version) [Dataset]. http://doi.org/10.25439/RMT.24419986.V1
Explore at:
Unique identifier
https://doi.org/10.25439/RMT.24419986.V1
Dataset updated
Nov 20, 2023
Dataset provided by
RMIT University, Australia
Authors
Hiqmat Nisa
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).
Within the central repository, there are subfolders of each document converted into lines. All images are in .png format. In the main folder there are three .txt files.
1) SMHD.txt contains all the line-level transcriptions in the form
image name, threshold value, label
0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset having crossed-out and inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.
In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
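
A small Python sketch of how the '#' marker could be used to separate lines with and without crossed-out content. It assumes each transcription line looks like the example above (image name, comma, threshold value, space, label); the function name `partition_by_crossout` and the second sample line are made up for illustration.

```python
def partition_by_crossout(lines):
    """Return (with_crossout, without_crossout) lists of image names.

    Assumes each line is 'image_name,threshold label' and that crossed-out
    content is marked with '#' inside the label.
    """
    with_crossout, without_crossout = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        image_name, _, rest = line.partition(",")
        _threshold, _, label = rest.partition(" ")
        target = with_crossout if "#" in label else without_crossout
        target.append(image_name)
    return with_crossout, without_crossout


sample = [
    "0001-000,178 Bombay Phenotype :-",  # line quoted in the description
    "0001-001,178 The # blood group",    # made-up line for illustration
]
crossed, clean = partition_by_crossout(sample)
```

In practice the list of lines would come from reading one of the .txt files, for example with `open(path, encoding="utf-8")`.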
Dataset Description:
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.
In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset was collected from 250 high school students. We gave them 30 minutes to think about the topic and write for this task.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.
In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
GoBo - A Handwriting Recognition dataset for Personalization
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jun 28, 2023
Cite
Christian Gold; Dario van den Boom; Torsten Zesch (2023). GoBo - A Handwriting Recognition dataset for Personalization [Dataset]. http://doi.org/10.5281/zenodo.8085511
Explore at:
Available download formats: bin, zip
Unique identifier
https://doi.org/10.5281/zenodo.8085511
Dataset updated
Jun 28, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Christian Gold; Dario van den Boom; Torsten Zesch
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset comprises the images for the personalization described in the paper Personalizing Handwriting Recognition Systems with Limited User-Specific Samples.

Dataset Statistics (v.1.0)

* Handwritten word-level images
* English
* 40 Participants
* 5 sets from different sources for personalization
* 2 sets from 2 domains (same domains as 2 personalization sets) for testing
* 926 words/writer, 37k words in total

More details can be found on the Github Repository:
Github GoBo

Model
gobo_Baselinemodel.hdf5
hebrew-handwritten-dataset
huggingface.co
Updated May 10, 2023
Cite
Sivan Ratson (2023). hebrew-handwritten-dataset [Dataset]. https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset
Explore at:
Croissant
Dataset updated
May 10, 2023
Authors
Sivan Ratson
License
Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
License information was derived automatically
Description
Dataset Information

Keywords

Hebrew, handwritten, letters

Description

HDD_v0 consists of images of isolated Hebrew characters together with training and test sets subdivision. The images were collected from hand-filled forms. For more details, please refer to [1]. When using this dataset in research work, please cite [1]. [1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting… See the full description on the dataset page: https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset.
Arabic Handwritten Digits Dataset
figshare.com
bin
Updated May 31, 2023
Cite
Mohamed Loey (2023). Arabic Handwritten Digits Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12236948.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.12236948.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Mohamed Loey
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Arabic Handwritten Digits Dataset

Abstract
In recent years, handwritten digit recognition has been an important area due to its applications in several fields. This work focuses on the recognition part of handwritten Arabic digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits. LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database (Arabic handwritten digit images), which contains 60,000 training and 10,000 testing images. A comparison is made among the results, and it is shown that the use of a CNN led to significant improvements over different machine-learning classification algorithms. Moreover, the CNN gives an average recognition accuracy of 99.15%.

Context
The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten digit recognition. Arabic handwritten digits are written in many different handwriting styles, making it important to find and work on new and advanced solutions for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions.

Content
MADBase is a modified Arabic handwritten digits database that contains 60,000 training images and 10,000 test images. MADBase was written by 700 writers. Each writer wrote each digit (from 0 to 9) ten times. To ensure that different writing styles were included, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from http://datacenter.aucegypt.edu/shazeem/.

Acknowledgements
CNN for Handwritten Arabic Digits Recognition Based on LeNet-5
http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54
Ahmed El-Sawy, Hazem El-Bakry, Mohamed Loey
Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016
Volume 533 of the series Advances in Intelligent Systems and Computing, pp 566-575

Inspiration
Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, dots number and position. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in isolated position.
Arabic Handwritten Characters Dataset
https://www.kaggle.com/mloey1/ahcd1
Benha University
http://bu.edu.eg/staff/mloey
https://mloey.github.io/
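
As a quick consistency check on the MADBase figures quoted above (700 writers, ten digits, ten repetitions each), the arithmetic can be written out in a few lines of Python. The variable names are illustrative only, and the check says nothing about how the images were actually divided between the two splits.

```python
writers = 700
digits = 10        # digits 0 through 9
repetitions = 10   # each writer wrote each digit ten times

# Total number of digit images implied by the collection protocol.
total_images = writers * digits * repetitions

# Split sizes quoted in the description.
train_images, test_images = 60_000, 10_000
splits_match_total = (train_images + test_images) == total_images
```

The product of the three collection figures equals the sum of the stated training and test set sizes, so the numbers in the description are mutually consistent.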
Thai Shopping List OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Thai Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/thai-shopping-list-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Thai Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.
Dataset Content & Diversity:
Containing more than 2000 images, this Thai OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, and individual item name words, quantity, comments, etc on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Thai text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and images were captured by native Thai people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.
Update & Custom Collection:
We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License:
This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.
Doctors Prescriptions Handwriting Dataset
universe.roboflow.com
zip
Updated Jun 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daffodil International University (2023). Doctors Prescriptions Handwriting Dataset [Dataset]. https://universe.roboflow.com/daffodil-international-university-s5vpr/doctors-prescriptions-handwriting/model/1
Explore at:
Available download formats: zip
Dataset updated
Jun 24, 2023
Dataset authored and provided by
Daffodil International University
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Words Bounding Boxes
Description
Here are a few use cases for this project:

Healthcare Automation: The model can be used to digitize handwritten medical prescriptions thus reducing manual transcription errors and streamlining the process in pharmacies and hospitals.

Historical Document Digitization: This model could be utilized for transcribing old handwritten medical documents for research purposes.

Handwriting Analysis Tool: The model can be used for general handwriting analysis purposes, for example in educational institutions to improve handwriting recognition or in forensic analysis.

OCR Software Improvement: This model can be integrated with OCR (Optical Character Recognition) software to enhance its performance in recognizing and interpreting handwritten text, capitalizing on the diverse range of characters available.

Medical Informatics Studies: Researchers using digital health records for epidemiological studies can utilize this model to extract data from handwritten prescriptions or doctor's notes.
Data from: The IAM-database: an English sentence database for offline...
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). The IAM-database: an English sentence database for offline handwriting recognition [Dataset]. https://service.tib.eu/ldmservice/dataset/the-iam-database--an-english-sentence-database-for-offline-handwriting-recognition
Explore at:
Dataset updated
Dec 16, 2024
Description
The IAM-database: an English sentence database for offline handwriting recognition.
IBM-Crosspad on-line handwriting database in STK format - donated...
zenodo.org
Updated Jan 24, 2020
Cite
IBM (2020). IBM-Crosspad on-line handwriting database in STK format - donated exclusively to University of Groningen [Dataset]. http://doi.org/10.5281/zenodo.1195853
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.1195853
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
IBM
Area covered
Groningen
Description
This data was donated to AI Dept. RuG by IBM in 2007

It contains *.STK files (ASCII) containing on-line
handwriting pen-tip coordinates (x,y). The format
can be converted to unipen.

For internal use at RuG only.

Lambert Schomaker
Handwriting Dataset
universe.roboflow.com
zip
Updated Oct 9, 2024
Cite
Tibetan (2024). Handwriting Dataset [Dataset]. https://universe.roboflow.com/tibetan/handwriting-cavdy
Explore at:
Available download formats: zip
Dataset updated
Oct 9, 2024
Dataset authored and provided by
Tibetan
Variables measured
1 Bounding Boxes
Description
Handwriting

Overview
Handwriting is a dataset for object detection tasks - it contains 1 annotations for 1,000 images.

Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
Handwritten Letters Dataset
universe.roboflow.com
zip
Updated Mar 2, 2024
Cite
Workspace (2024). Handwritten Letters Dataset [Dataset]. https://universe.roboflow.com/workspace-qazxh/handwritten-letters-nkl2g
Explore at:
Available download formats: zip
Dataset updated
Mar 2, 2024
Dataset authored and provided by
Workspace
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Letters
Description
Handwritten Letters

Overview
Handwritten Letters is a dataset for classification tasks - it contains Letters annotations for 3,410 images.

Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License
This dataset is available under the CC BY 4.0 license.
Urdu Handwritten Text Dataset
gts.ai
jpg, png
Updated Sep 6, 2024
Cite
GTS (2024). Urdu Handwritten Text Dataset [Dataset]. https://gts.ai/dataset-download/urdu-handwritten-text-dataset/
Explore at:
Available download formats: png, jpg
Dataset updated
Sep 6, 2024
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The Urdu Handwritten Text Dataset contains high-quality images of handwritten Urdu text collected from native speakers across diverse demographics, including people with disabilities. The dataset covers the full Urdu character set, ligatures, diacritics, and dots, making it ideal for OCR, handwriting authentication, forensic analysis, and multilingual handwriting recognition research.
Handwriting Data to Detect Alzheimer’s Disease
kaggle.com
Updated Aug 9, 2023
Cite
Taeef Najib (2023). Handwriting Data to Detect Alzheimer’s Disease [Dataset]. https://www.kaggle.com/datasets/taeefnajib/handwriting-data-to-detect-alzheimers-disease/data
Dataset updated
Aug 9, 2023
Dataset provided by
Kaggle
Authors
Taeef Najib
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The DARWIN dataset includes handwriting data from 174 participants. The classification task consists in distinguishing Alzheimer’s disease patients from healthy people.

Creator: Francesco Fontanella

Source: https://archive.ics.uci.edu/dataset/732/darwin

The DARWIN dataset was created to allow researchers to improve the existing machine-learning methodologies for the prediction of Alzheimer's disease via handwriting analysis.

Citation Requests/Acknowledgements

N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471. https://doi.org/10.1016/j.procs.2018.10.141

N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer’s disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence, Vol. 111 (2022) 104822. https://doi.org/10.1016/j.engappai.2022.104822
1,000 People - Italian Handwriting OCR Dataset
m.nexdata.ai
nexdata.ai
Updated May 3, 2025
Cite
Nexdata (2025). 1,000 People - Italian Handwriting OCR Dataset [Dataset]. https://m.nexdata.ai/datasets/ocr/1406?source=Huggingface
Dataset updated
May 3, 2025
Dataset authored and provided by
Nexdata
Variables measured
Device, Writer, Data size, Data format, Data content, Accuracy rate, Photographic angle, Collecting environment, Population distribution
Description
The writers are Europeans who regularly write in Italian. The images were captured with a scanner at an eye-level angle. The content includes addresses, company names, and personal names. The dataset can be used for tasks such as training Italian OCR models and handwritten text recognition systems.
Egyptian-Handwriting-Dataset
huggingface.co
Updated Aug 2, 2025
Cite
Omar Diab (2025). Egyptian-Handwriting-Dataset [Dataset]. https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset
Dataset updated
Aug 2, 2025
Authors
Omar Diab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Egyptian Handwriting Dataset

A dataset of 11k+ handwritten Arabic words from Egyptian writers, extracted and tightly cropped from scanned paper forms. This dataset offers diverse handwriting samples ranging from children to elderly contributors, making it ideal for training robust Arabic handwriting recognition models.

Each form contains 6 unique words, resulting in 24 handwritten word images per form. Each word is written four times by the same writer to capture… See the full description on the dataset page: https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset.
