100+ datasets found
  1. IAM-line

    • huggingface.co
    Updated Jun 18, 2024
    Cite
    Teklia (2024). IAM-line [Dataset]. https://huggingface.co/datasets/Teklia/IAM-line
    Dataset updated
    Jun 18, 2024
    Dataset authored and provided by
    Teklia
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    IAM - line level

      Dataset Summary
    

    The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments. Note that all images are resized to a fixed height of 128 pixels.

      Languages
    

    All the documents in the dataset are written in English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    { 'image':… See the full description on the dataset page: https://huggingface.co/datasets/Teklia/IAM-line.

  2. thai_handwriting_dataset

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    iApp Technology (2024). thai_handwriting_dataset [Dataset]. https://huggingface.co/datasets/iapp/thai_handwriting_dataset
    Dataset updated
    Nov 17, 2024
    Dataset authored and provided by
    iApp Technology
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Thai Handwriting Dataset

    This dataset combines two major Thai handwriting datasets:

    • BEST 2019 Thai Handwriting Recognition dataset (train-0000.parquet)
    • Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)

      Maintainer
    

    kobkrit@iapp.co.th

      Dataset Description
    
    
    
    
    
      BEST 2019 Dataset
    

    Contains handwritten Thai text images along with their ground truth transcriptions. The images have been processed and standardized for machine learning tasks.… See the full description on the dataset page: https://huggingface.co/datasets/iapp/thai_handwriting_dataset.

  3. Handwritten synthetic dataset from the IAM

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). Handwritten synthetic dataset from the IAM [Dataset]. http://doi.org/10.25439/RMT.24309730.V1
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    Description

    This dataset was generated employing a technique of randomly crossing out words from the IAM database, utilizing several types of strokes. The ratio of cross-out words to regular words in handwritten documents can vary greatly depending on the document and context. However, typically, the number of cross-out words is small compared with regular words. To ensure a realistic ratio of regular to cross-out words in our synthetic database, 30% of samples from the IAM training set were selected. First, the bounding box of each word in a line was detected. The bounding box covers the core area of the word. Then, at random, a word is crossed out within the core area. Each line contains a randomly struck-out word at a different position. The annotation of these struck-out words was replaced with the symbol #.

    The folder has:
    s-s0 images
    Syn-trainset
    Syn-validset
    Syn_IAM_testset
    The transcription files are in the format of
    Filename, threshold label of handwritten line
    s-s0-0,157 A # to stop Mr. Gaitskell from
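
    As a rough illustration of the format above, the following Python sketch splits one transcription line into its three fields. It assumes, based only on the single example shown, that the filename ends at the first comma and the threshold ends at the first space after it. The names `LineRecord` and `parse_transcription` are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class LineRecord(NamedTuple):
    """One parsed transcription entry (hypothetical helper type)."""
    filename: str
    threshold: int
    text: str


def parse_transcription(line: str) -> LineRecord:
    # Assumption: "<filename>,<threshold> <label text>" as in the example line.
    filename, rest = line.strip().split(",", 1)
    threshold, text = rest.strip().split(" ", 1)
    return LineRecord(filename, int(threshold), text)


record = parse_transcription("s-s0-0,157 A # to stop Mr. Gaitskell from")
print(record.filename, record.threshold, record.text)
```

    The parsed `text` field keeps the `#` placeholder that marks the struck-out word.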

    Cite the following work if you use this dataset:
    "A deep learning approach to handwritten text recognition in the presence of struck-out text"
    https://ieeexplore.ieee.org/document/8961024


  4. Synthetic Dyslexia Handwriting Dataset (YOLO-Format)

    • zenodo.org
    zip
    Updated Feb 11, 2025
    Cite
    Nora Fink (2025). Synthetic Dyslexia Handwriting Dataset (YOLO-Format) [Dataset]. http://doi.org/10.5281/zenodo.14852659
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nora Fink
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).

    In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:

    • 78,275 images labeled as Normal
    • 52,196 images labeled as Reversal
    • 8,029 images labeled as Corrected

    Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify bounding boxes for each letter.

    Key Points of the Synthetic Generation Process

    1. Letter-Level Source Data
      Individual characters were sampled from the original image sets.
    2. Randomized Layout
      Letters are randomly assembled into words and lines, ensuring a wide variety of visual arrangements.
    3. Bounding Box Labels
      Each character is assigned a bounding box with (x, y, width, height) in YOLO format.
    4. Class Annotations
      Classes include 0 = Normal, 1 = Reversal, and 2 = Corrected.
    5. Preservation of Visual Characteristics
      Letters retain their key dyslexia-relevant features (e.g., reversals).
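
    To make the annotation format concrete, here is a minimal Python sketch that reads one label line and converts it to pixel coordinates. It assumes the common YOLO convention of `class x_center y_center width height` with values normalized to the image size; the description above only states "(x, y, width, height) in YOLO format", so check the actual .txt files first. The function name, the sample line and the 640 x 64 image size are made up for illustration.

```python
from dataclasses import dataclass

# Class ids as listed in the dataset description.
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}


@dataclass
class Box:
    label: str
    left: float
    top: float
    right: float
    bottom: float


def parse_yolo_line(line: str, img_w: int, img_h: int) -> Box:
    # Assumption: values are normalized and (x, y) is the box center.
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return Box(CLASS_NAMES[int(class_id)],
               xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


# Hypothetical label line and image size, not taken from the dataset.
box = parse_yolo_line("1 0.5 0.5 0.25 0.5", img_w=640, img_h=64)
print(box)
```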

    Historical References & Credits

    If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:

    • M. S. A. B. Rosli, I. S. Isa, S. A. Ramlan, S. N. Sulaiman and M. I. F. Maruzuki, "Development of CNN Transfer Learning for Dyslexia Handwriting Recognition," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 194–199, doi: 10.1109/ICCSCE52189.2021.9530971.
    • N. S. L. Seman, I. S. Isa, S. A. Ramlan, W. Li-Chih and M. I. F. Maruzuki, "Notice of Removal: Classification of Handwriting Impairment Using CNN for Potential Dyslexia Symptom," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 188–193, doi: 10.1109/ICCSCE52189.2021.9530989.
    • Isa, Iza Sazanita. CNN Comparisons Models On Dyslexia Handwriting Classification / Iza Sazanita Isa … [et Al.]. Universiti Teknologi MARA Cawangan Pulau Pinang, 2021.
    • Isa, I. S., Rahimi, W. N. S., Ramlan, S. A., & Sulaiman, S. N. (2019). Automated detection of dyslexia symptom based on handwriting image for primary school children. Procedia Computer Science, 163, 440–449.

    References to Original Data Sources

    [1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available:
    https://www.nist.gov/srd/nist-special-database-19

    [2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available:
    https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format

    Usage & Citation

    Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.

    Password Note (Original Data)

    The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.

  5. PHCD - Polish Handwritten Characters Database

    • kaggle.com
    Updated Dec 30, 2023
    Cite
    Wiktor Flis (2023). PHCD - Polish Handwritten Characters Database [Dataset]. https://www.kaggle.com/datasets/westedcrean/phcd-polish-handwritten-characters-database
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wiktor Flis
    Description


    The process for collecting this dataset was documented in the paper "Development of Extensive Polish Handwritten Characters Database for Text Recognition Research" (https://doi.org/10.12913/22998624/122567) by Mikhail Tokovarov, dr Monika Kaczorowska and dr Marek Miłosz. Link to download the original dataset: https://cs.pollub.pl/phcd/. The source fileset also contains a dataset of raw images of whole sentences written in Polish.

    Context

    PHCD (Polish Handwritten Characters Database) is a collection of handwritten texts in Polish. It was created by researchers at Lublin University of Technology for the purpose of offline handwritten text recognition. The database contains more than 530 000 images of handwritten characters. Each image is a 32x32 pixel grayscale image representing one of 89 classes (10 digits, 26 lowercase Latin letters, 26 uppercase Latin letters, 9 lowercase Polish letters, 9 uppercase Polish letters and 9 special characters), with around 6 000 examples per class.

    How to use

    This notebook contains a PyTorch example of how to load the dataset from .npz files and train a CNN model. You can also use the dataset with other frameworks, such as TensorFlow, Keras, etc.

    For .npz files, use numpy.load method.

    Contents

    The dataset contains the following:

    • dataset.npz - a file with two compressed numpy arrays:
      • "signs" - with all the images, sized 32 x 32 (grayscale)
      • "labels" - with all the labels (0-88) for examples from signs
    • label_mapping.csv - a csv file with columns label and char, mapping from ids to characters from dataset
    • images - folder with original 530 000 png images, sized 32 x 32, to use with other loading techniques
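
    As a small sketch of the numpy.load route mentioned above, the snippet below reads the two arrays by the key names listed for dataset.npz ("signs" and "labels"). Because the real file is not bundled here, it first writes a tiny stand-in archive with zero-filled images. The array shape (N, 32, 32), the dtype and the `load_phcd` helper are assumptions for illustration and should be checked against the downloaded file.

```python
import os
import tempfile

import numpy as np

# 10 digits + 26 + 26 Latin letters + 9 + 9 Polish letters + 9 special characters.
NUM_CLASSES = 10 + 26 + 26 + 9 + 9 + 9


def load_phcd(path):
    """Return (signs, labels) from an .npz archive with those two keys."""
    with np.load(path) as archive:
        signs = archive["signs"]    # images, assumed shape (N, 32, 32)
        labels = archive["labels"]  # integer class ids in 0..88
    return signs, labels


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dataset.npz")
    # Stand-in data only; the real archive holds more than 530 000 images.
    np.savez_compressed(
        demo_path,
        signs=np.zeros((4, 32, 32), dtype=np.uint8),
        labels=np.array([0, 1, 88, 42]),
    )
    signs, labels = load_phcd(demo_path)

print(signs.shape, labels.tolist(), NUM_CLASSES)
```

    With the real download, only the path passed to `load_phcd` would change.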

    Acknowledgements

    I want to express my gratitude to the following people: Dr. Edyta Łukasik for introducing me to this dataset and to authors of this dataset - Mikhail Tokovarov, dr. Monika Kaczorowska and dr. Marek Miłosz from Lublin University of Technology in Poland.

    Inspiration

    You can use this data the same way you used MNIST, KMNIST or Fashion MNIST: refine your image classification skills, use GPU & TPU to implement CNN architectures for models to perform such multiclass classifications.

  6. Dyslexia Handwriting Dataset

    • kaggle.com
    Updated Feb 8, 2022
    Cite
    DR. IZA SAZANITA ISA (2022). Dyslexia Handwriting Dataset [Dataset]. https://www.kaggle.com/datasets/drizasazanitaisa/dyslexia-handwriting-dataset
    Dataset updated
    Feb 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DR. IZA SAZANITA ISA
    Description

    Dataset

    This dataset was created by DR. IZA SAZANITA ISA


  7. A Messy Handwriting Dataset with Student Crossouts and Corrections...

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version) [Dataset]. http://doi.org/10.25439/RMT.24419986.V1
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).

    Within the central repository, there are subfolders of each document converted into lines. All images are in .png format. In the main folder there are three .txt files.

    1) SMHD.txt contains all the line-level transcriptions in the form
    image name, threshold value, label
    0001-000,178 Bombay Phenotype :-

    2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.

    3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.

    In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
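
    The '#' convention makes it straightforward to separate lines with and without cross-outs. The Python sketch below shows one way to do this, assuming every line follows the "image name,threshold label" layout of the SMHD.txt example. The `split_by_crossout` helper is hypothetical, and the second sample line is invented for illustration rather than taken from the dataset.

```python
def split_by_crossout(lines):
    """Separate transcription lines into (clean, crossed_out) lists."""
    clean, crossed = [], []
    for line in lines:
        # Drop the "image name,threshold" prefix so only the label is inspected.
        _, rest = line.strip().split(",", 1)
        label = rest.split(" ", 1)[1] if " " in rest else ""
        (crossed if "#" in label else clean).append(line)
    return clean, crossed


sample_lines = [
    "0001-000,178 Bombay Phenotype :-",  # example line from the description
    "0001-001,178 The # blood group",    # made-up line, illustration only
]
clean, crossed = split_by_crossout(sample_lines)
print(len(clean), len(crossed))
```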

    Dataset Description:

    We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.

    In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset is collected from 250 High school students. We gave them 30 minutes to think about the topic and write for this task.

    In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

    Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.

    In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.

  8. GoBo - A Handwriting Recognition dataset for Personalization

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jun 28, 2023
    Cite
    Christian Gold; Dario van den Boom; Torsten Zesch (2023). GoBo - A Handwriting Recognition dataset for Personalization [Dataset]. http://doi.org/10.5281/zenodo.8085511
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Gold; Dario van den Boom; Torsten Zesch
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset comprises the images for the personalization described in the paper Personalizing Handwriting Recognition Systems with Limited User-Specific Samples.

    Dataset Statistics (v.1.0)

    * Handwritten word-level images
    * English
    * 40 Participants
    * 5 sets from different sources for personalization
    * 2 sets from 2 domains (same domains as 2 personalization sets) for testing
    * 926 words/writer, 37k words in total

    More details can be found on the Github Repository:
    Github GoBo


    Model
    gobo_Baselinemodel.hdf5

  9. hebrew-handwritten-dataset

    • huggingface.co
    Updated May 10, 2023
    Cite
    Sivan Ratson (2023). hebrew-handwritten-dataset [Dataset]. https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset
    Dataset updated
    May 10, 2023
    Authors
    Sivan Ratson
    License

    Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
    License information was derived automatically

    Description

    Dataset Information

      Keywords
    

    Hebrew, handwritten, letters

      Description
    

    HHD_v0 consists of images of isolated Hebrew characters together with training and test sets subdivision. The images were collected from hand-filled forms. For more details, please refer to [1]. When using this dataset in research work, please cite [1]. [1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting… See the full description on the dataset page: https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset.

  10. Arabic Handwritten Digits Dataset

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Mohamed Loey (2023). Arabic Handwritten Digits Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12236948.v1
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Mohamed Loey
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Arabic Handwritten Digits Dataset

    Abstract

    In recent years, handwritten digit recognition has been an important area due to its applications in several fields. This work focuses on the recognition part of handwritten Arabic digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits. LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database (Arabic handwritten digit images), which contains 60,000 training and 10,000 testing images. A comparison is held amongst the results, and it is shown that the use of the CNN led to significant improvements across different machine-learning classification algorithms. Moreover, the CNN gives an average recognition accuracy of 99.15%.

    Context

    The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten digit recognition. In recent years, Arabic handwritten digit recognition has also had to deal with different handwriting styles, making it important to find and work on a new and advanced solution for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions.

    Content

    MADBase is a modified Arabic handwritten digits database that contains 60,000 training images and 10,000 test images. MADBase was written by 700 writers. Each writer wrote each digit (from 0 to 9) ten times. To ensure that different writing styles are included, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from (http://datacenter.aucegypt.edu/shazeem/).

    Acknowledgements

    CNN for Handwritten Arabic Digits Recognition Based on LeNet-5
    http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54
    Ahmed El-Sawy, Hazem El-Bakry, Mohamed Loey
    Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016
    Volume 533 of the series Advances in Intelligent Systems and Computing, pp 566-575

    Inspiration

    Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, and the number and position of dots. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in isolated position.

    Arabic Handwritten Characters Dataset: https://www.kaggle.com/mloey1/ahcd1
    Benha University: http://bu.edu.eg/staff/mloey
    https://mloey.github.io/

  11. Thai Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Thai Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/thai-shopping-list-ocr-image-dataset
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Thai Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this Thai OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, and individual item name words, quantity, comments, etc on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow limited (less than three) unique images in a single handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of space contains visible Thai text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these shopping lists were written and images were captured by native Thai people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.

  12. Doctors Prescriptions Handwriting Dataset

    • universe.roboflow.com
    zip
    Updated Jun 24, 2023
    Cite
    Daffodil International University (2023). Doctors Prescriptions Handwriting Dataset [Dataset]. https://universe.roboflow.com/daffodil-international-university-s5vpr/doctors-prescriptions-handwriting/model/1
    Dataset updated
    Jun 24, 2023
    Dataset authored and provided by
    Daffodil International University
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Words Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Healthcare Automation: The model can be used to digitize handwritten medical prescriptions thus reducing manual transcription errors and streamlining the process in pharmacies and hospitals.

    2. Historical Document Digitization: This model could be utilized for transcribing old handwritten medical documents for research purposes.

    3. Handwriting Analysis Tool: The model can be used for general handwriting analysis purposes, for example in educational institutions to improve handwriting recognition or in forensic analysis.

    4. OCR Software Improvement: This model can be integrated with OCR (Optical Character Recognition) software to enhance its performance in recognizing and interpreting handwritten text, capitalizing on the diverse range of characters available.

    5. Medical Informatics Studies: Researchers using digital health records for epidemiological studies can utilize this model to extract data from handwritten prescriptions or doctor's notes.

  13. Data from: The IAM-database: an English sentence database for offline...

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). The IAM-database: an English sentence database for offline handwriting recognition [Dataset]. https://service.tib.eu/ldmservice/dataset/the-iam-database--an-english-sentence-database-for-offline-handwriting-recognition
    Dataset updated
    Dec 16, 2024
    Description

    The IAM-database: an English sentence database for offline handwriting recognition.

  14. IBM-Crosspad on-line handwriting database in STK format - donated...

    • zenodo.org
    Updated Jan 24, 2020
    Cite
    IBM (2020). IBM-Crosspad on-line handwriting database in STK format - donated exclusively to University of Groningen [Dataset]. http://doi.org/10.5281/zenodo.1195853
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    IBM
    Area covered
    Groningen
    Description

    This data was donated to AI Dept. RuG by IBM in 2007

    It contains *.STK files (ASCII) containing on-line
    handwriting pen-tip coordinates (x,y). The format
    can be converted to unipen.

    For internal use at RuG only.


    Lambert Schomaker

  15. Handwriting Dataset

    • universe.roboflow.com
    zip
    Updated Oct 9, 2024
    Cite
    Tibetan (2024). Handwriting Dataset [Dataset]. https://universe.roboflow.com/tibetan/handwriting-cavdy
    Dataset updated
    Oct 9, 2024
    Dataset authored and provided by
    Tibetan
    Variables measured
    1 Bounding Boxes
    Description

    Handwriting

    ## Overview
    
    Handwriting is a dataset for object detection tasks - it contains 1 annotations for 1,000 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  16. Handwritten Letters Dataset

    • universe.roboflow.com
    zip
    Updated Mar 2, 2024
    + more versions
    Cite
    Workspace (2024). Handwritten Letters Dataset [Dataset]. https://universe.roboflow.com/workspace-qazxh/handwritten-letters-nkl2g
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Workspace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Letters
    Description

    Handwritten Letters

    ## Overview
    
    Handwritten Letters is a dataset for classification tasks - it contains Letters annotations for 3,410 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Urdu Handwritten Text Dataset

    • gts.ai
    jpg, png
    Updated Sep 6, 2024
    Cite
    GTS (2024). Urdu Handwritten Text Dataset [Dataset]. https://gts.ai/dataset-download/urdu-handwritten-text-dataset/
    Explore at:
    Available download formats: png, jpg
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Urdu Handwritten Text Dataset contains high-quality images of handwritten Urdu text collected from native speakers across diverse demographics, including people with disabilities. The dataset covers the full Urdu character set, ligatures, diacritics, and dots, making it ideal for OCR, handwriting authentication, forensic analysis, and multilingual handwriting recognition research.

  18. Handwriting Data to Detect Alzheimer’s Disease

    • kaggle.com
    Updated Aug 9, 2023
    Cite
    Taeef Najib (2023). Handwriting Data to Detect Alzheimer’s Disease [Dataset]. https://www.kaggle.com/datasets/taeefnajib/handwriting-data-to-detect-alzheimers-disease/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Kaggle
    Authors
    Taeef Najib
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The DARWIN dataset includes handwriting data from 174 participants. The classification task consists in distinguishing Alzheimer’s disease patients from healthy people.

    Creator: Francesco Fontanella

    Source: https://archive.ics.uci.edu/dataset/732/darwin

    The DARWIN dataset was created to allow researchers to improve the existing machine-learning methodologies for the prediction of Alzheimer's disease via handwriting analysis.

    Citation Requests/Acknowledgements

    N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471. https://doi.org/10.1016/j.procs.2018.10.141

    N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer’s disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence, Vol. 111 (2022) 104822. https://doi.org/10.1016/j.engappai.2022.104822

  19. 1,000 People - Italian Handwriting OCR Dataset

    • m.nexdata.ai
    • nexdata.ai
    Updated May 3, 2025
    Cite
    Nexdata (2025). 1,000 People - Italian Handwriting OCR Dataset [Dataset]. https://m.nexdata.ai/datasets/ocr/1406?source=Huggingface
    Explore at:
    Dataset updated
    May 3, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Writer, Data size, Data format, Data content, Accuracy rate, Photographic angle, Collecting environment, Population distribution
    Description

    The writers are Europeans who regularly write in Italian. The images were captured with a scanner at an eye-level angle. The content includes addresses, company names, and personal names. The dataset can be used for tasks such as training Italian OCR models and handwritten text recognition systems.

  20. Egyptian-Handwriting-Dataset

    • huggingface.co
    Updated Aug 2, 2025
    Cite
    Omar Diab (2025). Egyptian-Handwriting-Dataset [Dataset]. https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset
    Explore at:
    Dataset updated
    Aug 2, 2025
    Authors
    Omar Diab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Egyptian Handwriting Dataset

    A dataset of 11k+ handwritten Arabic words from Egyptian writers, extracted and tightly cropped from scanned paper forms. This dataset offers diverse handwriting samples ranging from children to elderly contributors, making it ideal for training robust Arabic handwriting recognition models.

    Each form contains 6 unique words, resulting in 24 handwritten word images per form. Each word is written four times by the same writer to capture… See the full description on the dataset page: https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset.
