100+ datasets found
  1. IAM Handwriting Top50

    • kaggle.com
    zip
    Updated Jun 30, 2018
    Cite
    TejasReddy (2018). IAM Handwriting Top50 [Dataset]. https://www.kaggle.com/datasets/tejasreddy/iam-handwriting-top50
    Explore at:
    Available download formats: zip (196047805 bytes)
    Dataset updated
    Jun 30, 2018
    Authors
    TejasReddy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The IAM Handwriting Dataset is a collection of handwritten passages by several writers. It is generally used to classify writers according to their writing styles. A traditional way of solving this problem is to extract features such as letter spacing and curvature and feed them into Support Vector Machines. I wanted to solve it with deep learning instead, using Keras and TensorFlow. For that purpose the full IAM Handwriting Dataset is not needed; a representative subset is enough for training, such as the images from the 50 writers who contributed the most to the dataset.

    Content

    This dataset contains one image per handwritten sentence. Filenames are dash-separated: the first field is the test code, the second the writer id, the third the passage id, and the fourth the sentence id.
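
The four-field filename layout described above can be split mechanically. The following is a minimal sketch, assuming each filename has exactly four dash-separated fields plus an optional extension; the example filename `a01-000-01-02.png` and the names `SentenceImageName` and `parse_sentence_filename` are hypothetical and not taken from the dataset.

```python
from typing import NamedTuple


class SentenceImageName(NamedTuple):
    """The four dash-separated fields named in the description."""
    test_code: str
    writer_id: str
    passage_id: str
    sentence_id: str


def parse_sentence_filename(filename: str) -> SentenceImageName:
    """Split a filename into its four fields (assumed layout, see lead-in)."""
    # Drop the extension, if any, before splitting on dashes.
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    fields = stem.split("-")
    if len(fields) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(fields)}: {filename!r}")
    return SentenceImageName(*fields)


# Hypothetical filename, used only to illustrate the call.
parsed = parse_sentence_filename("a01-000-01-02.png")
print(parsed.writer_id)
```

Grouping images by `writer_id` is then enough to build per-writer class labels for the writer-classification task mentioned in the Context section.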

    Acknowledgements

    This dataset would not be here without the help of FKI Computer Vision and Artificial Intelligence, as I came across the IAM Handwriting dataset on their website.

    Inspiration

    I would like to see people use this data for more insights, exploratory notebooks, and more, because handwriting recognition is not an easy task to tackle alone. I need you Kagglers to have a look at it.

  2. IamOnDB Handwriting Dataset

    • kaggle.com
    Updated Oct 24, 2022
    Cite
    Tey Kai Cong (2022). IamOnDB Handwriting Dataset [Dataset]. https://www.kaggle.com/datasets/teykaicong/iamondb-handwriting-dataset
    Explore at:
    Dataset updated
    Oct 24, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Tey Kai Cong
    Description

    Structure

    words.tgz: Contains the word images (example: a01/a01-122/a01-122-s01-02.png).
    xml.tgz: Contains the meta-information in XML format (example: a01-122.xml).
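
The two example paths above suggest that a word image sits in a directory named after its form, and that the form's metadata file carries the same name. The sketch below relies on that assumption, which is inferred only from the single example pair; the function name `xml_name_for_word` is hypothetical.

```python
from pathlib import PurePosixPath


def xml_name_for_word(word_path: str) -> str:
    """Return the XML metadata filename assumed to describe a word image.

    Assumption: the image's parent directory is the form id
    (a01/a01-122/a01-122-s01-02.png -> a01-122 -> a01-122.xml).
    """
    form_id = PurePosixPath(word_path).parent.name
    return f"{form_id}.xml"


print(xml_name_for_word("a01/a01-122/a01-122-s01-02.png"))  # a01-122.xml
```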

    Terms of usage

    The IAM Handwriting Database is publicly accessible and freely available for non-commercial research purposes. If you are using data from the IAM Handwriting Database, we request you to register, so we are aware of who is using our data. If you are publishing scientific work based on the IAM Handwriting Database, we request you to include a reference to the paper.

    Original Link

    https://fki.tic.heia-fr.ch/databases/download-the-iam-handwriting-database

  3. Handwritten synthetic dataset from the IAM

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). Handwritten synthetic dataset from the IAM [Dataset]. http://doi.org/10.25439/RMT.24309730.V1
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    Description

    This dataset was generated by randomly crossing out words from the IAM database using several types of strokes. The ratio of crossed-out words to regular words in handwritten documents can vary greatly depending on the document and context, but the number of crossed-out words is typically small compared with regular words. To ensure a realistic ratio of regular to crossed-out words in our synthetic database, 30% of samples from the IAM training set were selected. First, the bounding box of each word in a line was detected; the bounding box covers the core area of the word. Then a randomly chosen word was crossed out within the core area, so each line contains a struck-out word at a different position. The annotation of these struck-out words was replaced with the symbol #.

    The folder contains:
    • s-s0 images
    • Syn-trainset
    • Syn-validset
    • Syn_IAM_testset

    The transcription files are in the format
    Filename, threshold label of handwritten line
    for example:
    s-s0-0,157 A # to stop Mr. Gaitskell from
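
A transcription line in this format can be split into its three parts. This is a minimal sketch, assuming the filename ends at the first comma and the threshold is separated from the label by a single space, as in the example line; the names `LineRecord` and `parse_transcription_line` are hypothetical.

```python
from typing import NamedTuple


class LineRecord(NamedTuple):
    filename: str
    threshold: int
    label: str


def parse_transcription_line(line: str) -> LineRecord:
    """Parse 'filename,threshold label' (assumed separators, see lead-in)."""
    # Only the first comma is a separator; the label may contain commas.
    filename, rest = line.rstrip("\n").split(",", 1)
    # The threshold runs up to the first space; the remainder is the label.
    threshold, _, label = rest.partition(" ")
    return LineRecord(filename.strip(), int(threshold), label)


record = parse_transcription_line("s-s0-0,157 A # to stop Mr. Gaitskell from")
print(record.filename, record.threshold, record.label)
```

Splitting only on the first comma keeps any commas inside the handwritten text intact.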

    Cite the below work if you have used this dataset:
    "A deep learning approach to handwritten text recognition in the presence of struck-out text"
    https://ieeexplore.ieee.org/document/8961024


  4. hebrew-handwritten-dataset

    • huggingface.co
    Updated May 10, 2023
    + more versions
    Cite
    Sivan Ratson (2023). hebrew-handwritten-dataset [Dataset]. https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset
    Explore at:
    Dataset updated
    May 10, 2023
    Authors
    Sivan Ratson
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset Information

      Keywords
    

    Hebrew, handwritten, letters

      Description
    

    HDD_v0 consists of images of isolated Hebrew characters together with training and test sets subdivision. The images were collected from hand-filled forms. For more details, please refer to [1]. When using this dataset in research work, please cite [1]. [1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting… See the full description on the dataset page: https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset.

  5. A Messy Handwriting Dataset with Student Crossouts and Corrections...

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version) [Dataset]. http://doi.org/10.25439/RMT.24419986.V1
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).

    Within the central repository, there are subfolders of each document converted into lines. All images are in .png format. In the main folder there are three .txt files.

    1) SMHD.txt contains all the line-level transcriptions in the form
    image name, threshold value, label
    0001-000,178 Bombay Phenotype :-

    2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.

    3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.

    In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
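
Because crossed-out content is marked with '#', transcription lines can be separated into those with and without cross-outs by looking at the label. The sketch below assumes the line layout shown for SMHD.txt (image name, comma, threshold, space, label); the function name `split_by_crossouts` and the second sample line are hypothetical.

```python
def split_by_crossouts(lines):
    """Return (clean, crossed): image names without and with '#' in the label."""
    clean, crossed = [], []
    for line in lines:
        # Image name ends at the first comma; label starts after the threshold.
        name, _, rest = line.strip().partition(",")
        _, _, label = rest.partition(" ")
        (crossed if "#" in label else clean).append(name)
    return clean, crossed


sample = [
    "0001-000,178 Bombay Phenotype :-",   # example line from the description
    "0001-001,180 The # gene is absent",  # hypothetical line with a cross-out
]
clean, crossed = split_by_crossouts(sample)
print(clean, crossed)
```

A literal '#' that a student actually wrote would be indistinguishable from the cross-out marker under this check, so the result is only as reliable as the transcription rules.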

    Dataset Description:

    We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
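
For image processing it can help to know roughly how the ruled lines appear in the scans. The following is a small sketch of that conversion, assuming "pt" means the typographic point of 1/72 inch (the description does not say which point definition was used); the function name `points_to_pixels` is hypothetical.

```python
POINTS_PER_INCH = 72  # assumption: desktop-publishing point, 1/72 inch


def points_to_pixels(points: float, dpi: int = 300) -> float:
    """Convert a length in points to pixels at the given scan resolution."""
    return points * dpi / POINTS_PER_INCH


print(points_to_pixels(1.5))  # ruled-line height in pixels at 300 dpi
print(points_to_pixels(40))   # spacing between ruled lines in pixels at 300 dpi
```

Under this assumption a ruled line is about 6 pixels tall and successive lines are about 167 pixels apart, which gives a starting point for line-removal or line-segmentation parameters.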

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.

    In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset was collected from 250 high school students. We gave them 30 minutes to think about the topic and write for this task.

    In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

    Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.

    In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.

  6. Egyptian-Handwriting-Dataset

    • huggingface.co
    Updated Aug 2, 2025
    Cite
    Omar Diab (2025). Egyptian-Handwriting-Dataset [Dataset]. https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset
    Explore at:
    Dataset updated
    Aug 2, 2025
    Authors
    Omar Diab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Egyptian Handwriting Dataset

    A dataset of 11k+ handwritten Arabic words from Egyptian writers, extracted and tightly cropped from scanned paper forms. This dataset offers diverse handwriting samples ranging from children to elderly contributors, making it ideal for training robust Arabic handwriting recognition models.

    Each form contains 6 unique words, resulting in 24 handwritten word images per form. Each word is written four times by the same writer to capture… See the full description on the dataset page: https://huggingface.co/datasets/OmarMDiab/Egyptian-Handwriting-Dataset.

  7. CHoiCe: A Complex Handwritten Character dataset

    • researchdata.edu.au
    Updated Feb 10, 2021
    Cite
    The Australian National University (2021). CHoiCe: A Complex Handwritten Character dataset [Dataset]. http://doi.org/10.25911/602355a95f787
    Explore at:
    Dataset updated
    Feb 10, 2021
    Dataset provided by
    The Australian National University
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The data were collected from Professor Tom Gedeon and from the complete handwriting paper of the CEDAR handwriting dataset. The CHoiCe dataset contains cursive handwritten characters in 62 classes, "0-9, a-z, A-Z"; each class has at least 40 pictures in both the original data and the binary data. Each picture is a 28x28 ".png" image. The 62 classes correspond to the files "0" to "61" in the order given in "label.txt". The dataset is divided into two parts: the unprocessed original images are stored in "0" to "61" in the "V0.3/data" folder, and the binarized images are stored in "0" to "61" in the "V0.3/data-bin" folder.
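
The mapping from the numbered entries "0" to "61" to characters can be sketched as a lookup. This sketch assumes the order is exactly digits, then lowercase, then uppercase, as the description's "0-9, a-z, A-Z" suggests; the authoritative order is whatever "label.txt" says, and the name `class_name_to_char` is hypothetical.

```python
import string

# Assumed class order: 0-9, then a-z, then A-Z (62 classes in total).
CLASS_CHARS = string.digits + string.ascii_lowercase + string.ascii_uppercase


def class_name_to_char(class_name: str) -> str:
    """Map an entry name such as "10" to its character under the assumed order."""
    index = int(class_name)
    if not 0 <= index < len(CLASS_CHARS):
        raise ValueError(f"class index out of range: {class_name!r}")
    return CLASS_CHARS[index]


print(class_name_to_char("0"), class_name_to_char("10"), class_name_to_char("61"))
```

If "label.txt" lists a different order, replacing `CLASS_CHARS` with the characters read from that file keeps the rest of the lookup unchanged.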

  8. Khayyam Offline Persian Handwriting Dataset

    • dataverse.harvard.edu
    Updated Aug 8, 2025
    Cite
    Pourya Jafarzadeh; Vahid Mohammadi Safarzadeh (2025). Khayyam Offline Persian Handwriting Dataset [Dataset]. http://doi.org/10.7910/DVN/WYRTKS
    Explore at:
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Pourya Jafarzadeh; Vahid Mohammadi Safarzadeh
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/WYRTKS

    Description

    Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples, which are rare in the currently available datasets. The Khayyam dataset contains 44000 words, 60000 letters, and 6000 digits. The forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digit, letter, and word data and the results are reported. This dataset is available for research and academic use.

  9. GoBo - A Handwriting Recognition dataset for Personalization

    • data.niaid.nih.gov
    Updated Jun 28, 2023
    Cite
    Gold, Christian; van den Boom, Dario; Zesch, Torsten (2023). GoBo - A Handwriting Recognition dataset for Personalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8085510
    Explore at:
    Dataset updated
    Jun 28, 2023
    Authors
    Gold, Christian; van den Boom, Dario; Zesch, Torsten
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises the images for the personalization described in the paper Personalizing Handwriting Recognition Systems with Limited User-Specific Samples.

    Dataset Statistics (v.1.0)

    • Handwritten word-level images
    • English
    • 40 Participants
    • 5 sets from different sources for personalization
    • 2 sets from 2 domains (same domains as 2 personalization sets) for testing
    • 926 words/writer, 37k words in total

    More details can be found on the Github Repository: Github GoBo

    Model gobo_Baselinemodel.hdf5

  10. IAM FORMS DATA

    • kaggle.com
    zip
    Updated Nov 4, 2024
    Cite
    Gwachat Kozah (2024). IAM FORMS DATA [Dataset]. https://www.kaggle.com/datasets/gwachatkozah/iam-forms-dataset
    Explore at:
    Available download formats: zip (4631826955 bytes)
    Dataset updated
    Nov 4, 2024
    Authors
    Gwachat Kozah
    Description

    Dataset

    This dataset was created by Gwachat Kozah

    Contents

  11. Data from: The IAM-database: an English sentence database for offline...

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    U.-V. Marti; H. Bunke (2024). The IAM-database: an English sentence database for offline handwriting recognition [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdGhlLWlhbS1kYXRhYmFzZS0tYW4tZW5nbGlzaC1zZW50ZW5jZS1kYXRhYmFzZS1mb3Itb2ZmbGluZS1oYW5kd3JpdGluZy1yZWNvZ25pdGlvbg==
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    U.-V. Marti; H. Bunke
    Description

    The IAM-database: an English sentence database for offline handwriting recognition.

  12. Student Messy Handwritten Dataset (SMHD)

    • research-repository.rmit.edu.au
    • researchdata.edu.au
    application/x-rar
    Updated Oct 16, 2023
    Cite
    Hiqmat Nisa; James Thom; Vic ciesielski; Ruwan Tennakoon (2023). Student Messy Handwritten Dataset (SMHD) [Dataset]. http://doi.org/10.25439/rmt.24312715.v1
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    RMIT University
    Authors
    Hiqmat Nisa; James Thom; Vic ciesielski; Ruwan Tennakoon
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.

    Moreover, there is a document detailing the transcription rules used for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.

    Dataset Description:

    We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.

    In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset was collected from 250 high school students. We gave them 30 minutes to think about the topic and write for this task.

    In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

    Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to summary-based and academic-based collections.

    In all four exercises, we did not impose any rules on them, for example, spacing, usage of a pen, etc. We asked them to cross out the text if it seemed inappropriate. Although usually writers made corrections in a second read, we also gave an extra 5 minutes for correction purposes.

  13. thai_handwriting_dataset

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    iApp Technology (2024). thai_handwriting_dataset [Dataset]. https://huggingface.co/datasets/iapp/thai_handwriting_dataset
    Explore at:
    Dataset updated
    Nov 17, 2024
    Dataset authored and provided by
    iApp Technology
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Thai Handwriting Dataset

    This dataset combines two major Thai handwriting datasets:

    • BEST 2019 Thai Handwriting Recognition dataset (train-0000.parquet)
    • Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)
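
Since the two sources are separated by shard number, a shard filename is enough to tell which source a record came from. This is a minimal sketch, assuming every shard is named `train-NNNN.parquet` with a four-digit index as in the two names above; the function name `source_for_shard` is hypothetical.

```python
import re

SHARD_PATTERN = re.compile(r"train-(\d{4})\.parquet")


def source_for_shard(shard_name: str) -> str:
    """Return the originating dataset for a shard, per the split described above."""
    match = SHARD_PATTERN.fullmatch(shard_name)
    if match is None:
        raise ValueError(f"unexpected shard name: {shard_name!r}")
    # Shard 0000 holds BEST 2019; every later shard holds the Wang dataset.
    if int(match.group(1)) == 0:
        return "BEST 2019 Thai Handwriting Recognition dataset"
    return "Thai Handwritten Free Dataset by Wang"


print(source_for_shard("train-0000.parquet"))
print(source_for_shard("train-0003.parquet"))
```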

      Maintainer
    

    kobkrit@iapp.co.th

      Dataset Description
    
    
    
    
    
      BEST 2019 Dataset
    

    Contains handwritten Thai text images along with their ground truth transcriptions. The images have been processed and standardized for machine learning tasks.… See the full description on the dataset page: https://huggingface.co/datasets/iapp/thai_handwriting_dataset.

  14. IBM-Crosspad on-line handwriting database in STK format - donated...

    • zenodo.org
    Updated Jan 24, 2020
    Cite
    IBM; IBM (2020). IBM-Crosspad on-line handwriting database in STK format - donated exclusively to University of Groningen [Dataset]. http://doi.org/10.5281/zenodo.1195853
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    IBM; IBM
    Area covered
    Groningen
    Description

    This data was donated to AI Dept. RuG by IBM in 2007

    It contains *.STK files (ASCII) containing on-line
    handwriting pen-tip coordinates (x,y). The format
    can be converted to unipen.

    For internal use at RuG only.


    Lambert Schomaker

  15. Thai Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Thai Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/thai-shopping-list-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Thai Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this Thai OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, and comments. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Thai text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these shopping lists were written and images were captured by native Thai people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.

  16. Russian Handwritings Tracked

    • data.mendeley.com
    • kaggle.com
    Updated Jan 3, 2023
    Cite
    Dmitry Iatsenko (2023). Russian Handwritings Tracked [Dataset]. http://doi.org/10.17632/3h6h5d7xg2.2
    Explore at:
    Dataset updated
    Jan 3, 2023
    Authors
    Dmitry Iatsenko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We created a character dataset by collecting samples from 12 writers. Each writer contributed letters (lowercase and uppercase), digits, and words from a pangram. The words were not used in our experiments, but they are included in the "extra" folder for each writer in this database. Up to 4 samples were collected for each writer/character pair, and this version of the database contains 2812 samples in total.

    Database structure:

    • scanner.py - character scanning program, dataset collection.
    • convert2mnist.py - a program for converting a dataset into a mnist-like form. It is intended for an example with the test.
    • example_using.py - example of a primitive grid for character recognition. It is intended only to demonstrate the consistency of the dataset. When using the dataset, of course, the user can and will use their own, more advanced approaches.
    • data - folder with dataset.
      • w_n_m - folder with writer's attempt (in total 37 folders)
        • [char] - the main file of the symbol track, a text file with a list of coordinates of the form - "x1","y1","x2","y2",...,"xN","yN".
        • [char]_times - a file with additional information on the track with a list of time in ms between receiving coordinates of points.
        • [char].png - an auxiliary file, a picture of the symbol as it was visible to the writer. The file is for understanding only.
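
The track layout "x1","y1","x2","y2",... can be turned into a list of points. The following is a minimal sketch, assuming a track file is a single comma-separated sequence of quoted numbers as described; the function name `parse_track` and the sample values are hypothetical, and the repository's own scripts may read the files differently.

```python
import csv
from io import StringIO


def parse_track(text: str) -> list[tuple[float, float]]:
    """Parse '"x1","y1","x2","y2",...' into [(x1, y1), (x2, y2), ...]."""
    # csv handles the quoting; skipinitialspace tolerates ', ' separators.
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    values = [float(field) for row in reader for field in row if field.strip()]
    if len(values) % 2 != 0:
        raise ValueError("odd number of coordinates in track")
    # Pair even-indexed values (x) with odd-indexed values (y).
    return list(zip(values[0::2], values[1::2]))


# Hypothetical track content, for illustration only.
points = parse_track('"10","20","11","22","13","25"')
print(points)
```

The same reader could be applied to a [char]_times file to obtain the intervals in milliseconds, which would then line up with the gaps between successive points.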

    Class distribution in example_using.py, which you can find in github repository provided below:

    [A] = { "а" , "А" } [Б] = { "б" , "Б" } [В] = { "в" , "В" } [Г] = { "г" , "Г" } [Д] = { "д" , "Д" } [Е] = { "е" , "Е" } [Ё] = { "ё" , "Ё" } [Ж] = { "ж" , "Ж" } [З] = { "з" , "З" } [И] = { "и" , "И" } [Й] = { "й" , "Й" } [К] = { "к" , "К" } [Л] = { "л" , "Л" } [М] = { "м" , "М" } [Н] = { "н" , "Н" } [О] = { "о" , "О", "0" } [П] = { "п" , "П" } [Р] = { "р" , "Р" } [С] = { "с" , "С" } [Т] = { "т" , "Т" } [У] = { "у" , "У" } [Ф] = { "ф" , "Ф" } [Х] = { "х" , "Х" } [Ц] = { "ц" , "Ц" } [Ч] = { "ч" , "Ч" } [Ш] = { "ш" , "Ш" } [Щ] = { "щ" , "Щ" } [Ъ] = { "ъ" , "Ъ" } [Ы] = { "ы" , "Ы" } [Ь] = { "ь" , "Ь" } [Э] = { "э" , "Э" } [Ю] = { "ю" , "Ю" } [Я] = { "я" , "Я" } [1] = { "1" } [2] = { "2" } [3] = { "3" } [4] = { "4" } [5] = { "5" } [6] = { "6" } [7] = { "7" } [8] = { "8" } [9] = { "9" }
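
The class distribution above merges the lowercase and uppercase forms of each letter and folds the digit "0" into the class of the letter О. A sketch of that grouping follows, assuming each class is labelled by the uppercase Cyrillic letter; the name `build_class_map` is hypothetical and this is not the code from example_using.py.

```python
# The 33 lowercase letters of the Russian alphabet.
LETTERS = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def build_class_map() -> dict[str, str]:
    """Map every character in the dataset to its class label."""
    class_map = {}
    for lower in LETTERS:
        upper = lower.upper()
        # Both cases of a letter share one class.
        class_map[lower] = upper
        class_map[upper] = upper
    # The digit zero is grouped with the Cyrillic letter О.
    class_map["0"] = "О"
    # Digits 1-9 each form their own class.
    for digit in "123456789":
        class_map[digit] = digit
    return class_map


class_map = build_class_map()
print(len(set(class_map.values())))  # number of distinct classes
```

Merging visually identical shapes such as "0" and "О" avoids asking a classifier to separate characters that cannot be told apart from the pen track alone.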

  17. Handwritten Letters Dataset

    • universe.roboflow.com
    zip
    Updated Mar 2, 2024
    Cite
    Workspace (2024). Handwritten Letters Dataset [Dataset]. https://universe.roboflow.com/workspace-qazxh/handwritten-letters-nkl2g
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Workspace
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Letters
    Description

    Handwritten Letters

    ## Overview
    
    Handwritten Letters is a dataset for classification tasks - it contains Letters annotations for 3,410 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. 14,511 Images English Handwriting OCR Dataset

    • nexdata.ai
    • m.nexdata.ai
    Updated Sep 29, 2023
    Cite
    Nexdata (2023). 14,511 Images English Handwriting OCR Dataset [Dataset]. https://www.nexdata.ai/datasets/ocr/1215
    Explore at:
    Dataset updated
    Sep 29, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Data format, Data content, Photographic angle, Collecting environment, Population distribution, Nationality distribution
    Description

    The text carriers are A4 paper, lined paper, English paper, etc. The images were captured with a cellphone at an eye-level angle. The dataset content includes English compositions, poetry, prose, news, stories, etc. The annotations consist of line-level quadrilateral bounding boxes and transcriptions of the text. The dataset can be used for tasks such as English handwriting OCR.

  19. iam_handwriting_word_database

    • kaggle.com
    zip
    Updated May 18, 2021
    Cite
    last_theorem (2021). iam_handwriting_word_database [Dataset]. https://www.kaggle.com/datasets/nibinv23/iam-handwriting-word-database/
    Explore at:
    Available download formats: zip (1184020415 bytes)
    Dataset updated
    May 18, 2021
    Authors
    last_theorem
    Description

    IAM Handwriting Database

    The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments.

    The database was first published in [1] at ICDAR 1999. Using this database, an HMM-based recognition system for handwritten sentences was developed and published in [2] at ICPR 2000. The segmentation scheme used in the second version of the database is documented in [3] and was published at ICPR 2002. The IAM-database as of October 2002 is described in [4]. We use the database extensively in our own research; see the publications for further details.

    The database contains forms of unconstrained handwritten text, which were scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels. The figure below provides samples of a complete form, a text line and some extracted words.

    Characteristics

    The IAM Handwriting Database 3.0 is structured as follows:

    • 657 writers contributed samples of their handwriting
    • 1'539 pages of scanned text
    • 5'685 isolated and labeled sentences
    • 13'353 isolated and labeled text lines
    • 115'320 isolated and labeled words

    The words have been extracted from pages of scanned text using an automatic segmentation scheme and were verified manually. The segmentation scheme has been developed at our institute [3].

    All form, line and word images are provided as PNG files. The corresponding form label files, including segmentation information and a variety of estimated parameters (from the preprocessing steps described in [2]), are included in the image files as meta-information in XML format, which is described in "XML file" and "XML file format (DTD)".

    References

    [1] U. Marti and H. Bunke. A full English sentence database for off-line handwriting recognition. In Proc. of the 5th Int. Conf. on Document Analysis and Recognition, pages 705 - 708, 1999.

    [2] U. Marti and H. Bunke. Handwritten Sentence Recognition. In Proc. of the 15th Int. Conf. on Pattern Recognition, Volume 3, pages 467 - 470, 2000.

    [3] M. Zimmermann and H. Bunke. Automatic Segmentation of the IAM Off-line Database for Handwritten English Text. In Proc. of the 16th Int. Conf. on Pattern Recognition, Volume 4, pages 35 - 39, 2002.

    [4] U. Marti and H. Bunke. The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. Int. Journal on Document Analysis and Recognition, Volume 5, pages 39 - 46, 2002.

    [5] S. Johansson, G.N. Leech and H. Goodluck. Manual of Information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital Computers. Department of English, University of Oslo, Norway, 1978.

  20. Handwriting Dataset

    • universe.roboflow.com
    zip
    Updated May 17, 2024
    Cite
    mohd (2024). Handwriting Dataset [Dataset]. https://universe.roboflow.com/mohd/handwriting-l6jnt
    Explore at:
    Available download formats: zip
    Dataset updated
    May 17, 2024
    Dataset authored and provided by
    mohd
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Digits Bounding Boxes
    Description

    Handwriting

    ## Overview
    
    Handwriting is a dataset for object detection tasks - it contains Digits annotations for 1,866 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
Cite
TejasReddy (2018). IAM Handwriting Top50 [Dataset]. https://www.kaggle.com/datasets/tejasreddy/iam-handwriting-top50

IAM Handwriting Top50

Offline IAM Handwriting Dataset's subset, w.r.t. the 50 most common writers.

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (196047805 bytes)
Dataset updated
Jun 30, 2018
Authors
TejasReddy
License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

