12 datasets found
  1. Urdu Handwritten Text Dataset

    • data.mendeley.com
    Updated Aug 9, 2021
    Mujtaba Husnain (2021). Urdu Handwritten Text Dataset [Dataset]. http://doi.org/10.17632/bg2sctsysf.1
    Dataset updated
    Aug 9, 2021
    Authors
    Mujtaba Husnain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains images of handwritten text in Urdu, one of the most widely spoken languages in South Asia. Native-speaking writers from different social backgrounds were invited to copy a prepared text in their own handwriting. The text was composed so that it includes almost all the characters, ligatures, diacritics, and dots used in Urdu script. People with disabilities also took part, which makes the collection more comprehensive. Demographic data about the writers was recorded to support research such as author identification and text matching.

  2. MANUU: Handwritten Urdu OCR Dataset

    • ieee-dataport.org
    Updated Dec 15, 2024
    Shaik Ahmed (2024). MANUU: Handwritten Urdu OCR Dataset [Dataset]. https://ieee-dataport.org/documents/manuu-handwritten-urdu-ocr-dataset
    Dataset updated
    Dec 15, 2024
    Authors
    Shaik Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    digits

  3. UHaT Dataset: Urdu Handwritten Text Dataset

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Feb 18, 2020
    Hazrat Ali; Hazrat Ali (2020). UHaT Dataset: Urdu Handwritten Text Dataset [Dataset]. http://doi.org/10.5281/zenodo.3670611
    Available download formats: zip
    Dataset updated
    Feb 18, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hazrat Ali; Hazrat Ali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    UHaT Dataset

    UHaT: Urdu Handwritten Text Dataset

    This dataset contains handwritten characters and digits of the Urdu language. The samples were written by 900+ individuals.

    Description and organization

    Size of images: All images are stored at a resolution of 28 by 28.

    How many images: The training set contains about 700 images per character on average. For example, there are 811 training images for AYN and 697 for ALIF. Similarly, the training set contains about 700 images per digit on average; for example, there are 678 training images for the digit one. The test set contains about 140 images per character on average; for example, there are 145 test images for the character ALIF. The test set likewise contains about 140 images per digit on average; for example, there are 147 test images for the digit nine.

    The dataset is organized into four sub-directories: characters training set, characters test set, digits training set, and digits test set. Each sub-directory contains one sub-folder per character. For example, all the training images for the character ALIF are placed in the sub-folder Alif.

    The folder hierarchy is given as:

    Data > characters train set > alif

    Data > characters train set > ayn

    And so on….
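    The one-sub-folder-per-character layout described above lends itself to a quick sanity check of the per-class counts. The following is a minimal sketch, not part of the dataset's own tooling: the function name and the set of image extensions are assumptions, since the listing does not state the image file format.

```python
from pathlib import Path

# Extensions treated as images. This set is an assumption; the listing
# does not say which image format the dataset uses.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(split_dir):
    """Return {sub-folder name: number of image files} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

    Pointing this at a split directory such as the characters train set (the exact on-disk path depends on how the archive is unpacked) would give counts to compare against the averages quoted above.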

    How to load directly?

    You can also load the dataset directly from the uhat_dataset.npz file. See the kernel load_dataset.
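    The names of the arrays stored in uhat_dataset.npz are not given in this listing, so a reasonable first step is to inspect the archive rather than guess its keys. The sketch below uses NumPy's `np.load`; the helper name `summarize_npz` is hypothetical and this is not the referenced load_dataset kernel.

```python
import numpy as np


def summarize_npz(path):
    """Return {array name: (shape, dtype string)} for every array in an .npz file."""
    # np.load on an .npz file returns a lazy archive; .files lists the stored names.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

    Calling `summarize_npz("uhat_dataset.npz")` should reveal which arrays hold the images and which hold the labels, after which they can be read by name.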

    Acknowledgements

    Thanks to all volunteers who contributed by providing handwriting samples.

    Inspiration

    This is an MNIST-style dataset. The machine learning community in general should find it useful for experimenting with and demonstrating machine learning models.
    The dataset also gives researchers an opportunity to work on Urdu text recognition.

  4. Urdu Handwritten Text Dataset Dataset

    • paperswithcode.com
    Updated Mar 18, 2025
    (2025). Urdu Handwritten Text Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/urdu-handwritten-text-dataset
    Dataset updated
    Mar 18, 2025
    Description

    This dataset consists of high-quality images of handwritten text in the Urdu language, one of the most commonly spoken languages in South Asia, especially in Pakistan, India, and surrounding regions. The dataset has been created by inviting native Urdu speakers from diverse social, educational, and cultural backgrounds to write a predefined text in their natural handwriting style. This predefined text was carefully curated to cover the full range of Urdu characters, ligatures, diacritics, dots, and special symbols used in everyday writing.

    Dataset Features

    Diverse Handwriting Styles: The dataset includes contributions from native speakers across different demographics, ensuring a rich variety of handwriting styles.

    Comprehensive Character Set: The predefined text covers all characters, ligatures, diacritics, and dots commonly used in Urdu script.

    Inclusivity: Contributions from people with disabilities add unique variations to the dataset, making it more diverse and comprehensive.

    Demographic Information

    The demographic details of contributors, including age, gender, and educational background, are recorded. This information is particularly valuable for research related to author identification, handwriting analysis, and text-matching algorithms.

    Potential Applications

    This dataset has numerous applications, including:

    OCR Development: Enhancing Optical Character Recognition systems for Urdu text.

    Handwriting Authentication: Improving security through handwriting-based user verification.

    Linguistic Studies: Supporting research in Urdu language processing, script digitization, and handwriting analysis.

    Forensic Handwriting Analysis: Assisting in forensic research for identifying individual handwriting patterns.

    Multilingual Handwriting Recognition: Building robust AI models that can recognize handwriting across different languages and scripts.

    Quality Control

    The dataset has undergone a rigorous quality check to ensure consistency, accuracy, and usability across various academic and commercial research projects, particularly those that involve natural language processing and computer vision technologies.

    This dataset is sourced from Kaggle.

  5. Urdu Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Urdu Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/urdu-shopping-list-ocr-image-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Urdu Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Urdu language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this Urdu OCR dataset offers a wide distribution of shopping list images. It covers a variety of handwritten text, including sentences, individual item names, quantities, and comments. The images show distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness when training an OCR model, only a limited number (fewer than three) of unique images is allowed per handwriting, so that many different handwritings are represented. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that at least 80% of the space in each image contains visible Urdu text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All shopping lists were written and photographed by native Urdu speakers to ensure text quality, prevent toxic content, and exclude PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Urdu text recognition models.
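    As an illustration of how such per-image CSV metadata might be used, the sketch below tallies rows by the value of one column, for example to check the balance between portrait and landscape images. The column names are hypothetical: the listing names the kinds of fields (orientation, country, language, device details) but not the actual CSV headers.

```python
import csv
from collections import Counter


def count_by_column(csv_path, column):
    """Count how many metadata rows carry each value of `column`."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # `column` must match a header in the file; the real header names
        # are not given in the listing and must be checked against the CSV.
        return Counter(row[column] for row in reader)
```

    With a real metadata file, `count_by_column(path, "orientation")` would work only if the header is actually spelled that way.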

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Urdu crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Urdu language. Your journey to improved language understanding and processing begins here.

  6. Urdu Handwritten Ligature Dataset

    • ieee-dataport.org
    Updated Nov 4, 2024
    AEJAZ GANAI (2024). Urdu Handwritten Ligature Dataset [Dataset]. https://ieee-dataport.org/documents/urdu-handwritten-ligature-dataset
    Dataset updated
    Nov 4, 2024
    Authors
    AEJAZ GANAI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ink color

  7. Urdu Handwriting Dataset for Demographic Traits Classification

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Rehman, Huma (2020). Urdu Handwriting Dataset for Demographic Traits Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2573098
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Mirza, Ghulam Ali
    Rehman, Huma
    Mustufa, Syed Ghulam
    Description

    The Urdu Handwriting Dataset for Demographic Traits Classification was developed at Bahria University, Islamabad, Pakistan, as part of a bachelor's degree final-year thesis/project. The authors describe it as the first dataset of its kind. It consists of 1000 handwriting images, each taken from a different individual. As the title indicates, the handwriting samples are specifically in Urdu. The dataset was made for the problem of classifying demographic traits, so it includes demographic information for each individual. The following demographic traits are covered:

    Gender (Male, Female)

    Handedness (Left, Right)

    Age-Group (15-20, 21-30, 31-40, 41-50, 51-up)

    Province (Balochistan, Sindh, Punjab, KPK, Gilgit-Baltistan, None)

    Occupation (Student, Employee, Both, None)

    Education (Primary (Below Matriculation), Matriculation, Intermediate, Bachelors, Masters, PhD, None)
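    The trait list above amounts to a small label schema. A minimal sketch of encoding it for validation follows; the key names, the normalized spelling and capitalization of the values, and the helper `invalid_fields` are all assumptions for illustration, and the dataset's actual label files may spell things differently.

```python
# Allowed values per trait, transcribed from the list above.
# Key names and value capitalization are normalized here as an assumption.
TRAITS = {
    "gender": {"Male", "Female"},
    "handedness": {"Left", "Right"},
    "age_group": {"15-20", "21-30", "31-40", "41-50", "51-up"},
    "province": {"Balochistan", "Sindh", "Punjab", "KPK", "Gilgit-Baltistan", "None"},
    "occupation": {"Student", "Employee", "Both", "None"},
    "education": {
        "Primary", "Matriculation", "Intermediate",
        "Bachelors", "Masters", "PhD", "None",
    },
}


def invalid_fields(record):
    """Return the sorted trait names whose value is missing or not allowed."""
    return sorted(
        trait for trait, allowed in TRAITS.items()
        if record.get(trait) not in allowed
    )
```

    An empty list from `invalid_fields` means every trait in the record carries one of the listed values.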

  8. Details of hyper-parameters for proposed ET-Network.

    • plos.figshare.com
    xls
    Updated May 17, 2024
    Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Details of hyper-parameters for proposed ET-Network. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t007
    Available download formats: xls
    Dataset updated
    May 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ameer Hamza; Shengbing Ren; Usman Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Details of hyper-parameters for proposed ET-Network.

  9. The comparison of CER on valid and test splits with proposed models.

    • plos.figshare.com
    xls
    Updated May 17, 2024
    Ameer Hamza; Shengbing Ren; Usman Saeed (2024). The comparison of CER on valid and test splits with proposed models. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t004
    Available download formats: xls
    Dataset updated
    May 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ameer Hamza; Shengbing Ren; Usman Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The comparison of CER on valid and test splits with proposed models.

  10. The comparison of the proposed model with state-of-the-art methods.

    • plos.figshare.com
    xls
    Updated May 17, 2024
    Ameer Hamza; Shengbing Ren; Usman Saeed (2024). The comparison of the proposed model with state-of-the-art methods. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t005
    Available download formats: xls
    Dataset updated
    May 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ameer Hamza; Shengbing Ren; Usman Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The comparison of the proposed model with state-of-the-art methods.

  11. Advantages and disadvantages of state-of-the-art methods.

    • plos.figshare.com
    xls
    Updated May 17, 2024
    Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Advantages and disadvantages of state-of-the-art methods. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t006
    Available download formats: xls
    Dataset updated
    May 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ameer Hamza; Shengbing Ren; Usman Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advantages and disadvantages of state-of-the-art methods.

  12. Ablation study of self-attention layer (SAL).

    • plos.figshare.com
    xls
    Updated May 17, 2024
    Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Ablation study of self-attention layer (SAL). [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t003
    Available download formats: xls
    Dataset updated
    May 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ameer Hamza; Shengbing Ren; Usman Saeed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The table evaluates the proposed model with different attention layers.


Urdu Handwritten Text Dataset (Mujtaba Husnain, 2021): 92 scholarly articles cite this dataset.