4 datasets found
  1. Tashkeela

    • huggingface.co
    Updated Aug 4, 2024
    Cite
    Eman Khater (2024). Tashkeela [Dataset]. https://huggingface.co/datasets/EmanKhater/Tashkeela
    Dataset updated
    Aug 4, 2024
    Authors
    Eman Khater
    Description


    Content: A version of the Tashkeela Arabic diacritized text dataset with the non-Arabic content and the undiacritized text removed, then divided into training, development, and testing sets. The cleaning process includes removing the XML tags and strange symbols, as well as fixing… See the full description on the dataset page: https://huggingface.co/datasets/EmanKhater/Tashkeela.
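The cleaning steps named in this description (removing XML tags, removing non-Arabic content, dropping undiacritized text) can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual pipeline: the function names `clean_line` and `clean_corpus`, the set of characters kept, and the rule "keep a line if it has at least one diacritic" are all assumptions made for the example.

```python
import re

# Arabic diacritic marks (tashkeel): U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Anything that looks like an XML/HTML tag.
XML_TAG = re.compile(r"<[^>]+>")
# Assumption: keep Arabic letters, diacritics, whitespace, and a little
# punctuation (including the Arabic comma, semicolon, and question mark).
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0641-\u0652\s.,!?:\u060C\u061B\u061F]")


def clean_line(line: str) -> str:
    """Strip tags and non-Arabic characters, then collapse whitespace."""
    line = XML_TAG.sub(" ", line)
    line = NON_ARABIC.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(lines):
    """Return cleaned lines, dropping empty and undiacritized ones."""
    cleaned = []
    for line in lines:
        text = clean_line(line)
        # Hypothetical rule: a line counts as diacritized if it has any mark.
        if text and DIACRITICS.search(text):
            cleaned.append(text)
    return cleaned
```

A real pipeline would likely use a stricter test for "undiacritized" than a single mark, and the split into training, development, and testing sets is not shown because the description does not state the proportions.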

  2. Sadeed_Tashkeela

    • huggingface.co
    Updated May 1, 2025
    Cite
    Misraj Ai (2025). Sadeed_Tashkeela [Dataset]. https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela
    Dataset updated
    May 1, 2025
    Dataset authored and provided by
    Misraj Ai
    Description

    📚 Sadeed Tashkeela Arabic Diacritization Dataset

    The Sadeed dataset is a large, high-quality Arabic diacritized corpus optimized for training and evaluating Arabic diacritization models. The training set is built exclusively from the Tashkeela corpus, and the test set is a refined version of the Fadel Tashkeela test set.

      Dataset Overview
    

    Training Data:

    Source: Cleaned version of the Tashkeela corpus (original data is ~75 million words, mostly Classical… See the full description on the dataset page: https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela.

  3. roots_ar_tashkeela

    • huggingface.co
    Updated Apr 28, 2023
    Cite
    BigScience Data (2023). roots_ar_tashkeela [Dataset]. https://huggingface.co/datasets/bigscience-data/roots_ar_tashkeela
    Dataset updated
    Apr 28, 2023
    Dataset authored and provided by
    BigScience Data
    License

    https://choosealicense.com/licenses/gpl-2.0/

    Description

    ROOTS Subset: roots_ar_tashkeela

      Tashkeela
    

    Dataset uid: tashkeela

      Description
    

    The dataset was collected from 97 books in both Modern and Classical Arabic. The dataset contains Arabic diacritics. The dataset is

      Homepage
    

    https://sourceforge.net/projects/tashkeela/

      Licensing
    

    gpl-2.0: GNU General Public License v2.0 only

      Speaker Locations
    
    
    
    
    
      Sizes
    

    0.2533 % of total
    2.3340 % of ar

      BigScience processing steps
    
    
    
    
    
      Filters… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_tashkeela.
    
  4. arabic-tashkeel-dataset

    • huggingface.co
    Updated Oct 22, 2024
    Cite
    Rockikz (2024). arabic-tashkeel-dataset [Dataset]. https://huggingface.co/datasets/Abdou/arabic-tashkeel-dataset
    Dataset updated
    Oct 22, 2024
    Authors
    Rockikz
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Arabic Tashkeel Dataset

    This is a fairly large dataset gathered from five main sources:

    tashkeela (1.79GB - 45.05%): The entire Tashkeela dataset, restructured into sentences. Some rows were omitted because they have a low rate of diacritics (tashkeel characters).
    shamela (1.67GB - 42.10%): Random pages from over 2,000 books in the Shamela Library. Pages with a high diacritics rate were selected using a function given in the full dataset description.
    wikipedia (269.94MB - 6.64%): A collection of Wikipedia articles. Diacritics… See the full description on the dataset page: https://huggingface.co/datasets/Abdou/arabic-tashkeel-dataset.
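The description mentions filtering rows and pages by their diacritics rate, but the function the dataset author used is not reproduced in this excerpt. The sketch below shows one plausible way to compute such a rate. The names `diacritics_rate` and `keep_high_diacritics`, the definition of the rate (diacritic marks divided by non-whitespace characters), and the 0.3 threshold are all hypothetical.

```python
# Arabic diacritic marks (tashkeel): U+064B through U+0652 inclusive.
DIACRITICS = frozenset(chr(code) for code in range(0x064B, 0x0653))


def diacritics_rate(text: str) -> float:
    """Fraction of non-whitespace characters that are diacritic marks."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch in DIACRITICS for ch in chars) / len(chars)


def keep_high_diacritics(rows, min_rate=0.3):
    """Keep rows whose rate is at least min_rate (threshold is an assumption)."""
    return [row for row in rows if diacritics_rate(row) >= min_rate]
```

Under this definition, a fully vocalized word with one mark per letter scores 0.5, so a threshold around 0.3 would pass mostly diacritized text while rejecting plain text. The value the dataset actually used may differ.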


Tashkeela

EmanKhater/Tashkeela

162 scholarly articles cite this dataset (View in Google Scholar)