100+ datasets found
  1. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 22, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 22, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. I therefore created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish Wikipedia dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  2. turkish-toxic-language

    • huggingface.co
    Updated Apr 10, 2023
    Cite
    Overfit-GM (2023). turkish-toxic-language [Dataset]. https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2023
    Dataset authored and provided by
    Overfit-GM
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Turkish Texts for Toxic Language Detection

    Dataset Description

    Dataset Summary

    This text dataset is a collection of Turkish texts that have been merged from various existing offensive language datasets found online. The dataset contains a total of 77,800 instances, each labeled as either offensive or not offensive. To ensure the dataset's completeness, we utilized multiple transformer models to augment the dataset with pseudo labels. The resulting dataset is… See the full description on the dataset page: https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language.

  3. Turkish Turkey Dataset

    • shaip.com
    Updated Feb 10, 2023
    + more versions
    Cite
    Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Turkish Turkey Dataset (Türkiye Türkiye Veri Kümesi): High-Quality Turkish Turkey Scripted Monologue Dataset for AI & Speech Models

    Overview
    Title (Language): Turkish Turkey Language Dataset
    Dataset Types: Scripted Monologue
    Country: Turkey
    Description: The Scripted Monologue dataset consists…

  4. Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    + more versions
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wake Word Turkish Dataset: High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  5. Wake Word Turkish Dataset

    • ny.shaip.com
    Updated Sep 20, 2025
    Cite
    Shaip (2025). Wake Word Turkish Dataset [Dataset]. https://ny.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wake Word Turkish Dataset: High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  6. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University: http://hacettepe.edu.tr/
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We selected two of the most popular movie and hotel recommendation websites among those ranked highly by Alexa: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. The reviews of 5,660 movies were examined. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected. The same procedure was applied to the hotel reviews, except that hotel reviews were rated with numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated from 0 to 40, we selected 5,800 of the 6,499 positive reviews rated from 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
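
    The balancing step described above amounts to random undersampling of the majority class. The following is a minimal sketch of that idea, not the authors' actual code: the function name, the fixed seed, and the stand-in integer IDs are all assumptions made for illustration, and only the counts come from the description.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Randomly keep as many positives as there are negatives.

    Assumes positives is the larger class; random.sample raises
    ValueError if it is asked for more items than are available.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    kept_positives = rng.sample(positives, k=len(negatives))
    return kept_positives, negatives


# Stand-in IDs sized like the movie-review counts quoted in the description.
positive_ids = list(range(130_210))  # reviews rated 4 or 5 stars
negative_ids = list(range(26_700))   # reviews rated 1 or 2 stars

kept_pos, kept_neg = balance_by_undersampling(positive_ids, negative_ids)
balanced_total = len(kept_pos) + len(kept_neg)
print(balanced_total)  # 53400
```

    The total matches the 53,400 movie reviews reported above. Sampling without replacement guarantees that no positive review is selected twice.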

  7. turkish-academic-theses-dataset

    • huggingface.co
    Updated Nov 10, 2025
    Cite
    Umut Ertuğrul Daşgın (2025). turkish-academic-theses-dataset [Dataset]. https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset
    Explore at:
    Dataset updated
    Nov 10, 2025
    Authors
    Umut Ertuğrul Daşgın
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📚 Turkish Academic Theses Abstracts (TR/EN)

    A large-scale bilingual (Turkish–English) collection of abstracts from Turkish academic theses (YÖK Ulusal Tez Merkezi). This dataset focuses only on abstracts, provided in Turkish (abstract_tr) and English (abstract_en), suitable for summarization, translation, classification, and retrieval.

    Records: ~650k abstracts (TR & EN)
    Format: Parquet (.parquet), fast and analytics-friendly
    Language: 🇹🇷 Turkish + 🇬🇧 English (parallel abstracts… See the full description on the dataset page: https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset.

  8. Turkish Handwritten Sticky Notes OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Handwritten Sticky Notes OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-sticky-notes-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Contents & Diversity:

    Containing more than 2000 images, this Turkish OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Turkish text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these sticky notes were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.

    If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.

  9. Dataset of books about Turkish language-Spoken Turkish

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books about Turkish language-Spoken Turkish [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_subject&fop0=%3D&fval0=Turkish+language-Spoken+Turkish&j=1&j0=book_subjects
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 12 rows and is filtered to books whose subject is Turkish language-Spoken Turkish. It features 9 columns, including author, publication date, language, and book publisher.

  10. Wake Lo Lus Turkish Dataset

    • hmn.shaip.com
    Updated Aug 6, 2024
    Cite
    Shaip (2024). Wake Lo Lus Turkish Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wake Word Turkish Dataset: High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  11. Turkish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Türkiye
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
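
    The stated audio format (stereo WAV, 16-bit, 16 kHz) can be checked on a file with Python's standard `wave` module. The sketch below is illustrative only: `describe_wav` and `make_silent_wav` are hypothetical helper names, and the one-second silent clip is generated in memory as a stand-in because the real files are not available here.

```python
import io
import wave


def describe_wav(fileobj):
    """Read the header fields that the recording details above specify."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "sample_rate": wav.getframerate(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def make_silent_wav(seconds=1, channels=2, sample_width=2, rate=16_000):
    """Build a silent in-memory WAV clip (stand-in for a dataset file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (seconds * rate * channels * sample_width))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav())
print(info)

# Raw PCM payload for 30 hours at 16 kHz, 2 bytes per sample, 2 channels.
bytes_for_30_hours = 30 * 3600 * 16_000 * 2 * 2
print(bytes_for_30_hours)  # 6912000000
```

    At these settings, 30 hours of audio comes to roughly 6.9 GB of uncompressed sample data, excluding file headers and transcriptions.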

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
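
    The "WER < 5%" figure refers to word error rate, which is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard calculation; it is not FutureBeeAI's QA tooling, and the Turkish sentence pair is a made-up example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: the hypothesis drops one of four reference words.
wer = word_error_rate("bugün hava çok güzel", "bugün hava güzel")
print(wer)  # 0.25
```

    Splitting on whitespace is a simplification; real evaluations usually normalize case and punctuation first.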

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Turkish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Turkish.
    Voice Assistants: Build smart assistants capable of understanding natural Turkish conversations.

  12. Turkish Poems

    • kaggle.com
    Updated Jan 26, 2022
    Cite
    Emre Okcular (2022). Turkish Poems [Dataset]. https://www.kaggle.com/datasets/emreokcular/turkish-poems
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Kaggle
    Authors
    Emre Okcular
    Description

    Context

    This dataset includes poems from popular Turkish poets. All the data is scraped from poetry pages.

    Content

    Poems are a crucial part of a culture. They represent the language of emotions, which makes this dataset interesting for answering many questions.

    Inspiration

    This dataset is a great resource for applying NLP techniques. Many different methods from basic analysis like word cloud to lyrics generation with LSTM neural networks can be applied.

    Disclaimer: The file might contain missing poems, poem text in different languages, or incorrect poems.

  13. Turkish Podcast

    • defined.ai
    Updated Apr 21, 2025
    Cite
    Defined.ai (2025). Turkish Podcast [Dataset]. https://defined.ai/datasets/turkish-podcast
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Defined.ai
    Description

    51,694 hours of high quality, niche Turkish live podcast data to enhance your AI model.

  14. Turkish Speecon database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). Turkish Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0178/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Turkish Speecon database is divided into 2 sets:
    1) The first set comprises the recordings of 550 adult Turkish speakers (280 males, 270 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
    2) The second set comprises the recordings of 50 child Turkish speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).
    This database is partitioned into 28 DVDs (first set) and 4 DVDs (second set).

    The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.

    Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items:
    Calibration data: 6 noise recordings
    The “silence word” recording
    Free spontaneous items (adults only): 3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
    17 Elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city name, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
    Read speech:
    30 phonetically rich sentences uttered by adults and 60 uttered by children
    5 phonetically rich words (adults only)
    4 isolated digits
    1 isolated digit sequence
    4 connected digit sequences
    1 telephone number
    3 natural numbers
    1 money amount
    2 time phrases (T1 : analogue, T2 : digital)
    3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
    3 letter sequences
    1 proper name
    2 city or street names
    2 questions
    2 special keyboard characters
    1 Web address
    1 email address
    222 application specific words and phrases per session (adults)
    74 toy commands, 14 general commands, 31 phone commands and 4 application word synonyms (children)

    The following age distribution has been obtained:
    Adults: 244 speakers are between 15 and 30, 235 speakers are between 31 and 45, and 71 speakers are over 46.
    Children: 25 speakers are between 8 and 10, 25 speakers are between 11 and 15.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
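
    The signal format quoted above (16 bit, unsigned, lo-hi byte order) can be decoded with Python's `struct` module. This is a minimal sketch under stated assumptions: the function name is hypothetical, the six input bytes are invented, header-less raw sample data is assumed, and the re-centering around 32768 is a common convention for unsigned audio rather than something the description confirms.

```python
import struct


def decode_unsigned_16bit_le(raw):
    """Decode raw bytes as 16-bit unsigned little-endian (lo-hi) samples."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    count = len(raw) // 2
    # "<" selects little-endian byte order, "H" is an unsigned 16-bit integer.
    return list(struct.unpack(f"<{count}H", raw))


# Three invented samples: bytes 00 80, 01 80, and ff 7f.
samples = decode_unsigned_16bit_le(b"\x00\x80\x01\x80\xff\x7f")
print(samples)  # [32768, 32769, 32767]

# Assumption: the midpoint 32768 represents silence, so subtract it to center.
centered = [sample - 32768 for sample in samples]
print(centered)  # [0, 1, -1]

# At 16 kHz, the sample count converts to seconds as follows.
duration_seconds = len(samples) / 16_000
```

    The accompanying SAM label files would need to be consulted to confirm the actual sample layout before relying on this.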

  15. I-Turkish Turkey Dataset

    • zu.shaip.com
    Updated Aug 8, 2024
    Cite
    Shaip (2024). I-Turkish Turkey Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Turkish Turkey Dataset (Türkiye Türkiye Veri Kümesi): High-Quality Turkish Turkey Scripted Monologue Dataset for AI & Speech Models

    Overview
    Title (Language): Turkish Turkey Language Dataset
    Dataset Types: Scripted Monologue
    Country: Turkey
    Description: The Scripted Monologue dataset consists...

  16. Turkish OCR Dataset – 5,000 Natural Scene Images

    • nexdata.ai
    Updated Feb 19, 2025
    Cite
    Nexdata (2025). Turkish OCR Dataset – 5,000 Natural Scene Images [Dataset]. https://www.nexdata.ai/datasets/ocr/1697
    Explore at:
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Device, Data size, Data format, Accuracy rate, Data diversity, Annotation content, Photographic angle, Collecting environment
    Description

    This dataset contains 5,000 images of Turkish natural scenes with text. The data includes a variety of natural scenarios and multiple shooting angles. The text is annotated with quadrilateral or polygon bounding boxes and transcriptions. This data can be used for Turkish OCR, scene text recognition, and text detection in natural images.

  17. 1620 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue...

    • nexdata.ai
    Updated Feb 9, 2024
    Cite
    Nexdata (2024). 1620 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1324
    Explore at:
    Dataset updated
    Feb 9, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Türkiye
    Variables measured
    Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    The Turkish (Turkey) Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live, and other generic domains, and mirrors real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  18. 50Million Rows Turkish Market Sales Dataset(MSSQL)

    • kaggle.com
    Updated Aug 31, 2023
    Cite
    Omer Colakoglu (2023). 50Million Rows Turkish Market Sales Dataset(MSSQL) [Dataset]. https://www.kaggle.com/datasets/omercolakoglu/50million-rows-turkish-market-sales-datasetmssql
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 31, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Omer Colakoglu
    Description

    50 Million Rows MSSQL Backup File with Clustered Columnstore Index.

    This dataset contains:
    - 27K categorized Turkish supermarket items
    - 81 stores (every city of Turkey has a store)
    - 100K customers with real Turkish names and addresses
    - 10M rows of randomly generated sales data
    - Near-real prices for all data, with an inflation factor over time

    All the data is randomly generated. The usernames were generated from real Turkish names and surnames, but they are not real people. The sales data is generated randomly, but it follows some rules. For example, every order can contain 1-9 kinds of item, and every order line amount can be 1-9 pieces. The randomization function works according to the population of the city, so the number of orders for Istanbul (the biggest city of Turkey) is about 20% of all data, and the orders for another city, for example Gaziantep (whose population is 2.5% of Turkey's population), are about 2.5% of all data.
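
    The generation rules described above (orders distributed by city population, 1-9 kinds of item per order, 1-9 pieces per order line) can be sketched with Python's `random` module. This is an illustration, not the author's MSSQL procedure: the function names and fixed seeds are hypothetical, and the "Other cities" bucket is an assumption that lumps together the 79 cities the description does not quantify.

```python
import random
from collections import Counter

# Population shares in percent. Istanbul and Gaziantep come from the
# description; "Other cities" is an assumed catch-all for the remainder.
city_weights = {"Istanbul": 20.0, "Gaziantep": 2.5, "Other cities": 77.5}


def assign_order_cities(n_orders, weights, seed=0):
    """Pick a city for each order with probability proportional to its weight."""
    rng = random.Random(seed)
    cities = list(weights)
    return rng.choices(cities, weights=[weights[c] for c in cities], k=n_orders)


def random_order_lines(rng):
    """Return piece counts for one order: 1-9 kinds of item, 1-9 pieces each."""
    kinds = rng.randint(1, 9)
    return [rng.randint(1, 9) for _ in range(kinds)]


orders = assign_order_cities(100_000, city_weights)
shares = {city: count / len(orders) for city, count in Counter(orders).items()}
print(shares)

example_order = random_order_lines(random.Random(1))
print(example_order)
```

    With a large enough sample, the observed shares settle close to the configured weights, which is the behavior the description reports for Istanbul and Gaziantep.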

  19. Turkish Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Turkish Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-shopping-list-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Contents & Diversity:

    Containing more than 2000 images, this Turkish OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, comments, etc. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Turkish text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these shopping lists were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.

  20. Turkish Tiel Dataset

    • universe.roboflow.com
    zip
    Updated Dec 13, 2023
    Cite
    edoski (2023). Turkish Tiel Dataset [Dataset]. https://universe.roboflow.com/edoski/turkish-tiel/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 13, 2023
    Dataset authored and provided by
    edoski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Money Bounding Boxes
    Description

    Turkish Tiel

    ## Overview
    
    Turkish Tiel is a dataset for object detection tasks - it contains Money annotations for 604 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset

turkish-sentiment-analysis-dataset

Turkish Sentiment Dataset

winvoker/turkish-sentiment-analysis-dataset

Explore at:
26 scholarly articles cite this dataset (View in Google Scholar)
