100+ datasets found
  1. h

    turkish-sentiment-analysis-dataset

    • huggingface.co
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence. For such cases there were no data I could find. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  2. h

    turkish-wikiNER

    • huggingface.co
    Updated Aug 19, 2024
    Cite
    Turkish NLP Suite (2024). turkish-wikiNER [Dataset]. https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 19, 2024
    Dataset authored and provided by
    Turkish NLP Suite
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "turkish-nlp-suite/turkish-wikiNER"

      Dataset Summary
    

    Turkish NER dataset from Wikipedia sentences. 20,000 sentences are sampled and re-annotated from the Kuzgunlar NER dataset. Annotations are done by Co-one. Many thanks to them for their contributions. This dataset is also used in our brand new spaCy Turkish packages.

      Dataset Instances
    

    An instance of this dataset looks as follows: { "tokens": ["Çekimler", "5", "Temmuz", "2005", "tarihinde"… See the full description on the dataset page: https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER.

  3. g

    Turkish Web Treebank

    • github.com
    Cite
    Turkish Web Treebank [Dataset]. https://github.com/google-research-datasets/turkish-treebanks
    Explore at:
    License

    http://www.apache.org/licenses/LICENSE-2.0.txt

    Description

    A human-annotated morphosyntactic treebank for Turkish.

  4. s

    Turkish Turkey Dataset

    • shaip.com
    Updated Feb 10, 2023
    Cite
    Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Turkey TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in the…

  5. s

    Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    + more versions
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…

  6. h

    turkishvoicedataset

    • huggingface.co
    Updated Jun 14, 2023
    Cite
    EREN FAZLIOĞLU (2023). turkishvoicedataset [Dataset]. https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 14, 2023
    Authors
    EREN FAZLIOĞLU
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "turkishneuralvoice"

      Dataset Overview
    

    Dataset Name: Turkish Neural Voice
    Description: This dataset contains Turkish audio samples generated using Microsoft Text to Speech services. The dataset includes audio files and their corresponding transcriptions.

      Dataset Structure
    

    Configs:

    default

    Data Files:

    Split: train Path: data/train-*

    Dataset Info:

    Features: audio (audio file), transcription (corresponding text transcription)

    Splits: train… See the full description on the dataset page: https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset.

  7. s

    Turkish Turkey Dataset

    • sm.shaip.com
    Updated Dec 5, 2024
    Cite
    Shaip (2024). Turkish Turkey Dataset [Dataset]. https://sm.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Turkey TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in the…

  8. s

    Wake Lo Lus Turkish Dataset

    • hmn.shaip.com
    Updated Aug 6, 2024
    Cite
    Shaip (2024). Wake Lo Lus Turkish Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…

  9. F

    Turkish Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Turkish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Turkish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Turkish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Turkish speakers from our verified contributor community.
    Regions: Representing different provinces across Turkey to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.
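
The recording details above (stereo, 16-bit, 8 kHz or 16 kHz WAV) can be checked on a delivered file before training. The following is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav`, `matches_listing`, and `make_silent_wav` are hypothetical and not part of any FutureBeeAI tooling, and the silent clip is generated in memory only so the example runs without the real data.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate_hz) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def matches_listing(source, allowed_rates=(8000, 16000)):
    """True if the file is stereo, 16-bit, and sampled at one of the listed rates."""
    channels, bits, rate = describe_wav(source)
    return channels == 2 and bits == 16 and rate in allowed_rates


def make_silent_wav(channels=2, sample_width=2, rate=16000, frames=160):
    """Build a short silent WAV in memory (a stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 bytes = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (channels * sample_width * frames))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
    print(matches_listing(make_silent_wav(rate=44100)))
```

For a real file, pass its path to `matches_listing` instead of the in-memory buffer.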

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Turkish real estate voice applications.
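
The "word error rate below 5%" figure above refers to a standard metric: the number of word substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. The listing does not say how the vendor measured it, so the following is only a sketch of the usual definition; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one row back) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a 5% threshold corresponds to at most one word-level error per twenty reference words on average.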

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

  10. s

    Wake Shoko Turkish Dataset

    • sn.shaip.com
    Updated Sep 11, 2024
    Cite
    Shaip (2024). Wake Shoko Turkish Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…

  11. F

    Turkish Handwritten Sticky Notes OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Turkish Handwritten Sticky Notes OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-sticky-notes-ocr-image-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this Turkish OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images in a single handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Turkish text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these sticky notes were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.
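
As an illustration of how the per-image CSV metadata described above might be used, the sketch below filters image names by orientation with Python's standard `csv` module. The column names (`file_name`, `orientation`, `country`, `language`, `device`) and the three sample rows are invented for this example; the listing names the kinds of fields but not the actual header or values.

```python
import csv
import io

# Made-up sample rows; the real header and values are not given in the listing.
SAMPLE_CSV = """file_name,orientation,country,language,device
note_0001.jpg,portrait,Turkey,Turkish,iPhone 13
note_0002.heic,landscape,Turkey,Turkish,Pixel 6
note_0003.jpg,portrait,Turkey,Turkish,Galaxy S21
"""


def load_metadata(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def select_by_orientation(rows, orientation):
    """Return the file names whose orientation column matches."""
    return [row["file_name"] for row in rows if row["orientation"] == orientation]


if __name__ == "__main__":
    rows = load_metadata(SAMPLE_CSV)
    print(select_by_orientation(rows, "portrait"))
```

The same pattern would apply to filtering by device or any other column once the real header is known.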

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.

    If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.

  12. i

    Corona Virus (COVID-19) Turkish Tweets Dataset

    • ieee-dataport.org
    Updated May 19, 2020
    Cite
    Ibrahim Sabuncu (2020). Corona Virus (COVID-19) Turkish Tweets Dataset [Dataset]. https://ieee-dataport.org/open-access/corona-virus-covid-19-turkish-tweets-dataset-0
    Explore at:
    Dataset updated
    May 19, 2020
    Authors
    Ibrahim Sabuncu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kovid

  13. s

    Wake Word Turkish Dataset

    • so.shaip.com
    Updated Jul 30, 2024
    Cite
    Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://so.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset

  14. s

    Weke Okwu Turkish Dataset

    • ig.shaip.com
    Updated Dec 8, 2024
    Cite
    Shaip (2024). Weke Okwu Turkish Dataset [Dataset]. https://ig.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…

  15. s

    Wake Word Turkish Dataset

    • zu.shaip.com
    Updated Aug 25, 2024
    Cite
    Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 25, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word...

  16. o

    Turkish Natural Language Inference Dataset

    • opendatabay.com
    .undefined
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Turkish Natural Language Inference Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/f4951f96-ebbc-43bf-bed5-36dce9796e6e
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, provides an unparalleled opportunity for research within the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of meticulously curated natural language inference data, which has been carefully translated into Turkish from original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on texts in this vibrant language. Furthermore, it offers valuable insights into cross-lingual generalisation capabilities, allowing investigation into how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering scenarios, featuring Turkish sentences labelled to indicate whether a premise and hypothesis entail, contradict, or are neutral towards each other.

    Columns

    The dataset records typically include the following columns:

    • premise: This column contains sentences written in Turkish. These sentences have been translated from the English sources used for the original SNLI and MNLI datasets. It serves as the contextual information or the initial statement from which an inference is to be made.
    • hypothesis: This column also contains sentences in Turkish, translated from the English SNLI and MNLI datasets. It represents the conclusion or the statement whose relationship to the premise is being assessed.
    • label: This column assigns a relationship between the premise and hypothesis. Possible values include:
      • 'entailment': The hypothesis logically follows from the premise.
      • 'contradiction': The hypothesis directly contradicts the premise.
      • 'neutral': The hypothesis is unrelated to or neither entails nor contradicts the premise.
    • domain: An optional column assigned by some authors, primarily used when inferences are made between sentences across different semantic domains, such as weather, sports, or finance.
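
To make the column layout above concrete, the sketch below reads rows with `premise`, `hypothesis`, and `label` columns and rejects any label outside the three listed values. It uses only Python's standard library. The three Turkish sample rows are invented for this example and are not taken from SNLI-TR or MNLI-TR, and the helper names `read_nli_rows` and `label_counts` are hypothetical.

```python
import csv
import io
from collections import Counter

VALID_LABELS = {"entailment", "contradiction", "neutral"}

# Invented rows for illustration only; not drawn from the actual dataset files.
SAMPLE = """premise,hypothesis,label
Bir adam gitar çalıyor.,Bir adam müzik yapıyor.,entailment
Bir adam gitar çalıyor.,Adam uyuyor.,contradiction
Bir adam gitar çalıyor.,Adam bir konserde.,neutral
"""


def read_nli_rows(text):
    """Parse CSV text into (premise, hypothesis, label) tuples, validating labels."""
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {row['label']!r}")
        rows.append((row["premise"], row["hypothesis"], row["label"]))
    return rows


def label_counts(rows):
    """Count how many rows carry each label."""
    return Counter(label for _, _, label in rows)


if __name__ == "__main__":
    print(label_counts(read_nli_rows(SAMPLE)))
```

Checking the label distribution this way is a quick sanity test before training, since a heavily skewed or malformed label column would otherwise go unnoticed.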

    Distribution

    The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, slni_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels. The dataset is considered large-scale, with the multinli_tr_train.csv file, for instance, containing approximately 392,700 records.

    Usage

    This dataset is ideal for various applications and use cases in NLP and machine learning:

    • Developing Natural Language Inference (NLI)-based question answering systems for the Turkish language.
    • Training sentiment analysis algorithms to discern sentiment in Turkish text.
    • Building Machine Learning Chatbots that leverage NLI to understand conversational context and respond appropriately in Turkish.
    • Conducting general NLI research in Turkish.
    • Investigating cross-lingual generalisation capabilities of NLP models.
    • Tasks such as sentence paraphrasing, classification, and other NLP techniques applied to Turkish text.

    Coverage

    The dataset's scope is primarily focused on the Turkish language, making it relevant for global use. The data has been translated from English sources, expanding its utility for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.

    License

    CC0

    Who Can Use It

    The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:

    • The natural language processing (NLP) community.
    • The machine learning community.
    • Seasoned and budding researchers looking to delve into NLI tasks.
    • Developers aiming to create automated models for Turkish language inference.
    • Academics and practitioners exploring the cross-lingual generalisation capabilities of models.
    • Anyone working on NLP tasks in Turkish, such as sentence paraphrasing, text classification, or question answering.

    Dataset Name Suggestions

    • NLI-TR (Turkish NLI Research)
    • Turkish Natural Language Inference Dataset
    • SNLI-TR and MNLI-TR Turkish Data
    • Turkish Textual Entailment Data

    Attributes

    Original Data Source: NLI-TR (Turkish NLI Research)

  17. P

    MedTurkQuAD: Medical Turkish Question-Answering Dataset Dataset

    • paperswithcode.com
    Updated Oct 15, 2024
    + more versions
    Cite
    Mert Incidelen; Murat Aydogan (2024). MedTurkQuAD: Medical Turkish Question-Answering Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/medturkquad-medical-turkish-question
    Explore at:
    Dataset updated
    Oct 15, 2024
    Authors
    Mert Incidelen; Murat Aydogan
    Description

    A comprehensive Turkish dataset for question-answering tasks in medical domain

  18. R

    Turkish Lira Detection Dataset

    • universe.roboflow.com
    zip
    Updated Apr 21, 2023
    Cite
    tltltltl (2023). Turkish Lira Detection Dataset [Dataset]. https://universe.roboflow.com/tltltltl/turkish-lira-detection
    Available download formats: zip
    Dataset updated
    Apr 21, 2023
    Dataset authored and provided by
    tltltltl
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Türkiye
    Variables measured
    Banknote Bounding Boxes
    Description

    Turkish Lira Detection

    ## Overview
    
    Turkish Lira Detection is a dataset for object detection tasks - it contains Banknote annotations for 4,531 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
    
  19. m

    English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...

    • data.mendeley.com
    Updated Feb 9, 2017
    + more versions
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Explore at:
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers have approximately 23M named entities.

    By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent (b) domain-independent. We produce two different versions by post-processing raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw (b) domain-dependent post-processed (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (varies between versions), while English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained version. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.
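
The fine-to-coarse reduction described above can be pictured as a lookup table from fine-grained types to the four coarse labels. The sketch below is only an illustration of that idea: the fine-grained type names, the `"O"` marker for untagged tokens, and the rule of dropping a sentence when a type has no mapping are all assumptions, not the authors' published procedure.

```python
COARSE_TYPES = ("organization", "person", "location", "misc")

# Hypothetical fine-grained types; the real inventory has more than 1000 entries.
FINE_TO_COARSE = {
    "football_team": "organization",
    "politician": "person",
    "river": "location",
    "film": "misc",
}


def coarsen_sentence(tagged_tokens, mapping=FINE_TO_COARSE):
    """Map each (token, fine_tag) pair to (token, coarse_tag).

    Tokens tagged "O" (assumed here to mean "not an entity") pass through.
    Returns None when a fine-grained tag has no coarse counterpart, which
    models a sentence being dropped from the coarse-grained version.
    """
    result = []
    for token, tag in tagged_tokens:
        if tag == "O":
            result.append((token, "O"))
        elif tag in mapping:
            result.append((token, mapping[tag]))
        else:
            return None
    return result


if __name__ == "__main__":
    print(coarsen_sentence([("Fırat", "river"), ("uzundur", "O")]))
```

Dropping sentences with unmapped types is one simple way such a reduction could shrink the collection, in line with the smaller sentence counts reported for the coarse-grained versions.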

    All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.

  20. F

    Turkish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-newspaper-book-magazine-ocr-image-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Content & Diversity:

    Containing a total of 5000 images, this Turkish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible Turkish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Turkish people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Turkish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Turkish language. Your journey to enhanced language understanding and processing starts here.

Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset

turkish-sentiment-analysis-dataset

Turkish Sentiment Dataset

winvoker/turkish-sentiment-analysis-dataset

Explore at:
17 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2022
Authors
Batuhan
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dataset

This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence. For such cases there were no data I could find. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
