According to a survey conducted in 2021, Tunisian Arabic was the main language spoken in around 93 percent of households in Tunisia. Arabic followed, spoken by roughly 6 percent of Tunisians. Berber accounted for only 0.1 percent, according to the survey.
According to a survey conducted in 2022, Tunisian Arabic was the main language spoken at home in Tunisia. Tunisian Arabic-speaking households were more common in urban areas (around 94 percent) than in rural areas (roughly 91 percent). By contrast, rural households were more linguistically diverse, with larger shares of people who spoke Arabic, French, and Berber.
As of 2021, there were seven living languages in Tunisia. The largest group, three languages, was categorized as developing, meaning that they were in the initial phase of development. In addition, two languages (Arabic and French) were used at institutional levels in the country.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The OrienTel French as spoken in Tunisia database comprises 576 Tunisian speakers of French (290 males, 286 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
• 1 isolated single digit
• 1 sequence of 10 isolated digits
• 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
• 1 currency money amount
• 2 natural numbers
• 3 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Western calendar)
• 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
• 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
• 5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
• 2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
• 6 application keywords/keyphrases
• 1 word spotting phrase using embedded application words
• 4 phonetically rich words
• 9 phonetically rich sentences
• 2+3 spontaneous items (for control)

The following age distribution has been obtained: 2 speakers are below 16, 407 speakers are between 16 and 30, 104 speakers are between 31 and 45, 59 speakers are between 46 and 60, 4 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
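Since the samples are described as 8-bit 8 kHz A-law, a reader may want to expand them to linear PCM before further processing. The following is a minimal sketch of the standard G.711 A-law expansion, not code shipped with the database. It assumes the signal files are raw and headerless, which the description does not state; the function names `alaw_to_pcm16`, `decode_alaw`, and `duration_seconds` are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # from the database description: 8 kHz sampling


def alaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 A-law byte to a signed linear sample."""
    byte ^= 0x55                      # undo the even-bit inversion used by A-law
    sign = byte & 0x80                # after inversion, a set bit means positive
    exponent = (byte >> 4) & 0x07     # 3-bit segment number
    mantissa = byte & 0x0F            # 4-bit step within the segment
    if exponent == 0:
        magnitude = (mantissa << 4) + 8
    else:
        magnitude = ((mantissa << 4) + 0x108) << (exponent - 1)
    return magnitude if sign else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a buffer of A-law bytes (assumed raw, no header)."""
    return [alaw_to_pcm16(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """One byte per sample at 8 kHz, so duration is length / 8000."""
    return len(data) / SAMPLE_RATE_HZ
```

With one byte per sample, a file of 16000 bytes would correspond to two seconds of audio under the headerless assumption.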
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We developed a tool for collecting Tunisian dialect data that prompts users to record themselves reading provided phrases. The sentences were sourced from Tunisiya and were therefore removed from the LM training corpus. In total, 89 people participated, yielding 2631 distinct phrases. This set is called TunSwitch TO, "TO" standing for Tunisian Only, as these sentences contain no non-Tunisian words.
In response to the limited availability of paired text-speech Tunisian datasets with code-switching, we built a corpus through meticulous manual annotation. Whenever they occur, French and English words are enclosed within "<>" tags, while Tunisian words are left without any enclosing tags. Although these tags are not used in the proposed models, they make it possible to compute language-usage statistics and may be useful for future approaches to handling code-switching. The resulting set is released as TunSwitch CS, "CS" standing for Code-Switched.
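As an illustration of the language-usage statistics such tags allow, here is a minimal sketch that counts tagged and untagged words in a transcript. It assumes each French or English span is written as `<word or phrase>` on a single line, which is one reading of the description rather than a confirmed file format; the function name `code_switch_stats` and the sample transcript are hypothetical.

```python
import re

# Matches one tagged span such as "<bonjour>" or "<ok merci>" (assumed format).
TAG_RE = re.compile(r"<([^<>]+)>")


def code_switch_stats(transcript: str) -> dict:
    """Count words inside "<>" tags versus words outside them."""
    tagged_words = []
    for span in TAG_RE.findall(transcript):
        tagged_words.extend(span.split())
    # Remove tagged spans, then count what is left as untagged (Tunisian) words.
    untagged_words = TAG_RE.sub(" ", transcript).split()
    total = len(tagged_words) + len(untagged_words)
    return {
        "tagged": len(tagged_words),
        "untagged": len(untagged_words),
        "tagged_ratio": len(tagged_words) / total if total else 0.0,
    }


# Placeholder tokens w1..w3 stand in for Tunisian words.
stats = code_switch_stats("w1 <bonjour> w2 w3 <ok merci>")
```

Because the tags mark French and English alike, this sketch can only separate tagged from untagged words; it cannot tell the two foreign languages apart.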
The TunSwitch CS samples come from a set of radio shows and podcasts covering diverse topics and a large number of unique speakers. The audio is first segmented into chunks, using the WebRTC-VAD algorithm for silence detection so that words are kept intact. We then used a Pyannote overlap detection model to remove sections with overlapping speech. Finally, a music detection model is employed to eliminate chunks containing music, which could degrade ASR model accuracy.
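The segmentation step can be pictured with a small sketch that turns per-frame speech/non-speech decisions into chunks, cutting only at sufficiently long silences so that words are not split. This is not the authors' pipeline: the frame flags are assumed to come from a voice activity detector such as WebRTC-VAD, and the function name `chunks_from_vad`, the 30 ms frame length, and the silence threshold are illustrative assumptions.

```python
def chunks_from_vad(flags, frame_ms=30, min_silence_frames=10):
    """Group per-frame VAD flags into (start_ms, end_ms) speech chunks.

    A chunk is closed only after `min_silence_frames` consecutive
    non-speech frames, so short pauses inside a word or phrase do not
    cause a cut.
    """
    chunks = []
    start = None        # index of the first speech frame in the open chunk
    last_speech = None  # index of the most recent speech frame
    silence = 0         # consecutive non-speech frames since last speech
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
                start = None
                silence = 0
    if start is not None:  # close a chunk still open at the end of the audio
        chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return chunks


# Hypothetical flags: a one-frame pause is bridged, a three-frame pause cuts.
flags = [False, True, True, False, True, False, False, False, True]
result = chunks_from_vad(flags, frame_ms=30, min_silence_frames=2)
```

Overlap and music filtering would then be applied per chunk by separate models, which this sketch does not attempt to reproduce.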
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The OrienTel Tunisia MCA (Modern Colloquial Arabic) database comprises 792 Tunisian speakers (426 males, 366 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
• 1 isolated single digit
• 1 sequence of 10 isolated digits
• 5 connected digits: 1 prompt sheet number (6 digits), 1 telephone number (6-15 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits), 1 spontaneous phone number
• 2 currency money amounts
• 1 natural number
• 4 dates: 1 spontaneous (date or year of birth), 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
• 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
• 3 spelled words: 1 spontaneous (own forename), 1 city name, 1 real word for coverage
• 5 directory assistance utterances: 1 spontaneous, own forename, 1 city of childhood (spontaneous), 1 frequent city name, 1 frequent company name, 1 common forename and surname
• 2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
• 6 application keywords/keyphrases
• 1 word spotting phrase using embedded application words
• 4 phonetically rich words
• 9 phonetically rich sentences
• 2+3 spontaneous items (for control)
• 1 free spontaneous speech

The following age distribution has been obtained: 516 speakers are between 16 and 30, 193 speakers are between 31 and 45, 82 speakers are between 46 and 60, 1 speaker is over 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The OrienTel Tunisia MSA (Modern Standard Arabic) database comprises 598 Tunisian speakers (359 males, 239 females) recorded over the Tunisian fixed and mobile telephone network. This database is partitioned into 1 CD and 1 DVD. The speech databases made within the OrienTel project were validated by SPEX, the Netherlands, to assess their compliance with the OrienTel format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz A-law. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
• 1 isolated single digit
• 2 sequences of 5 isolated digits
• 7+1 connected digits: 1 prompt sheet number (6 digits), 6 strings of 4 digits in written format, +1 prompt sheet number in digits
• 2 currency money amounts
• 2 natural numbers
• 3 dates: 1 prompted date, 1 relative or general date expression, 1 prompted date phrase (Islamic calendar)
• 1 time phrase
• 2 spelled words: string of 4 letter sequences
• 3 directory assistance utterances: 1 frequent city name, 1 frequent company name, 1 personal name (first name and family name)
• 2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question
• 6 application keywords/keyphrases
• 1 word spotting phrase using embedded application words
• 4 phonetically rich words
• 9 phonetically rich sentences
• 4+1 spontaneous items (for control)

The following age distribution has been obtained: 2 speakers are below 16, 441 speakers are between 16 and 30, 101 speakers are between 31 and 45, 54 speakers are between 46 and 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.