Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
Key Features
- Country: Name of the country.
- Density (P/Km2): Population density measured in persons per square kilometer.
- Abbreviation: Abbreviation or code representing the country.
- Agricultural Land (%): Percentage of land area used for agricultural purposes.
- Land Area (Km2): Total land area of the country in square kilometers.
- Armed Forces Size: Size of the armed forces in the country.
- Birth Rate: Number of births per 1,000 population per year.
- Calling Code: International calling code for the country.
- Capital/Major City: Name of the capital or major city.
- CO2 Emissions: Carbon dioxide emissions in tons.
- CPI: Consumer Price Index, a measure of inflation and purchasing power.
- CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
- Currency_Code: Currency code used in the country.
- Fertility Rate: Average number of children born to a woman during her lifetime.
- Forested Area (%): Percentage of land area covered by forests.
- Gasoline_Price: Price of gasoline per liter in local currency.
- GDP: Gross Domestic Product, the total value of goods and services produced in the country.
- Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
- Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
- Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
- Largest City: Name of the country's largest city.
- Life Expectancy: Average number of years a newborn is expected to live.
- Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
- Minimum Wage: Minimum wage level in local currency.
- Official Language: Official language(s) spoken in the country.
- Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
- Physicians per Thousand: Number of physicians per thousand people.
- Population: Total population of the country.
- Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
- Tax Revenue (%): Tax revenue as a percentage of GDP.
- Total Tax Rate: Overall tax burden as a percentage of commercial profits.
- Unemployment Rate: Percentage of the labor force that is unemployed.
- Urban Population: Percentage of the population living in urban areas.
- Latitude: Latitude coordinate of the country's location.
- Longitude: Longitude coordinate of the country's location.
Potential Use Cases
- Analyze population density and land area to study spatial distribution patterns.
- Investigate the relationship between agricultural land and food security.
- Examine carbon dioxide emissions and their impact on climate change.
- Explore correlations between economic indicators such as GDP and various socio-economic factors.
- Investigate educational enrollment rates and their implications for human capital development.
- Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
- Study labor market dynamics through indicators such as labor force participation and unemployment rates.
- Investigate the role of taxation and its impact on economic development.
- Explore urbanization trends and their social and environmental consequences.
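Several of the use cases above reduce to simple column-wise analysis once the file is loaded. A minimal sketch with pandas (the rows below are invented placeholders; only the column names follow the Key Features list):

```python
import pandas as pd

# Toy rows standing in for the real dataset; values are illustrative only,
# but the column names follow the Key Features list above.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [1.0e12, 5.0e11, 2.0e12, 8.0e10],
    "Life Expectancy": [78.0, 72.5, 81.2, 65.3],
    "Urban Population": [80.1, 55.4, 86.7, 30.2],
})

# Correlation between an economic indicator and a health outcome.
corr = df["GDP"].corr(df["Life Expectancy"])
print(round(corr, 3))
```

The same pattern extends to any pair of numeric columns once real rows replace the placeholders.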
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Speech Emotion Recognition (SER) is a rapidly evolving field of research aimed at identifying and categorizing emotional states through the analysis of speech signals. As SER holds significant socio-cultural and commercial importance, researchers are increasingly leveraging machine learning and deep learning techniques to drive advancements in this domain. A high-quality dataset is an essential resource for SER studies in any language. Despite Urdu being the 10th most spoken language globally, there is a significant lack of robust SER datasets, creating a research gap. Existing Urdu SER datasets are often limited by their small size, narrow emotional range, and repetitive content, reducing their applicability in real-world scenarios.

To address this gap, the Urdu Speech Emotion Recognition (UrduSER) dataset was developed. This comprehensive dataset includes 3500 Urdu speech signals sourced from 10 professional actors, with an equal representation of male and female speakers from diverse age groups. The dataset encompasses seven emotional states: Angry, Fear, Boredom, Disgust, Happy, Neutral, and Sad. The speech samples were curated from a wide collection of Pakistani Urdu drama serials and telefilms available on YouTube, ensuring diversity and natural delivery. Unlike conventional datasets, which rely on predefined dialogs recorded in controlled environments, UrduSER features unique and contextually varied utterances, making it more realistic and applicable for practical applications.

To ensure balance and consistency, the dataset contains 500 samples per emotional class, with 50 samples contributed by each actor for each emotion. Additionally, an accompanying Excel file provides detailed metadata for each recording, including the file name, duration, format, sample rate, actor details, emotional state, and corresponding Urdu dialog. This metadata enables researchers to efficiently organize and utilize the dataset for their specific needs.
The UrduSER dataset underwent rigorous validation, integrating expert evaluation and model-based validation to ensure its reliability, accuracy, and overall suitability for advancing research and development in Urdu Speech Emotion Recognition.
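The balance constraints described above (50 samples per actor per emotion, 500 per class, 3500 in total) can be checked mechanically from the metadata. A sketch with pandas, using invented column names and actor IDs rather than the Excel file's actual headers:

```python
from itertools import product

import pandas as pd

# Toy metadata mirroring the layout described above; the real metadata is
# an Excel sheet, and these column names are assumptions, not its headers.
emotions = ["Angry", "Fear", "Boredom", "Disgust", "Happy", "Neutral", "Sad"]
actors = [f"actor_{i:02d}" for i in range(1, 11)]
meta = pd.DataFrame(
    [{"actor": a, "emotion": e, "file": f"{a}_{e}_{n:02d}.wav"}
     for a, e in product(actors, emotions) for n in range(50)]
)

# Check the stated balance: 50 clips per actor per emotion,
# hence 500 per emotional class and 3500 samples overall.
assert meta.groupby(["actor", "emotion"]).size().eq(50).all()
assert meta.groupby("emotion").size().eq(500).all()
assert len(meta) == 3500
```

Pointing the same group-by checks at the real metadata file (e.g. via `pd.read_excel`) verifies the dataset before training.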
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.
Each entry includes demographic, professional, and platform-related information such as:
- Name, gender, age, and country
- Primary skill and years of experience
- Hourly rate (with mixed formatting), client rating, and satisfaction score
- Language spoken (based on country)
- Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)
- Gender-based names using Faker’s male/female name generators
- Realistic age and experience distribution (with missing and noisy values)
- Country-language pairs mapped using actual linguistic data
- Messy formatting: mixed data types, missing values, inconsistent casing
- Generated entirely in Python using the Faker library; no real data was used
Potential Use Cases
- Practicing data cleaning and preprocessing
- Performing EDA (Exploratory Data Analysis)
- Developing data pipelines: raw → clean → model-ready
- Teaching feature engineering and handling real-world dirty data
- Exercises in data validation, outlier detection, and format standardization
File: global_freelancers_raw.csv
| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
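Cleaning the messy fields described in the table is a typical first exercise. A hedged sketch (the raw rows below are invented, and the normalization rules are one reasonable choice, not a prescribed recipe):

```python
import pandas as pd

# Toy rows imitating the messiness described above (values invented).
raw = pd.DataFrame({
    "gender": ["Male", "FEMALE", " male ", "F", None],
    "is_active": ["yes", 0, True, "N", "1"],
    "hourly_rate (USD)": ["$25", "40", None, "USD 55", "30.5"],
})

# Normalise gender to a small controlled vocabulary.
gender_map = {"male": "male", "m": "male", "female": "female", "f": "female"}
raw["gender"] = (raw["gender"].astype("string").str.strip().str.lower()
                 .map(gender_map))

# Coerce the mixed active flags (strings, numbers, booleans) to booleans.
truthy = {"yes", "y", "true", "1"}
raw["is_active"] = (raw["is_active"].astype(str).str.strip().str.lower()
                    .isin(truthy))

# Strip currency text and parse the rate as a float.
raw["hourly_rate (USD)"] = pd.to_numeric(
    raw["hourly_rate (USD)"].astype("string")
       .str.replace(r"[^\d.]", "", regex=True),
    errors="coerce")
print(raw.dtypes)
```

The same three moves (case-fold and map, coerce to booleans, regex-strip then `to_numeric`) cover most of the inconsistencies listed above.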
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
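As one example of the outlier removal the raw measures may need, a robust (median/MAD-based) trim can be sketched as follows; the threshold and the sample durations are illustrative assumptions, not SPADE's actual cleaning recipe:

```python
import numpy as np

def drop_outliers(values, thresh=3.5):
    """Drop points whose modified z-score (median/MAD based) exceeds thresh."""
    v = np.asarray(values, dtype=float)
    med = np.median(v)
    mad = np.median(np.abs(v - med))
    mz = 0.6745 * np.abs(v - med) / mad  # modified z-score
    return v[mz <= thresh]

durations = [0.08, 0.11, 0.09, 0.10, 2.50]  # one implausible duration (seconds)
print(drop_outliers(durations))
```

A median/MAD criterion is used here rather than a mean/SD one because a single extreme value inflates the standard deviation enough to hide itself.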
Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.
Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.
We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.
Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions which have now been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.
Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of...
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset, the Twitter Italian Negation (TIN) Corpus, provides an interesting glimpse into language change in Romance languages with the emergence of non-standard uses of negations. This collection contains 10,000 tweets from ten different cities (Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City), each collected in August 2019. The data includes tokenized text and frequency measures for each tweet as well as a city column so users can explore regional differences. With this resource users can uncover how the language of these cities is changing over time, or even how language usage between neighboring countries or states may differ. Get ready to dive deep into the fascinating shifts that occur between spoken and written languages!
This dataset contains 10,000 tweets in Italian gathered from ten different cities between August and December 2019. This collection of tweets provides an interesting insight into the language change phenomena in Romance languages, specifically with regard to non-standard uses of negations.
The dataset is composed of five columns: token, absolute frequency, relative frequency, variation, and the city from which the tweet originated. Each row represents a single token in a particular tweet: each tweet can contain more than one token.
By using this dataset you can analyze and compare patterns of usage across different cities or even within a specific city. You can also compare variations within tokens between different cities to understand how certain constructions are used differently across regions or dialects. Additionally you could use this data to examine trends in literary works such as poetry by looking at the most commonly used words and phrases over time.
To use the data effectively, it is important first to understand what each column represents:
Tok (Tokenized text): This is text that has been broken down into individual words or tokens representing all of the words found in a particular tweet including punctuation marks like commas or exclamation points;
Abs (Absolute Frequency): This is the total number of times that a particular token appears within all tweets;
Rel (Relative Frequency): This is calculated as the number of times a particular token appears relative to the total number of tokens;
Var (Variation): This indicates whether there have been any alterations made compared to standard usage such as “has” being replaced with “haz”;
City: The city of the tweet's originator, enabling analysis of usage differences among locales (for example “Milan” or “Genoa”) but also across larger geographic areas, such as Italy versus other countries like the United States.
Using these numeric values alongside thematic exploration allows you to understand not only individual usages but also broader trends across geographic populations, both locally and globally, in how Twitter users deploy non-standard negation constructs throughout Italy.
- Studying the regional variation of Italian negation constructions by comparing the frequency and variation between cities.
- Investigating language change over time by tracking changes in relative and absolute frequencies of negation constructions across tweets.
- Exploring how different socio-economic contexts or trends such as news, fashion, sports impacted the evolution of language use in tweets in each city
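A minimal sketch of the first use case, comparing a negation token's relative frequency between two cities (all tokens and counts below are invented placeholders):

```python
import pandas as pd

# Toy rows using the five columns described above (counts invented).
tin = pd.DataFrame({
    "tok":  ["mica", "mica", "non", "non"],
    "abs":  [12, 30, 420, 510],
    "rel":  [0.004, 0.009, 0.14, 0.15],
    "var":  [True, True, False, False],
    "city": ["Milan", "Naples", "Milan", "Naples"],
})

# Compare the relative frequency of one negation token across cities.
by_city = tin[tin["tok"] == "mica"].set_index("city")["rel"]
print(by_city.to_dict())
```

With the real CSV loaded in place of the toy frame, the same filter-and-index pattern supports any per-city comparison.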
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: interessa+word1.csv

| Column name | Description |
|:------------|:-------------------------------------|
| tok | Tokenized text of the tweet. (String) |
| abs | Absolute frequency of a token in the... |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.

Data Notes:

* LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.
* LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.
* The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.
* Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.
* Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.
* There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.

Data Source:

* Centre for Education Statistics and Evaluation, NSW Department of Education.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers a powerful synthetic English-ASL gloss parallel corpus that was generated in 2012, providing an exciting opportunity to bridge the cultural divide between English and American Sign Language. By exploring this cross-cultural language interoperability, it aims to connect linguistic communities and bring together aspects of communication often seen as separated. The data supports innovative approaches to machine translation models and helps to uncover further insights into bridging linguistic divides.
The dataset is typically provided in CSV file format, specifically referenced as train.csv. It comprises two primary columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values, indicating that the dataset consists of approximately 81,123 records.
This dataset can be used for a variety of applications, including machine translation between English and ASL gloss and broader sign language processing research.
The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.
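A quick way to sanity-check the two-column layout after loading train.csv; the rows below are invented illustrations, not entries from the corpus:

```python
from io import StringIO

import pandas as pd

# Two invented rows in the described train.csv layout (gloss / text).
sample_csv = StringIO(
    "gloss,text\n"
    "WEATHER NICE EVERYWHERE,the weather is nice everywhere\n"
    "BOOK I LIKE,i like the book\n"
)
pairs = pd.read_csv(sample_csv)
print(pairs.shape)  # (rows, columns)
```

Replacing the `StringIO` buffer with the path to the real train.csv applies the same check to the full corpus.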
CC0
This dataset is ideal for training and evaluating machine translation models between English text and ASL gloss.
Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.
Data Notes:
LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.
LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.
The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.
Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.
Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.
There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.
Data Source:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pashtu is a language spoken by more than 50 million people in the world, and it is the national language of Afghanistan. Pashtu is also spoken in the two largest provinces of Pakistan (Khyber Pakhtunkhwa and Balochistan). Although optical character recognition systems for other languages are in a very developed form, very little work has been reported for the Pashtu language. As an initial step, we introduce this dataset for digit recognition.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government preschools.
Data Notes:
LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.
Preschools include all preschools attached to government primary/infant schools, Dubbo School of Distance Education, School of the Air and the John Brotchie Nursery School. Government funded community preschools and NSW Centre-Based services that provide a preschool program in NSW are not included.
Students include children enrolled in a preschool or an Early Intervention program that is run by a NSW government school. These government preschool classes provide full-time or part-time schooling at pre-primary level.
LBOTE enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.
Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.
Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages. ‘Other language groups’ includes languages with small enrolments. The total number of languages included in ‘Other language groups’ is specified in the notes at the bottom of the table.
Data Source:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Question answering (QA) is the field of information retrieval (IR) aimed at answering questions from paragraphs in natural language processing (NLP). Essentially, IR is a technique to retrieve and rank documents based on keywords, while in a QA system, answers to questions are retrieved based on the paragraph's content. The Kurdish language belongs to the Indo-European family and is spoken by 30-40 million people worldwide. Almost all Kurdish people speak the Sorani and Kurmanji dialects. This dataset uses the Sorani dialect in the first attempt to collect and create a Kurdish News Question-Answering Dataset (KNQAD). The texts are collected from numerous Kurdish news websites covering various fields such as religion, social issues, art, health, economy, politics, sports, and more. In this project, 15,002 question-answer pairs were created manually from 15,002 paragraphs. Three preprocessing steps are implemented on the raw text paragraphs: stemming, removing stop words, and removing special characters.
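The three preprocessing steps can be sketched as a small pipeline. Note the stop-word list and suffix rules below are tiny invented placeholders, not the actual Sorani resources the dataset authors used:

```python
import re

# Illustrative-only resources: a tiny stop-word list and a naive suffix
# stripper standing in for a real Sorani stemmer (both are placeholders).
STOP_WORDS = {"le", "bo", "ke"}
SUFFIXES = ("ekan", "eke", "an")

def preprocess(paragraph: str) -> list:
    # 1) remove special characters, keeping word characters and spaces
    cleaned = re.sub(r"[^\w\s]", " ", paragraph)
    tokens = cleaned.lower().split()
    # 2) remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3) crude stemming: strip a known suffix if one matches
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("hewalekan le malper, ke nwe bun!"))
```

A production pipeline would swap the placeholder lists for a real Sorani stop-word inventory and stemmer, but the three-stage shape stays the same.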
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
German-English Code-Switching speech dataset
We provide means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC) enabling a particular focus on code-switching. This results in the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles which can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely-available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers comprising 386h of speech. This corpus is available at http://nats.gitlab.io/swc.
In SWC, since most of the articles are long, the recordings submitted by the volunteers are also long (about 54 minutes on average). These audio files are manually annotated at the word level and also at the segment level in XML format. We use a language identification tool to detect code-switching as runs of consecutive English tokens in the transcriptions of the audio files. To extract intra-sentential code-switching segments, we ensure that the detected code-switching is preceded and followed by German words or sentences. The final set consists of 34h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).
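The extraction rule described above (an English run preceded and followed by German material) can be sketched as follows; the per-token word lists stand in for a real language identification tool and are invented:

```python
# Toy per-token language tagger standing in for a real language
# identification tool (word lists are invented placeholders).
GERMAN = {"der", "ist", "ein", "und", "wir", "nutzen", "gut"}
ENGLISH = {"machine", "learning", "framework", "deep"}

def tag(token):
    t = token.lower()
    if t in GERMAN:
        return "de"
    if t in ENGLISH:
        return "en"
    return "other"

def cs_spans(tokens):
    """Yield English spans that are preceded and followed by German tokens,
    i.e. intra-sentential code-switching as described above."""
    tags = [tag(t) for t in tokens]
    i = 0
    while i < len(tags):
        if tags[i] == "en":
            j = i
            while j < len(tags) and tags[j] == "en":
                j += 1
            if i > 0 and tags[i - 1] == "de" and j < len(tags) and tags[j] == "de":
                yield tokens[i:j]
            i = j
        else:
            i += 1

sent = "wir nutzen ein machine learning framework und der ist gut".split()
print(list(cs_spans(sent)))
```

The real pipeline replaces the word lists with a proper language identifier, but the "English run flanked by German" condition is the same.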
Citation
@article{baumann2019spoken,
title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},
author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},
journal={Language Resources and Evaluation},
volume={53},
number={2},
pages={303--329},
year={2019},
publisher={Springer}
}
@article{grave2018learning,
title={Learning word vectors for 157 languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1802.06893},
year={2018}
}
ELRA VAR licence: https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
ELRA End User licence: https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large vocabulary continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson.
Alternatively, the data could be delivered unshortened.

Hausa is a member of the Chadic language family, and belongs together with the Semitic and Cushitic languages to the Afroasiatic language family. With over 25 million speakers, it is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover main Hausa newspaper sources. Hausa’s modern official orthography is a Latin-based alphabet called Boko, which was imposed in the 1930s by the British colonial administration. It consists of 22 characters of the English alphabet plus five special characters. The collection is based on five main newspapers written in Boko. After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection...
With a population just short of 3 million people, the city of Toronto is the largest in Canada, and one of the largest in North America (behind only Mexico City, New York and Los Angeles). Toronto is also one of the most multicultural cities in the world, making life in Toronto a wonderful multicultural experience for all. More than 140 languages and dialects are spoken in the city, and almost half of Toronto's population was born outside Canada. It is a place where people can try the best of each culture, either while they work or just passing through. Toronto is well known for its great food.
This dataset was created by web scraping the Toronto Wikipedia page. It contains the latitude and longitude of all the neighborhoods and boroughs, with postal codes, of the city of Toronto, Canada.
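With neighborhood coordinates in hand, pairwise distances follow from the haversine formula; a sketch (the coordinates below are approximate example points, not values taken from the dataset):

```python
import math

# Haversine great-circle distance between two (lat, lon) points in km.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate downtown Toronto to Scarborough.
d = haversine_km(43.6532, -79.3832, 43.7731, -79.2578)
print(round(d, 1))
```

Applied across all neighborhood rows, this supports nearest-neighbor queries and clustering by geographic proximity.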
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language; it has recently gained popularity online and is attracting attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting meaningful studies. In this paper, we address this challenge by collecting the first large-scale Urdu tweet dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manually labeling such a large number of tweets would be tedious, error-prone, and practically impossible, so the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, together with SentiWordNet, drive this weakly supervised labeling approach, which categorizes the extracted tweets as positive, negative, or neutral. Baseline deep learning models are implemented to compare the accuracy of three labeling approaches: VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other, largely because these models do not consider emoticons when assigning polarity.
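The emoticon side of such a weakly supervised scheme amounts to a polarity vote over emoticon counts. A minimal sketch (the emoticon sets are illustrative, and the SentiWordNet component the paper also uses is omitted here):

```python
# Illustrative polarity lexicons; a real system would use a fuller emoticon inventory
POSITIVE = {"😊", "😂", "❤", ":)", ":-)"}
NEGATIVE = {"😢", "😡", "💔", ":(", ":-("}

def weak_label(tweet: str) -> str:
    """Assign a coarse polarity label from emoticon counts alone."""
    pos = sum(tweet.count(e) for e in POSITIVE)
    neg = sum(tweet.count(e) for e in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

Tweets with no emoticon signal fall through to "neutral", which is exactly where a lexicon such as SentiWordNet would be consulted in the full approach.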
RUSLAN is a Russian spoken-language corpus for the text-to-speech task. RUSLAN contains 22,200 audio samples with text annotations – more than 31 hours of high-quality speech from one person – making it one of the largest annotated Russian corpora in terms of speech duration for a single speaker.
https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/R1WHEA
Corpus PINO (Corpus Pluristilistico di Italiano e Napoletano Orali, “Multistylistic Corpus of Spoken Italian and Neapolitan”) is a resource designed for research on different styles of spoken Italian and the Neapolitan dialect. The corpus consists of anonymized audio recordings and ELAN time-aligned orthographic transcriptions involving fifty participants (stratified by age, gender, and education level). PINO includes four kinds of spoken activities: a sociolinguistic interview; an adapted DIAPIX (a “spot the differences” game); a reading list; and a questionnaire with open answers on local language and culture. Corpus PINO was designed to allow for inter-variety as well as intra-variety analysis. It also allows for analyses of inter-speaker and intra-speaker variation, as each speaker carried out the same four tasks. This structure was conceived as a way to encourage systematic and replicable research based on parallel comparisons. The conclusions drawn for the portion of the Italian continuum PINO targets can then be used for cross-linguistic comparison with similar continua where quantitative evidence is already available. PINO is also a contribution to the preservation of the local cultural heritage and of a minority language, i.e., an Italo-Romance dialect. It attests the lives, memories, opinions, traditions, practices, and attitudes of fifty members of this community, thus photographing these aspects at a specific moment in time – a post-postmodern society where the tension between global and local plays a pivotal role – and in a place – the province of Naples – often framed in terms of contradictions, polyvalency, and exceptionality. Hence, Corpus PINO might be used not only for strictly linguistic or discourse analysis, but for more sociologically based work as well.
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
This work was accepted at Findings of ACL 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.
If you use our dataset, please cite our paper.
Dialogue Data We collect dialogue data for DialogSum from three public dialogue corpora, namely DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019), and MuTual (Cui et al., 2019), as well as an English-speaking-practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends or colleagues, or between service providers and customers.
Compared with previous datasets, dialogues in DialogSum have distinct characteristics: * They occur in rich real-life scenarios, including more diverse task-oriented scenarios; * They have clear communication patterns and intents, which makes them valuable as summarization sources; * They have a reasonable length, which suits the purpose of automatic summarization.
Summaries We ask annotators to summarize each dialogue according to the following criteria: * Convey the most salient information; * Be brief; * Preserve important named entities within the conversation; * Be written from an observer's perspective; * Be written in formal language.
Topics In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.
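Assuming the release ships as JSON Lines with fields named `dialogue`, `summary`, and `topic` (check the distributed files for the exact schema), loading a split reduces to parsing one JSON object per line:

```python
import io
import json

def load_jsonl(fp):
    """Parse one JSON object per non-empty line of a JSON Lines file."""
    return [json.loads(line) for line in fp if line.strip()]

# Tiny in-memory stand-in for a DialogSum split file (field names assumed)
sample = io.StringIO(
    '{"dialogue": "#Person1#: Hi! #Person2#: Hello.", '
    '"summary": "Two people greet each other.", "topic": "greeting"}\n'
)
records = load_jsonl(sample)
```

Replacing the `io.StringIO` stand-in with `open("train.jsonl", encoding="utf-8")` reads an actual split file the same way.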
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
Key Features
- Country: Name of the country.
- Density (P/Km2): Population density measured in persons per square kilometer.
- Abbreviation: Abbreviation or code representing the country.
- Agricultural Land (%): Percentage of land area used for agricultural purposes.
- Land Area (Km2): Total land area of the country in square kilometers.
- Armed Forces Size: Size of the armed forces in the country.
- Birth Rate: Number of births per 1,000 population per year.
- Calling Code: International calling code for the country.
- Capital/Major City: Name of the capital or major city.
- CO2 Emissions: Carbon dioxide emissions in tons.
- CPI: Consumer Price Index, a measure of inflation and purchasing power.
- CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
- Currency_Code: Currency code used in the country.
- Fertility Rate: Average number of children born to a woman during her lifetime.
- Forested Area (%): Percentage of land area covered by forests.
- Gasoline_Price: Price of gasoline per liter in local currency.
- GDP: Gross Domestic Product, the total value of goods and services produced in the country.
- Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
- Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
- Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
- Largest City: Name of the country's largest city.
- Life Expectancy: Average number of years a newborn is expected to live.
- Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
- Minimum Wage: Minimum wage level in local currency.
- Official Language: Official language(s) spoken in the country.
- Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
- Physicians per Thousand: Number of physicians per thousand people.
- Population: Total population of the country.
- Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
- Tax Revenue (%): Tax revenue as a percentage of GDP.
- Total Tax Rate: Overall tax burden as a percentage of commercial profits.
- Unemployment Rate: Percentage of the labor force that is unemployed.
- Urban Population: Percentage of the population living in urban areas.
- Latitude: Latitude coordinate of the country's location.
- Longitude: Longitude coordinate of the country's location.
Potential Use Cases
- Analyze population density and land area to study spatial distribution patterns.
- Investigate the relationship between agricultural land and food security.
- Examine carbon dioxide emissions and their impact on climate change.
- Explore correlations between economic indicators such as GDP and various socio-economic factors.
- Investigate educational enrollment rates and their implications for human capital development.
- Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
- Study labor market dynamics through indicators such as labor force participation and unemployment rates.
- Investigate the role of taxation and its impact on economic development.
- Explore urbanization trends and their social and environmental consequences.
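Before any of the analyses above, a quick sanity check is to validate the Density column against Population / Land Area. A minimal sketch using the column names listed in Key Features (the two country rows are fictional placeholders):

```python
import csv
import io

# Fictional rows mirroring the dataset's column names; values are illustrative only
raw = """Country,Population,Land Area (Km2),Density (P/Km2)
Freedonia,1000000,50000,20
Sylvania,3000000,100000,30
"""

def check_density(text, tol=0.5):
    """Return, per country, whether the reported density matches population / land area."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        implied = float(row["Population"]) / float(row["Land Area (Km2)"])
        result[row["Country"]] = abs(implied - float(row["Density (P/Km2)"])) <= tol
    return result

checks = check_density(raw)
```

Rows that fail this check usually signal stale population figures or unit mismatches and are worth inspecting before running correlations.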