25 datasets found
  1. Global Country Information 2023

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 15, 2024
    Cite
    Nidula Elgiriyewithana; Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nidula Elgiriyewithana; Nidula Elgiriyewithana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

    Key Features

    • Country: Name of the country.
    • Density (P/Km2): Population density measured in persons per square kilometer.
    • Abbreviation: Abbreviation or code representing the country.
    • Agricultural Land (%): Percentage of land area used for agricultural purposes.
    • Land Area (Km2): Total land area of the country in square kilometers.
    • Armed Forces Size: Size of the armed forces in the country.
    • Birth Rate: Number of births per 1,000 population per year.
    • Calling Code: International calling code for the country.
    • Capital/Major City: Name of the capital or major city.
    • CO2 Emissions: Carbon dioxide emissions in tons.
    • CPI: Consumer Price Index, a measure of inflation and purchasing power.
    • CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
    • Currency_Code: Currency code used in the country.
    • Fertility Rate: Average number of children born to a woman during her lifetime.
    • Forested Area (%): Percentage of land area covered by forests.
    • Gasoline_Price: Price of gasoline per liter in local currency.
    • GDP: Gross Domestic Product, the total value of goods and services produced in the country.
    • Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
    • Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
    • Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
    • Largest City: Name of the country's largest city.
    • Life Expectancy: Average number of years a newborn is expected to live.
    • Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
    • Minimum Wage: Minimum wage level in local currency.
    • Official Language: Official language(s) spoken in the country.
    • Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
    • Physicians per Thousand: Number of physicians per thousand people.
    • Population: Total population of the country.
    • Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
    • Tax Revenue (%): Tax revenue as a percentage of GDP.
    • Total Tax Rate: Overall tax burden as a percentage of commercial profits.
    • Unemployment Rate: Percentage of the labor force that is unemployed.
    • Urban Population: Percentage of the population living in urban areas.
    • Latitude: Latitude coordinate of the country's location.
    • Longitude: Longitude coordinate of the country's location.

    Potential Use Cases

    • Analyze population density and land area to study spatial distribution patterns.
    • Investigate the relationship between agricultural land and food security.
    • Examine carbon dioxide emissions and their impact on climate change.
    • Explore correlations between economic indicators such as GDP and various socio-economic factors.
    • Investigate educational enrollment rates and their implications for human capital development.
    • Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
    • Study labor market dynamics through indicators such as labor force participation and unemployment rates.
    • Investigate the role of taxation and its impact on economic development.
    • Explore urbanization trends and their social and environmental consequences.
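
    The first and fourth use cases above can be sketched in a few lines. This is a minimal illustration, not the dataset author's method: the column names `Country`, `Population`, and `Land Area (Km2)` are taken from the feature list, while the three rows, the helper names, and the assumption that numbers may carry thousands separators are invented for the example.

```python
import csv
import io
import math

# Invented placeholder rows; the real file has many more columns, and the
# formatting of its numeric fields is an assumption here.
SAMPLE_CSV = """Country,Population,Land Area (Km2)
Country A,1000000,50000
Country B,250000,1000
Country C,9000000,300000
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_number(text):
    """Convert a numeric string to float, tolerating thousands separators."""
    return float(text.replace(",", "").strip())


def population_density(rows):
    """Return {country: persons per square kilometer}."""
    return {
        row["Country"]: to_number(row["Population"]) / to_number(row["Land Area (Km2)"])
        for row in rows
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    for country, density in population_density(rows).items():
        print(f"{country}: {density:.1f} persons per km2")
```

    The same `pearson` helper could be applied to any two numeric columns (for example GDP and life expectancy) once they are parsed with `to_number`.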
  2. UrduSER: A Dataset for Urdu Speech Emotion Recognition

    • data.mendeley.com
    Updated Apr 28, 2025
    + more versions
    Cite
    Muhammad Zaheer Akhtar (2025). UrduSER: A Dataset for Urdu Speech Emotion Recognition [Dataset]. http://doi.org/10.17632/jcpfjnk5c2.4
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Muhammad Zaheer Akhtar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Speech Emotion Recognition (SER) is a rapidly evolving field of research aimed at identifying and categorizing emotional states through the analysis of speech signals. As SER holds significant socio-cultural and commercial importance, researchers are increasingly leveraging machine learning and deep learning techniques to drive advancements in this domain. A high-quality dataset is an essential resource for SER studies in any language. Despite Urdu being the 10th most spoken language globally, there is a significant lack of robust SER datasets, creating a research gap. Existing Urdu SER datasets are often limited by their small size, narrow emotional range, and repetitive content, reducing their applicability in real-world scenarios.

    To address this gap, the Urdu Speech Emotion Recognition (UrduSER) dataset was developed. It includes 3500 Urdu speech signals sourced from 10 professional actors, with an equal representation of male and female speakers from diverse age groups. The dataset encompasses seven emotional states: Angry, Fear, Boredom, Disgust, Happy, Neutral, and Sad. The speech samples were curated from a wide collection of Pakistani Urdu drama serials and telefilms available on YouTube, ensuring diversity and natural delivery. Unlike conventional datasets, which rely on predefined dialogs recorded in controlled environments, UrduSER features unique and contextually varied utterances, making it more realistic and better suited to practical applications. To ensure balance and consistency, the dataset contains 500 samples per emotional class, with 50 samples contributed by each actor for each emotion.

    An accompanying Excel file provides detailed metadata for each recording, including the file name, duration, format, sample rate, actor details, emotional state, and corresponding Urdu dialog. This metadata enables researchers to efficiently organize and utilize the dataset for their specific needs. The UrduSER dataset underwent rigorous validation, integrating expert evaluation and model-based validation to ensure its reliability, accuracy, and overall suitability for advancing research and development in Urdu Speech Emotion Recognition.
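
    The balance figures in the description are mutually consistent: 10 actors × 7 emotions × 50 samples gives 3500 recordings, and 10 actors × 50 samples gives 500 per emotion. A small sketch of how that balance could be checked against the metadata follows. The actor identifiers and the `(actor, emotion)` record shape are hypothetical, since the layout of the Excel file is not given here.

```python
from collections import Counter
from itertools import product

# Emotion labels as listed in the dataset description.
EMOTIONS = ["Angry", "Fear", "Boredom", "Disgust", "Happy", "Neutral", "Sad"]
# Hypothetical actor identifiers; the real metadata may label actors differently.
ACTORS = [f"actor_{i:02d}" for i in range(1, 11)]
SAMPLES_PER_ACTOR_PER_EMOTION = 50


def expected_counts():
    """Expected number of recordings for every (actor, emotion) pair."""
    return Counter(
        {pair: SAMPLES_PER_ACTOR_PER_EMOTION for pair in product(ACTORS, EMOTIONS)}
    )


def is_balanced(records):
    """Check that (actor, emotion) records match the documented design."""
    return Counter(records) == expected_counts()


def per_emotion_totals(records):
    """Count recordings per emotion across all actors."""
    return Counter(emotion for _, emotion in records)


if __name__ == "__main__":
    # Build a synthetic record list that follows the documented design.
    records = [pair for pair in product(ACTORS, EMOTIONS) for _ in range(50)]
    print(len(records), is_balanced(records))
    print(per_emotion_totals(records)["Happy"])
```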

  3. Global Freelancers (Raw) Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Urvish Ahir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description :

    This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

    Each entry includes demographic, professional, and platform-related information such as:

    • Name, gender, age, and country
    • Primary skill and years of experience
    • Hourly rate (with mixed formatting), client rating, and satisfaction score
    • Language spoken (based on country)
    • Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

    Key Features :

    • Gender-based names using Faker’s male/female name generators
    • Realistic age and experience distribution (with missing and noisy values)
    • Country-language pairs mapped using actual linguistic data
    • Messy formatting: mixed data types, missing values, inconsistent casing
    • Generated entirely in Python using the faker library; no real data used

    Use Cases :

    • Practicing data cleaning and preprocessing
    • Performing EDA (Exploratory Data Analysis)
    • Developing data pipelines: raw → clean → model-ready
    • Teaching feature engineering and handling real-world dirty data
    • Exercises in data validation, outlier detection, and format standardization

    File : global_freelancers_raw.csv

    | Column Name | Description |
    | --- | --- |
    | `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
    | `name` | Full name of freelancer (based on gender) |
    | `gender` | Gender (messy values and case inconsistency) |
    | `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
    | `country` | Country name (with random formatting/casing) |
    | `language` | Language spoken (mapped from country) |
    | `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
    | `years_of_experience` | Work experience in years (some missing or odd values included) |
    | `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
    | `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
    | `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
    | `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
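
    The column notes above suggest what a first cleaning pass might look like. The sketch below uses only the Python standard library and is one possible approach, not the dataset author's pipeline. The accepted spellings for `is_active` and `gender` are assumptions; the actual file may contain other variants that would need adding.

```python
import math
import re

# Assumed spellings; extend these sets after inspecting the real file.
TRUE_STRINGS = {"1", "true", "yes", "y", "active"}
FALSE_STRINGS = {"0", "false", "no", "n", "inactive"}
GENDER_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}


def _first_number(value):
    """Return the first number found in value as a float, or None if absent."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return None if math.isnan(value) else float(value)
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def parse_hourly_rate(value):
    """Turn values such as '$40' or 'USD 35.5' into a float."""
    return _first_number(value)


def parse_satisfaction(value):
    """Turn '85%' or 85 into 85.0; missing values become None."""
    return _first_number(value)


def parse_is_active(value):
    """Map mixed strings, numbers, and booleans to True, False, or None."""
    if value is None:
        return None
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return None if math.isnan(value) else bool(value)
    text = str(value).strip().lower()
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    return None


def normalize_gender(value):
    """Lower-case and map known spellings; unknown values become None."""
    if value is None:
        return None
    return GENDER_MAP.get(str(value).strip().lower())
```

    Returning `None` for anything unrecognized keeps unexpected values visible for later inspection instead of silently coercing them.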
    
  4. Speech Across Dialects of English: Acoustic Measures from SPADE Project...

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Jun 6, 2025
    Cite
    Stuart-Smith, J; Sonderegger, M; Mielke, J (2025). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
    Explore at:
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    North Carolina State University
    University of Glasgow
    McGill University
    Authors
    Stuart-Smith, J; Sonderegger, M; Mielke, J
    Time period covered
    Aug 31, 2017 - Aug 30, 2020
    Area covered
    United Kingdom
    Variables measured
    Individual, Text unit
    Measurement technique
    The acoustic measures provided were obtained from speech corpora collected as part of the SPADE project. Many of these were shared by a Data Guardian, an individual or institution with particular responsibility for one or more speech dataset(s), which they have either collected personally for a specific purpose, overseen the collection of, or now curate. The corpora are either public or private. Public corpora are either freely accessible or available for a fee. Private corpora have been collected for a specific purpose, often sociolinguistic or phonetic. Together, the corpora feature speech from the UK, Ireland, Canada and the USA and were sourced in order to obtain good dialect coverage across a variety of social dimensions (e.g. age, gender, class, ethnicity). The speech is in a variety of formats including read speech, public speeches, oral histories and sociolinguistic interviews. The corpora were either already force-aligned or alignment was carried out as part of the SPADE project. Software developed as part of the SPADE project was then used to obtain vowel durations, static vowel formant measures and sibilant measures from the speech.
    Description

    The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.

    Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

    Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

    We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

    Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions, which have been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

    Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of...

  5. Italian Negation Constructions - Tweets

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Italian Negation Constructions - Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/italian-negation-constructions-tweets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Italian Negation Constructions - Tweets

    Exploring Language Variation Across 10 Cities

    About this dataset

    This dataset, the Twitter Italian Negation (TIN) Corpus, provides an interesting glimpse into language change in Romance languages with the emergence of non-standard uses of negations. This collection contains 10,000 tweets from ten different cities (Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City), each collected in August 2019. The data includes tokenized text and frequency measures for each tweet as well as a city column so users can explore regional differences. With this resource users can uncover how the language of these cities is changing over time or even how language usage between neighboring countries or states may differ. Get ready to dive deep into the fascinating shifts that occur between spoken and written languages!

    How to use the dataset

    This dataset contains 10,000 tweets in Italian gathered from ten different cities between August and December 2019. This collection of tweets provides an interesting insight into the language change phenomena in Romance languages, specifically with regard to non-standard uses of negations.

    The dataset is composed of nine columns, including token, absolute frequency, relative frequency, variation, and the city from which the tweet originated. Each row represents a single token in a particular tweet; each tweet can contain more than one token.

    By using this dataset you can analyze and compare patterns of usage across different cities or even within a specific city. You can also compare variations within tokens between different cities to understand how certain constructions are used differently across regions or dialects. Additionally you could use this data to examine trends in literary works such as poetry by looking at the most commonly used words and phrases over time.

    To use the data effectively, it is important first to understand what each column represents:

    • Tok (Tokenized text): The text of a tweet broken down into individual words or tokens, including punctuation marks such as commas or exclamation points;

    • Abs (Absolute Frequency): The total number of times a particular token appears across all tweets;

    • Rel (Relative Frequency): How often a particular token appears relative to other tokens;

    • Var (Variation): Indicates whether the token has been altered compared to standard usage, such as “has” being replaced with “haz”;

    • City: The city from which each tweet originated, for example “Milan” or “Genoa”, which guides analysis of usage differences among locales; values may also denote larger geographic areas such as “Italy” versus other countries like “United States”.

      Using these numeric values alongside thematic exploration helps reveal not only usage but also trends across geographic populations, locally and globally, in how Twitter users employ language, especially non-standard dialectal constructs throughout Italy.
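
    To make the two frequency columns concrete, here is a minimal sketch of how absolute and relative frequencies could be computed from tokenized tweets. The description does not give the corpus's exact formula for relative frequency, so the definition used here (a token's count divided by the total number of tokens) is an assumption, and the function names and sample tokens are invented for illustration.

```python
from collections import Counter, defaultdict


def token_frequencies(tokens):
    """Return {token: (absolute_count, relative_frequency)}.

    Relative frequency is assumed to be count / total token count.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: (n, n / total) for tok, n in counts.items()}


def frequencies_by_city(rows):
    """Group (city, token) pairs by city and compute frequencies for each."""
    by_city = defaultdict(list)
    for city, token in rows:
        by_city[city].append(token)
    return {city: token_frequencies(tokens) for city, tokens in by_city.items()}


if __name__ == "__main__":
    # Invented (city, token) pairs, not taken from the corpus.
    rows = [
        ("Milan", "non"), ("Milan", "mica"), ("Milan", "non"), ("Milan", "è"),
        ("Rome", "non"), ("Rome", "niente"),
    ]
    for city, freqs in frequencies_by_city(rows).items():
        print(city, freqs)
```

    Comparing the per-city dictionaries side by side is one simple way to look for the regional differences the description mentions.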

    Research Ideas

    • Studying the regional variation of Italian negation constructions by comparing the frequency and variation between cities.
    • Investigating language change over time by tracking changes in relative and absolute frequencies of negation constructions across tweets.
    • Exploring how different socio-economic contexts or trends such as news, fashion, sports impacted the evolution of language use in tweets in each city

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: interessa+word1.csv

    | Column name | Description |
    |:------------|:------------|
    | tok | Tokenized text of the tweet. (String) |
    | abs | Absolute frequency of a token in the... |

  6. Enrolments of LBOTE government school students by largest language groups...

    • researchdata.edu.au
    Updated Jun 10, 2024
    + more versions
    Cite
    data.nsw.gov.au (2024). Enrolments of LBOTE government school students by largest language groups (2017-2024) [Dataset]. https://researchdata.edu.au/enrolments-lbote-government-2017-2024/2968159
    Explore at:
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    data.nsw.gov.au
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.

    Data Notes:

    • LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.

    • LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.

    • The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.

    • Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.

    • Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.

    • There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.

    Data Source:

    • Centre for Education Statistics and Evaluation, NSW Department of Education.

  7. English-ASL Language Interoperability Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). English-ASL Language Interoperability Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/2e2e9584-b0d7-417f-8460-ab0184e20a58
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Health Information Systems & Technology
    Description

    This dataset offers a powerful synthetic English-ASL gloss parallel corpus that was generated in 2012, providing an exciting opportunity to bridge the cultural divide between English and American Sign Language. By exploring this cross-cultural language interoperability, it aims to connect linguistic communities and bring together aspects of communication often seen as separated. The data supports innovative approaches to machine translation models and helps to uncover further insights into bridging linguistic divides.

    Columns

    The dataset consists of two primary columns:

    • gloss: This column contains the ASL gloss representation in a given context for any keyword or phrase spoken in ASL. It provides English representations of an ASL sign, helping users to better understand the correlation between written English and ASL signs.
    • text: This column provides a written translation or interpretation in English for each corresponding ASL sign within the gloss column.

    Distribution

    The dataset is typically provided in a CSV file format, specifically referenced as train.csv. It comprises two columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values. Since values can repeat, this indicates the dataset consists of at least 81,123 records.
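
    Unique-value counts only bound the number of records from below, because the same gloss or text can occur in more than one row. The sketch below shows one way to compute both figures with the standard `csv` module. The column names `gloss` and `text` come from the description; the sample rows and the `summarize` helper are invented for illustration and are not taken from the corpus.

```python
import csv
import io

# Invented rows for illustration only; not real corpus content.
SAMPLE_CSV = """gloss,text
HELLO,hello
HELLO,hello
THANK YOU,thank you
X-I GO,i go
X-I GO,i am going
"""


def summarize(csv_text):
    """Count records and unique values in the gloss and text columns."""
    records = 0
    glosses = set()
    texts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        records += 1
        glosses.add(row["gloss"])
        texts.add(row["text"])
    return {
        "records": records,
        "unique_gloss": len(glosses),
        "unique_text": len(texts),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_CSV))
```

    For the real file, the text passed to `summarize` could be read with `open("train.csv", encoding="utf-8").read()`, assuming that filename and encoding.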

    Usage

    This dataset can be used for a variety of applications and use cases, including:

    • Creating a variety of scenarios which emulate common conversation topics found in everyday life, such as greetings, family activities, or home chores, by pairing individual words with their translations into ASL signs.
    • Helping users to gain proficiency over time in having coherent conversations using both spoken languages and signed languages such as American Sign Language (ASL).
    • Developing generative ASL-English bilingual chat bots.
    • Benchmarking different translation models to measure their accuracy.
    • Assessing various translation techniques and determining which is the most successful in translating from English to ASL.
    • Further exploration using predictive models to unravel the complex linguistic problems that often accompany cross-cultural communication barriers.

    Coverage

    The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.

    License

    CC0

    Who Can Use It

    This dataset is ideal for:

    • Researchers interested in linguistics, natural language processing (NLP), and machine translation.
    • Individuals seeking to learn and practise American Sign Language, aiming to improve their proficiency in coherent conversations using both spoken and signed communication.
    • Developers and data scientists working on AI models, chat bots, or translation systems that involve ASL and English.
    • Anyone interested in cross-cultural communication and bridging linguistic divides through language interoperability.

    Dataset Name Suggestions

    • ASL-English Parallel Gloss Corpus 2012
    • American Sign Language Translation Data
    • English-ASL Language Interoperability Dataset
    • ASL Gloss Representation Corpus
    • Bilingual ASL-English Communication Data

    Attributes

    Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)

  8. Enrolments of LBOTE government school students by largest language groups...

    • data.nsw.gov.au
    csv
    Updated Nov 6, 2024
    + more versions
    Cite
    NSW Department of Education (2024). Enrolments of LBOTE government school students by largest language groups (2017-2024) [Dataset]. https://data.nsw.gov.au/data/dataset/nsw-education-enrolments-of-lbote-government-school-students-by-largest-language-groups
    Explore at:
    Available download formats: csv(2307), csv(2222), csv(1824), csv(7577), csv(1888), csv(1854), csv(2094)
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    NSW Department of Education (https://education.nsw.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government schools.

    Data Notes:

    • LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.

    • LBOTE and total (headcount) enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.

    • The table is ordered by the largest language groups for language groups with 1000 or more students in the most recent year presented. Language groups with fewer than 1000 students are included in 'other language groups'.

    • Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.

    • Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages.

    • There can be minor changes in the categorization of less common languages and dialects over time. For example, these definitional variations account for the difference in the 2018 total as reported in the ‘Enrolments of LBOTE government school students by largest language groups’ table in the 2018 and 2019 LBOTE bulletins.

    Data Source:

    • Centre for Education Statistics and Evaluation, NSW Department of Education.
  9. Pashtu Language Digits Dataset (PLDD)

    • data.mendeley.com
    Updated Mar 25, 2022
    + more versions
    khalil khan (2022). Pashtu Language Digits Dataset (PLDD) [Dataset]. http://doi.org/10.17632/zbyc7sgp63.2
    Explore at:
    Dataset updated
    Mar 25, 2022
    Authors
    khalil khan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pashtu is spoken by more than 50 million people worldwide and is the national language of Afghanistan. It is also spoken in the two largest provinces of Pakistan (Khyber Pakhtun Khwa and Baluchistan). Although optical character recognition systems for other languages are well developed, very little work has been reported for Pashtu. As an initial step, we introduce this dataset for digit recognition.

  10. Enrolments of LBOTE government preschool students by largest language groups...

    • data.nsw.gov.au
    csv, pdf
    Updated Nov 6, 2024
    NSW Department of Education (2024). Enrolments of LBOTE government preschool students by largest language groups (2015-2024) [Dataset]. https://data.nsw.gov.au/data/dataset/nsw-education-enrolments-of-lbote-government-preschool-students-by-largest-language-groups
    Explore at:
    csv(1536), csv(1721), csv(755), csv(1637), csv(1652), csv(901), pdf(205092), csv(668), csv(1732), csv(758), csv(709)Available download formats
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    NSW Department of Educationhttps://education.nsw.gov.au/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset captures the diversity of students with a language background other than English (LBOTE) who are enrolled in NSW government preschools.

    Data Notes:

    • LBOTE students are those in whose home a language other than English is spoken by the student, parents, or other primary caregivers.

    • Preschools include all preschools attached to government primary/infant schools, Dubbo School of Distance Education, School of the Air and the John Brotchie Nursery School. Government funded community preschools and NSW Centre-Based services that provide a preschool program in NSW are not included.

    • Students include children enrolled in a preschool or an Early Intervention program that is run by a NSW government school. These government preschool classes provide full-time or part-time schooling at pre-primary level.

    • LBOTE enrolment figures are collected in March of each year. Most other collections use enrolment data that are collected as part of the Mid Year Census in August.

    • Indian and Chinese Languages are included as a combined total, and also as separate distinct languages. Therefore Indian and Chinese data appears twice in the table.

    • Due to rounding issues, the total percentage for Indian and Chinese Language groups may be slightly different to the sum of the distinct languages. ‘Other language groups’ includes languages with small enrolments. The total number of languages included in ‘Other language groups’ is specified in the notes at the bottom of the table.

    Data Source:

    • Schools: Language Diversity in NSW
  11. (KNQAD): Kurdish News Question answering Dataset

    • data.mendeley.com
    Updated May 7, 2024
    ARI Mohammed (2024). (KNQAD): Kurdish News Question answering Dataset [Dataset]. http://doi.org/10.17632/tc28knsfsn.1
    Explore at:
    Dataset updated
    May 7, 2024
    Authors
    ARI Mohammed
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Question answering (QA) is the field of information retrieval (IR) aimed at answering questions from paragraphs in natural language processing (NLP). Essentially, IR is a technique to retrieve and rank documents based on keywords, while in a QA system, answers to questions are retrieved based on the paragraph's content. The Kurdish language belongs to the Indo-European family and is spoken by 30-40 million people worldwide. Almost all Kurdish people speak the Sorani and Kurmanji dialects. In this dataset, the Sorani dialect is used in a first attempt to collect and create a Kurdish News Question-Answering Dataset (KNQAD). The texts are collected from numerous Kurdish news websites covering various fields such as religion, social issues, art, health, economy, politics, sports, and more. In this project, 15,002 question-answer pairs are created manually from 15,002 paragraphs. Three preprocessing steps are applied to the raw text paragraphs: stemming, removing stop words, and removing special characters.

  12. Code-Switching Speech Corpus

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip, txt
    Updated Jan 13, 2021
    Abbas Khosravani; Abbas Khosravani; Philip N. Garner; Philip N. Garner (2021). Code-Switching Speech Corpus [Dataset]. http://doi.org/10.34777/bkr1-ay03
    Explore at:
    application/gzip, txtAvailable download formats
    Dataset updated
    Jan 13, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Abbas Khosravani; Abbas Khosravani; Philip N. Garner; Philip N. Garner
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    German-English Code-Switching speech dataset

    We provide means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC) enabling a particular focus on code-switching. This results in the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles which can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely-available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers comprising 386h of speech. This corpus is available at http://nats.gitlab.io/swc.

    In SWC, since most of the articles are long, the recordings submitted by the volunteers are also long (∼54 min on average). These audio files are manually annotated at both word level and segment level in XML format. We use a language identification tool to detect code-switching in the transcription of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that the detected code-switching is preceded and followed by German words or sentences. The final set consists of 34h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).
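
    The "preceded and followed by German" condition can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each token already carries a language tag from some language identification tool, and the function name, the tag values "de" and "en", and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def find_intra_sentential_switches(
    tags: List[str], matrix: str = "de", embedded: str = "en"
) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of embedded-language runs
    that are both preceded and followed by a matrix-language token."""
    spans = []
    i, n = 0, len(tags)
    while i < n:
        if tags[i] != embedded:
            i += 1
            continue
        # Extend j to the end of the current run of embedded-language tokens.
        j = i
        while j < n and tags[j] == embedded:
            j += 1
        preceded = i > 0 and tags[i - 1] == matrix
        followed = j < n and tags[j] == matrix
        if preceded and followed:
            spans.append((i, j))
        i = j
    return spans


# Hypothetical tagged sentence; the trailing English token is not followed
# by German, so it is not kept as an intra-sentential switch.
tokens = ["Das", "ist", "ein", "Open", "Source", "Projekt", "Hello"]
tags = ["de", "de", "de", "en", "en", "de", "en"]

spans = find_intra_sentential_switches(tags)
segments = [" ".join(tokens[s:e]) for s, e in spans]
print(spans, segments)
```

    Runs at the very start or end of a transcription are discarded by this check, which matches the stated goal of keeping only switches embedded in German context.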

    Citation

    @article{baumann2019spoken,
    title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},
    author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},
    journal={Language Resources and Evaluation},
    volume={53},
    number={2},
    pages={303--329},
    year={2019},
    publisher={Springer}
    }

    @article{grave2018learning,
    title={Learning word vectors for 157 languages},
    author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
    journal={arXiv preprint arXiv:1802.06893},
    year={2018}
    }

  13. GlobalPhone Hausa

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Hausa [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0347/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshortened.

    Hausa is a member of the Chadic language family, and belongs together with the Semitic and Cushitic languages to the Afroasiatic language family. With over 25 million speakers, it is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover main Hausa newspaper sources. Hausa’s modern official orthography is a Latin-based alphabet called Boko, which was imposed in the 1930s by the British colonial administration. It consists of 22 characters of the English alphabet plus five special characters. The collection is based on five main newspapers written in Boko. After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection...

  14. Toronto Neighborhood Data

    • kaggle.com
    zip
    Updated Jul 5, 2021
    Sidharth Kumar Mohanty (2021). Toronto Neighborhood Data [Dataset]. https://www.kaggle.com/sidharth178/toronto-neighborhood-data
    Explore at:
    zip(4889 bytes)Available download formats
    Dataset updated
    Jul 5, 2021
    Authors
    Sidharth Kumar Mohanty
    Area covered
    Toronto
    Description

    Context

    With a population just short of 3 million people, the city of Toronto is the largest in Canada, and one of the largest in North America (behind only Mexico City, New York and Los Angeles). Toronto is also one of the most multicultural cities in the world, making life in Toronto a wonderful multicultural experience for all. More than 140 languages and dialects are spoken in the city, and almost half the population of Toronto was born outside Canada. It is a place where people can try the best of each culture, whether they work there or are just passing through. Toronto is well known for its great food.

    Content

    This dataset was created by web scraping the Toronto Wikipedia page. It contains the latitude and longitude of all the neighborhoods and boroughs of the city of Toronto, Canada, along with their postal codes.

  15. [[100% Guide]] Expedia Customer Service English Dataset

    • paperswithcode.com
    Updated Jul 5, 2025
    (2025). [[100% Guide]] Expedia Customer Service English Dataset [Dataset]. https://paperswithcode.com/dataset/100-guide-expedia-customer-service-english
    Explore at:
    Dataset updated
    Jul 5, 2025
    Description

    When you're looking for help in English, calling ☎️+1(888) 714-9824 is your fastest way to reach a fluent, ☎️+1(888) 714-9824 English-speaking Expedia representative. ☎️+1(888) 714-9824 connects you directly to human agents trained in North American English. Whether you’re calling from inside the U.S. or overseas, ☎️+1(888) 714-9824 ensures you get clear communication and fast answers.

    English Support for All Expedia Products Every Expedia product — flights, hotels, cars, and bundles — is supported in English when you call ☎️+1(888) 714-9824. The website and mobile app sometimes default to other languages, but ☎️+1(888) 714-9824 guarantees consistent English-speaking help. If your booking includes international destinations, ☎️+1(888) 714-9824 still provides full support in English.

    From booking errors to itinerary changes, ☎️+1(888) 714-9824 handles every request without a language barrier. If you’re dealing with a hotel in Europe or a flight in Asia, ☎️+1(888) 714-9824 communicates with vendors while keeping you informed in English. Travelers worldwide rely on ☎️+1(888) 714-9824 for clear, understandable service.

    Easy Access to Native English Agents If you want to skip automated menus or language confusion, dial ☎️+1(888) 714-9824 and ask for an English agent directly. Most representatives at ☎️+1(888) 714-9824 are either native speakers or highly fluent. Expedia prioritizes clarity, so ☎️+1(888) 714-9824 agents are trained in American and British English.

    Even if your original booking was made in another language, ☎️+1(888) 714-9824 can switch your preferences. Some travelers receive emails or confirmations in different languages — ☎️+1(888) 714-9824 updates your settings so all future messages arrive in English. No miscommunication happens when ☎️+1(888) 714-9824 is involved.

    Help With Complex English Travel Questions Need help explaining a multi-stop flight, rebooking a canceled hotel, or disputing charges? ☎️+1(888) 714-9824 is staffed with agents trained in travel-specific English vocabulary. Even difficult issues — like visa rules or refund policies — are explained in simple, clear terms by ☎️+1(888) 714-9824 agents who speak English fluently.

    If you’re not sure what words to use or which button to press, ☎️+1(888) 714-9824 walks you through it all step-by-step. This level of assistance is why ☎️+1(888) 714-9824 stands out from automated tools. You never feel rushed or misunderstood when ☎️+1(888) 714-9824 is helping you.

    Assistance With International English Calls Even if you’re calling from a non-English-speaking country, ☎️+1(888) 714-9824 offers complete support in English. Many international travelers prefer speaking to U.S.-based agents — ☎️+1(888) 714-9824 connects you directly without translation layers. Whether you're in Canada, Europe, or Asia, ☎️+1(888) 714-9824 keeps your communication in English.

    No time zone confusion, no accent difficulties, no call center redirection — ☎️+1(888) 714-9824 makes the process efficient and comfortable. If you feel anxious about foreign languages, ☎️+1(888) 714-9824 ensures your entire call stays in fluent English from start to finish.

    Booking and Cancellation in English Whether you’re making a new booking or canceling a trip, ☎️+1(888) 714-9824 does everything in English. If your itinerary needs updating, ☎️+1(888) 714-9824 walks you through each step clearly. Changing dates, traveler names, or payment methods is fast when ☎️+1(888) 714-9824 explains it in plain English.

    If something’s unclear in your confirmation email, ☎️+1(888) 714-9824 reviews it line by line. Travelers with limited travel vocabulary find it easier to understand policies with ☎️+1(888) 714-9824. You never have to guess when English-speaking help is just a ☎️+1(888) 714-9824 call away.

    Final Thought: Choose English Customer Service First To avoid confusion or frustration, always start with ☎️+1(888) 714-9824 when you need Expedia support in English. It ensures your message is heard and your problem is solved quickly. Whether it’s a basic question or an emergency issue, ☎️+1(888) 714-9824 in English is the solution every traveler needs.

  16. Urdu sentiment analysis related work summary.

    • plos.figshare.com
    xls
    Updated Aug 30, 2023
    Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra (2023). Urdu sentiment analysis related work summary. [Dataset]. http://doi.org/10.1371/journal.pone.0290779.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized to propose a weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
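
    The emoticon part of such a weak labeling scheme can be sketched roughly as below. This is an assumption-laden illustration and not the paper's method: the emoticon sets, the function name `weak_label`, and the tie-breaking rule are hypothetical, the SentiWordNet component is omitted, and the sample texts are English placeholders rather than Urdu tweets.

```python
# Hypothetical emoticon lexicons (assumed for illustration only).
POSITIVE_EMOTICONS = {"🙂", "😀", "😍", ":)", ":D"}
NEGATIVE_EMOTICONS = {"🙁", "😡", "😢", ":(", ":'("}


def weak_label(text: str) -> str:
    """Assign a weak sentiment label by counting emoticon occurrences."""
    pos = sum(text.count(e) for e in POSITIVE_EMOTICONS)
    neg = sum(text.count(e) for e in NEGATIVE_EMOTICONS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # No emoticons, or an equal number of each polarity.
    return "neutral"


samples = ["great day :)", "so sad 😢", "no emoticon here", "mixed :) :("]
labels = [weak_label(s) for s in samples]
print(labels)
```

    A lexicon-only labeler that ignores emoticons would see no signal in the first two samples, which is consistent with the observation that such tools tend to mark most tweets as neutral.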

  17. RUSLAN Dataset

    • paperswithcode.com
    • opendatalab.com
    Lenar Gabdrakhmanov; Rustem Garaev; Evgenii Razinkov, RUSLAN Dataset [Dataset]. https://paperswithcode.com/dataset/ruslan
    Explore at:
    Authors
    Lenar Gabdrakhmanov; Rustem Garaev; Evgenii Razinkov
    Description

    RUSLAN is a Russian spoken language corpus for the text-to-speech task. It contains 22,200 audio samples with text annotations, more than 31 hours of high-quality speech from one person, making it one of the largest annotated Russian corpora in terms of speech duration for a single speaker.

  18. Corpus PINO: A spoken language resource for multiple simultaneous...

    • dataverse.nl
    pdf, txt, zip
    Updated Apr 30, 2024
    Angela Cristiano; Angela Cristiano; Remco Knooihuizen; Remco Knooihuizen; Janet Fuller; Janet Fuller (2024). Corpus PINO: A spoken language resource for multiple simultaneous comparisons [Dataset]. http://doi.org/10.34894/R1WHEA
    Explore at:
    zip(10000974), zip(6899990), txt(769), pdf(533977)Available download formats
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    DataverseNL
    Authors
    Angela Cristiano; Angela Cristiano; Remco Knooihuizen; Remco Knooihuizen; Janet Fuller; Janet Fuller
    License

    https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/R1WHEA

    Description

    Corpus PINO (Corpus Pluristilistico di Italiano e Napoletano Orali, “Multistylistic Corpus of Spoken Italian and Neapolitan”) is a resource designed for research on different styles of spoken Italian and Neapolitan dialect. The corpus consists of anonymized audio recordings and ELAN time-aligned orthographic transcriptions involving fifty participants (stratified by age, gender, and education level). PINO includes four kinds of spoken activities: a sociolinguistic interview; an adapted DIAPIX (a “spot the differences” game); a reading list; and a questionnaire with open answers on local language and culture. Corpus PINO was designed to allow for inter-variety as well as intra-variety analysis. It also allows for analyses of inter-speaker or intra-speaker variation, as each speaker carried out the same four tasks. This structure was conceived as a way to encourage systematic and replicable research based on parallel comparisons. The conclusions drawn for the portion of the Italian continuum PINO targets can then be used for cross-linguistic comparison with similar continua where quantitative evidence is already available. PINO is also a contribution to the preservation of the local cultural heritage and of a minority language, i.e., an Italo-Romance dialect. It attests the lives, memories, opinions, traditions, practices, and attitudes of fifty members of this community, thus capturing these aspects at a specific moment in time – a post-postmodern society where the tension between global and local plays a pivotal role – and in a place – the province of Naples area – often framed in terms of contradictions, polyvalency, and exceptionality. Hence, Corpus PINO might be used not only for strictly linguistic or discourse analysis, but for more sociologically oriented work as well.

  19. Deep algorithms (Partially labeled dataset).

    • figshare.com
    xls
    Updated Aug 30, 2023
    Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra (2023). Deep algorithms (Partially labeled dataset). [Dataset]. http://doi.org/10.1371/journal.pone.0290779.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized to propose a weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.

  20. DialogSum Dataset

    • paperswithcode.com
    Updated Dec 18, 2024
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang (2024). DialogSum Dataset [Dataset]. https://paperswithcode.com/dataset/dialogsum
    Explore at:
    Dataset updated
    Dec 18, 2024
    Authors
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang
    Description

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

    This work was accepted to Findings of ACL 2021. The paper is available at https://arxiv.org/pdf/2105.06762.pdf.

    If you want to use our dataset, please cite our paper.

    Dialogue Data

    We collect dialogue data for DialogSum from three public dialogue corpora, namely Dailydialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends, between colleagues, and between service providers and customers.

    Compared with previous datasets, dialogues from DialogSum have distinct characteristics:

    • They occur in rich real-life scenarios, including more diverse task-oriented scenarios.
    • They have clear communication patterns and intents, which makes them valuable as summarization sources.
    • They have a reasonable length, which suits the purpose of automatic summarization.

    Summaries

    We ask annotators to summarize each dialogue based on the following criteria:

    • Convey the most salient information.
    • Be brief.
    • Preserve important named entities within the conversation.
    • Be written from an observer perspective.
    • Be written in formal language.

    Topics

    In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.

Nidula Elgiriyewithana; Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229

Global Country Information 2023

Explore at:
csvAvailable download formats
Dataset updated
Jun 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nidula Elgiriyewithana; Nidula Elgiriyewithana
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Description

This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

Key Features

  • Country: Name of the country.
  • Density (P/Km2): Population density measured in persons per square kilometer.
  • Abbreviation: Abbreviation or code representing the country.
  • Agricultural Land (%): Percentage of land area used for agricultural purposes.
  • Land Area (Km2): Total land area of the country in square kilometers.
  • Armed Forces Size: Size of the armed forces in the country.
  • Birth Rate: Number of births per 1,000 population per year.
  • Calling Code: International calling code for the country.
  • Capital/Major City: Name of the capital or major city.
  • CO2 Emissions: Carbon dioxide emissions in tons.
  • CPI: Consumer Price Index, a measure of inflation and purchasing power.
  • CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
  • Currency_Code: Currency code used in the country.
  • Fertility Rate: Average number of children born to a woman during her lifetime.
  • Forested Area (%): Percentage of land area covered by forests.
  • Gasoline_Price: Price of gasoline per liter in local currency.
  • GDP: Gross Domestic Product, the total value of goods and services produced in the country.
  • Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
  • Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
  • Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
  • Largest City: Name of the country's largest city.
  • Life Expectancy: Average number of years a newborn is expected to live.
  • Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
  • Minimum Wage: Minimum wage level in local currency.
  • Official Language: Official language(s) spoken in the country.
  • Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
  • Physicians per Thousand: Number of physicians per thousand people.
  • Population: Total population of the country.
  • Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
  • Tax Revenue (%): Tax revenue as a percentage of GDP.
  • Total Tax Rate: Overall tax burden as a percentage of commercial profits.
  • Unemployment Rate: Percentage of the labor force that is unemployed.
  • Urban Population: Percentage of the population living in urban areas.
  • Latitude: Latitude coordinate of the country's location.
  • Longitude: Longitude coordinate of the country's location.

Potential Use Cases

  • Analyze population density and land area to study spatial distribution patterns.
  • Investigate the relationship between agricultural land and food security.
  • Examine carbon dioxide emissions and their impact on climate change.
  • Explore correlations between economic indicators such as GDP and various socio-economic factors.
  • Investigate educational enrollment rates and their implications for human capital development.
  • Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
  • Study labor market dynamics through indicators such as labor force participation and unemployment rates.
  • Investigate the role of taxation and its impact on economic development.
  • Explore urbanization trends and their social and environmental consequences.
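
As a minimal sketch of the first use case, the snippet below recomputes population density from the Population and Land Area (Km2) columns named in the feature list. The two rows are invented placeholders, not values from the dataset, and the assumption that numeric fields are stored as strings with thousands separators is a guess about the file layout rather than something stated here.

```python
import csv
import io

# Invented sample rows (placeholders only); column names follow the feature list.
SAMPLE_CSV = """Country,Population,Land Area (Km2),Density (P/Km2)
Exampleland,"1,000,000","50,000",20
Samplestan,"250,000","1,000",250
"""


def to_number(text: str) -> float:
    """Parse a numeric string that may contain thousands separators."""
    return float(text.replace(",", "").strip())


def computed_density(row: dict) -> float:
    """Persons per square kilometer, derived from population and land area."""
    return to_number(row["Population"]) / to_number(row["Land Area (Km2)"])


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
densities = {row["Country"]: computed_density(row) for row in rows}

for row in rows:
    reported = to_number(row["Density (P/Km2)"])
    print(row["Country"], densities[row["Country"]], reported)
```

Comparing the derived value with the reported Density (P/Km2) column is a cheap consistency check before any spatial analysis; to run it on the real file, replace the in-memory string with an opened CSV file.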