60 datasets found
  1. Top Languages Spoken in the United States

    • kaggle.com
    zip
    Updated Oct 22, 2022
    Cite
    The Devastator (2022). Top Languages Spoken in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/top-languages-spoken-in-the-united-states
    Explore at:
    Available download formats: zip (356420 bytes)
    Dataset updated
    Oct 22, 2022
    Authors
    The Devastator
    Area covered
    United States
    Description

    Top Languages Spoken in the United States

    The Impact of linguistics on Community and Business in America

    About this dataset

    Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).

    Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:

    Columns:
    - index: Index column for dataframe
    - Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
    - Unnamed: 1: Rank of CBSA by total number of speakers of all languages
    - Unnamed: 2: Name of CBSA
    - Unnamed: 3: Population of CBSA
    - Unnamed: 4: Percent of population that speaks English very well
    - Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages
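    The "Unnamed: N" column names suggest the file was read with its first row as the header, although the description says the real column headers sit in row 5. The following is a minimal sketch of reading such a layout with Python's standard `csv` module. The sample text, its column names, and the helper `read_with_header_row` are all hypothetical and only mimic the described structure; the actual file contents are not shown in the listing.

```python
import csv
import io

# Hypothetical sample mimicking the described layout: four preamble rows,
# column headers in row 5, data rows afterwards. All values are made up.
SAMPLE = """Languages Spoken at Home,,,
Source: illustrative only,,,
,,,
,,,
Rank,CBSA,Population,Spanish %
1,Example Metro A,1000000,12.5
2,Example Metro B,500000,3.1
"""


def read_with_header_row(text, header_row=5):
    """Return a list of dicts, using the 1-indexed `header_row` as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[header_row - 1]
    # Skip rows that contain only empty cells.
    return [dict(zip(header, row)) for row in rows[header_row:] if any(row)]


records = read_with_header_row(SAMPLE)
print(records[0]["CBSA"])
```

    With pandas, passing `header=4` (a zero-indexed row number) to `read_csv` would likely achieve the same effect, assuming the real file follows this layout.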

    How to use the dataset

    1. This dataset can be used to understand the linguistic diversity of the United States, and to compare languages spoken across different states and cities.
    2. This data can also be used to explore trends in language usage over time.
    3. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.
    4. Schools could use this dataset to plan language-learning programs based on the needs of their community.
    5. Policymakers could use this data to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Research Ideas

    1. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and cater their marketing or customer service accordingly.
    2. Schools could use this data to plan language-learning programs based on the needs of their community.
    3. Policymakers could use this dataset to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Acknowledgements

    This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: Languages Spoken at Home by Urban Area = CBSA.csv

    File: US Languages Spoken at Home 2014.csv

    | Column name | Description |
    |:-------------------------------------------------------------------|:--------------|
    | Table with column headers in row 5 and row headers in column A | |

  2. Language Spoken at Home Full Dataset

    • geo.wa.gov
    Updated Jan 1, 2026
    Cite
    Shelby.Flanagan@doh.wa.gov_WADOH (2026). Language Spoken at Home Full Dataset [Dataset]. https://geo.wa.gov/items/4185536f95d649dcb8d586cb873e5d81
    Explore at:
    Dataset updated
    Jan 1, 2026
    Dataset authored and provided by
    Shelby.Flanagan@doh.wa.gov_WADOH
    Area covered
    Description

    Background

    In the US, people who don’t speak English well often have a lower quality of life than those who do [1]. They may also have limited access to health care, including mental health services, and may not be able to take part in key national health surveys like the Behavioral Risk Factor Surveillance System (BRFSS). Communities where many people have limited English skills tend to live closer to toxic chemicals. Limited English skills can also make it harder for community members to get involved in local decision-making, which can affect environmental policies and lead to health inequalities.

    Data Source

    Washington Office of the Superintendent of Public Instruction (OSPI) | Public Records Center

    Methodology

    The data was collected through a public records request from the OSPI data portal. It shows what languages students speak at home, organized by school district. OSPI collects and reports data by academic year. For example, the 2023 data comes from the 2022-2023 school year (August 1, 2022 to May 31, 2023). OSPI updates this information regularly.

    Caveats

    These figures only include households with children enrolled in public schools from pre-K through 12th grade. The data may change over time as new information becomes available.

    Source

    1. Shariff-Marco, S., Gee, G. C., Breen, N., Willis, G., Reeve, B. B., Grant, D., Ponce, N. A., Krieger, N., Landrine, H., Williams, D. R., Alegria, M., Mays, V. M., Johnson, T. P., & Brown, E. R. (2009). A mixed-methods approach to developing a self-reported racial/ethnic discrimination measure for use in multiethnic health surveys. Ethnicity & disease, 19(4), 447–453.

    Citation

    Washington Tracking Network, Washington State Department of Health. Languages Spoken at Home. Data from the Washington Office of Superintendent of Public Instruction (OSPI). Published January 2026. Web.

  3. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States of America to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16 kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
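
    The stated audio format (stereo, 16-bit, 16 kHz WAV) can be checked programmatically before training. The sketch below uses Python's standard `wave` module; the helper names `wav_properties` and `matches_spec` are illustrative, and the in-memory file is synthetic silence written only to make the example self-contained, not a sample from the dataset.

```python
import io
import wave


def wav_properties(fileobj):
    """Read channel count, bit depth, and sample rate from a WAV file object."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": wav.getframerate(),
        }


def matches_spec(props, channels=2, bit_depth=16, sample_rate=16000):
    """Check properties against the format stated in the listing."""
    return (
        props["channels"] == channels
        and props["bit_depth"] == bit_depth
        and props["sample_rate"] == sample_rate
    )


# Build a tiny synthetic stereo file in memory (160 frames of silence).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)      # 2 bytes per sample = 16-bit
    out.setframerate(16000)
    out.writeframes(b"\x00" * (2 * 2 * 160))  # channels * bytes * frames
buffer.seek(0)

props = wav_properties(buffer)
print(props, matches_spec(props))
```

    For real files, a path string can be passed to `wave.open` in place of the in-memory buffer.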

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%
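
    Word error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition; it is not the vendor's QA procedure, which the listing does not describe, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.3f}")
```

    Under this definition, a WER below 5% corresponds to fewer than one word-level error per twenty reference words on average.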

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

  4. The most spoken languages worldwide 2025

    • statista.com
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  5. LA County Language Spoken at Home (census tract)

    • data.lacounty.gov
    • egis-lacounty.hub.arcgis.com
    • +2 more
    Updated Jul 28, 2025
    Cite
    County of Los Angeles (2025). LA County Language Spoken at Home (census tract) [Dataset]. https://data.lacounty.gov/datasets/la-county-language-spoken-at-home-census-tract
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset authored and provided by
    County of Los Angeles
    Area covered
    Description

    US Census American Community Survey Custom Tabulation (ST542) by Census Tract. Language spoken at home for population 5 years and over by ability to speak English, summarized by census tract for 114 languages spoken across LA County, 5-year estimates 2019-2023.

    See also source data tables:
    Census Tracts: Language Spoken at Home LA County Census Tracts
    LA County: Language Spoken at Home LA County

    Headings:
    GEOID: Geography identification
    CT20: Census tract (2020)
    Name: Census tract name
    CSA: Countywide Statistical Area (city or community)
    SPA: Service Planning Area
    SD: Supervisorial District
    total_pop: Population over 5 years old in census tract (universe)
    total_limited_eng: Population that speaks English less than "very well"
    total_limited_eng_pct: Percent of population that speaks English less than "very well"

  6. 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use...

    • catalog.dvrpc.org
    • njogis-newjersey.opendata.arcgis.com
    • +1 more
    api, geojson, html +1
    Updated Nov 4, 2025
    + more versions
    Cite
    DVRPC (2025). 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use Microdata Areas [Dataset]. https://catalog.dvrpc.org/dataset/2023-limited-english-proficiency-lep-for-the-dvrpc-region-public-use-microdata-areas
    Explore at:
    Available download formats: api, xml, html, geojson
    Dataset updated
    Nov 4, 2025
    Dataset authored and provided by
    DVRPC
    Description

    The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people with Limited English Proficiency (also known as LEP persons) from obtaining services or information, or from participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited-English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders, or use PUMA criteria provided by the Census Bureau. The data come from the 2019-2023 American Community Survey (ACS) 5-Year Estimates, Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.

    *Limited English Proficiency (LEP) refers to those persons that speak English less than "very well". DVRPC has mapped the below Language Groups for our Plan.

    Spanish

    Russian

    Chinese

    Korean

    Vietnamese

    Source of PUMA boundaries: US Census Bureau, TIGER/Line Files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and the fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.

  7. Share of U.S. school-age children who don't speak English at home 1979-2022

    • statista.com
    Updated Apr 25, 2014
    Cite
    Statista (2014). Share of U.S. school-age children who don't speak English at home 1979-2022 [Dataset]. https://www.statista.com/statistics/476804/percentage-of-school-age-children-who-speak-another-language-than-english-at-home-in-the-us/
    Explore at:
    Dataset updated
    Apr 25, 2014
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2022, about 21.4 percent of schoolchildren in the United States spoke a language other than English at home. This is a slight increase from 2021, when 21.3 percent of U.S. schoolchildren did not speak English at home.

  8. Census Data - Languages spoken in Chicago, 2008 – 2012

    • data.cityofchicago.org
    • healthdata.gov
    • +3 more
    csv, xlsx, xml
    Updated Sep 12, 2014
    Cite
    U.S. Census Bureau (2014). Census Data - Languages spoken in Chicago, 2008 – 2012 [Dataset]. https://data.cityofchicago.org/Health-Human-Services/Census-Data-Languages-spoken-in-Chicago-2008-2012/a2fk-ec6q
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Sep 12, 2014
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Authors
    U.S. Census Bureau
    Area covered
    Chicago
    Description

    This dataset contains estimates of the number of residents aged 5 years or older in Chicago who “speak English less than very well,” by the non-English language spoken at home and community area of residence, for the years 2008 – 2012. See the full dataset description for more information at: https://data.cityofchicago.org/api/views/fpup-mc9v/files/dK6ZKRQZJ7XEugvUavf5MNrGNW11AjdWw0vkpj9EGjg?download=true&filename=P:\EPI\OEPHI\MATERIALS\REFERENCES\ECONOMIC_INDICATORS\Dataset_Description_Languages_2012_FOR_PORTAL_ONLY.pdf

  9. What languages are spoken by people who have limited English ability?

    • anrgeodata.vermont.gov
    • visionzero.geohub.lacity.org
    Updated Apr 21, 2023
    Cite
    Urban Observatory by Esri (2023). What languages are spoken by people who have limited English ability? [Dataset]. https://anrgeodata.vermont.gov/maps/2c15f2d8d81343d883e70a73317a5cb9
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    Urban Observatory by Esri
    Area covered
    Description

    This map shows the predominant language(s) spoken by people who have limited English speaking ability. This is shown using American Community Survey data from the US Census Bureau by state, county, and tract.

    There are 12 different language/language groupings:
    Spanish
    French, Haitian, or Cajun
    Korean
    Chinese (including Mandarin and Cantonese)
    Vietnamese
    Tagalog (including Filipino)
    Arabic
    German or other West Germanic
    Russian, Polish, or other Slavic
    Other Indo-European (such as Italian or Portuguese)
    Other Asian and Pacific Island (such as Japanese or Hmong)
    Other and unspecified (such as Navajo or Hebrew)

    This map also uses a feature effect to identify the counties with either 10,000 or 5% of the population having limited English ability. According to the Voting Rights Act, "localities where there are more than 10,000 or over 5 percent of the total voting age citizens in a single political subdivision (usually a county, but a township or municipality in some states) who are members of a single language minority group, have depressed literacy rates, and do not speak English very well" are required to "provide [voting materials] in the language of the applicable minority group as well as in the English language".

    This map uses these hosted feature layers containing the most recent American Community Survey data. These layers are part of ArcGIS Living Atlas, and are updated every year when the American Community Survey releases new estimates, so values in the map always reflect the newest data available.
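
    The 10,000-or-5% feature effect can be expressed as a simple filter. The sketch below is a simplification: it applies the two numeric thresholds to a county's total limited-English population, whereas the quoted Voting Rights Act language applies per language minority group among voting-age citizens and adds a literacy condition. The function name, the strict "more than" comparison, and the county figures are all assumptions made for illustration.

```python
def meets_threshold(limited_english_count, total_population,
                    min_count=10_000, min_share=0.05):
    """Return True if the count exceeds 10,000 or the share exceeds 5%."""
    if total_population <= 0:
        return False
    share = limited_english_count / total_population
    return limited_english_count > min_count or share > min_share


# Hypothetical counties: (limited-English population, total population).
counties = {
    "County A": (12_000, 1_000_000),  # passes on count (1.2% share)
    "County B": (3_000, 50_000),      # passes on share (6%)
    "County C": (500, 100_000),       # passes neither test
}

flagged = [name for name, (lep, total) in counties.items()
           if meets_threshold(lep, total)]
print(flagged)
```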

  10. Predominant Language Spoken at Home - ACS 2016-Copy-Copy

    • umn.hub.arcgis.com
    Updated Dec 24, 2019
    Cite
    University of Minnesota (2019). Predominant Language Spoken at Home - ACS 2016-Copy-Copy [Dataset]. https://umn.hub.arcgis.com/maps/75844925ac7d472ea49c62e518686577
    Explore at:
    Dataset updated
    Dec 24, 2019
    Dataset authored and provided by
    University of Minnesota
    Area covered
    Description

    This map shows the predominant language spoken at home by the US population aged 5+. This is shown by Census Tract and County centroids. The data values are from the 2012-2016 American Community Survey 5-year estimates in the S1601 Table for Language Spoken at Home. The popup in the map provides a breakdown of the population age 5+ by the language spoken at home. Data values for other age groups are also available within the data's table. The color of the symbols represents the most common language spoken at home. This predominance map style compares the count of people age 5+ based on what language is spoken at home, and returns the value with the highest count. The census breaks down the population 5+ by the following language options:
    English Only
    Non-English - Spanish
    Non-English - Asian and Pacific Islander Languages
    Non-English - Indo European Languages
    Non-English - Other

    The size of the symbols represents how many people are 5 years or older, which helps highlight the quantity of people that live within an area that were sampled for this language categorization. The strength of the color represents how predominant a language is within an area. If the symbol is a strong color, that language makes up a larger portion of the population. This map is designed for a dark basemap such as the Human Geography Basemap or the Dark Gray Canvas Basemap. See the web map to see the pattern at both the county and tract level. This map helps to show the most common language spoken at home at both a regional and local level. The tract pattern shows how distinct neighborhoods are clustered by which language they speak. The county pattern shows how language is used throughout the country. This pattern is shown by census tracts at large scales, and counties at smaller scales.

    This data was downloaded from the United States Census Bureau American Fact Finder on January 16, 2018. It was then joined with 2016 vintage centroid points and hosted to ArcGIS Online and into the Living Atlas.

    Nationally, the breakdown of language spoken at home for the population 5+ is as follows:

    | Category | Total Estimate | Margin of Error | Percent Estimate | Margin of Error |
    |:---|---:|---:|---:|---:|
    | Population 5 years and over | 298,691,202 | +/-3,594 | (X) | (X) |
    | Speak only English | 235,519,143 | +/-154,409 | 78.90% | +/-0.1 |
    | Spanish | 39,145,066 | +/-94,571 | 13.10% | +/-0.1 |
    | Asian and Pacific Island languages | 10,172,370 | +/-22,561 | 3.40% | +/-0.1 |
    | Other Indo-European languages | 10,827,536 | +/-46,335 | 3.60% | +/-0.1 |
    | Other languages | 3,027,087 | +/-23,302 | 1.00% | +/-0.1 |
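
    The predominance logic described here (compare the counts and return the category with the highest one) can be sketched in a few lines. The counts below are the national estimates quoted in this description; the function name `predominant` and the idea of using the winning category's share as a stand-in for color strength are assumptions for illustration, not the map's actual symbology rules.

```python
# National estimates for the population 5 years and over, as quoted above.
national_counts = {
    "Speak only English": 235_519_143,
    "Spanish": 39_145_066,
    "Asian and Pacific Island languages": 10_172_370,
    "Other Indo-European languages": 10_827_536,
    "Other languages": 3_027_087,
}


def predominant(counts):
    """Return the category with the highest count and its share of the total."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


label, share = predominant(national_counts)
print(label, f"{share:.1%}")
```

    The same function could be applied per tract or per county, given a dictionary of counts for each area.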

  11. American English Call Center Data for Telecom AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English Call Center Data for Telecom AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/telecom-call-center-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor pool.
    Regions: Representing multiple states across the United States of America to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8 kHz and 16 kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.

    Inbound Calls:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refund Requests and Billing Adjustments
    Emergency Service Access, and others
    Outbound Calls:
    Welcome Calls & Onboarding
    Payment Reminders
    Customer Satisfaction Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Status Calls, and more

    This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, coughs)
    High transcription accuracy with word error rate < 5% thanks to dual-layered quality checks.

    These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.

  12. ACS 5YR Demographic Estimate Data by State

    • catalog.data.gov
    • datasets.ai
    Updated Mar 1, 2024
    + more versions
    Cite
    U.S. Department of Housing and Urban Development (2024). ACS 5YR Demographic Estimate Data by State [Dataset]. https://catalog.data.gov/dataset/acs-5yr-demographic-estimate-data-by-state
    Explore at:
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Description

    2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the State level. These variables include Sex By Age, Hispanic Or Latino Origin By Race, Household Type (Including Living Alone), Households By Presence Of People Under 18 Years By Household Type, Households By Presence Of People 60 Years And Over By Household Type, Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over, Average Household Size Of Occupied Housing Units By Tenure, and Sex by Educational Attainment for the Population 18 Years and Over.

  13. Don't Patronize Me!

    • kaggle.com
    zip
    Updated Jul 2, 2021
    Cite
    Carla Perez-Almendros (2021). Don't Patronize Me! [Dataset]. https://www.kaggle.com/carlaperezalmendros/dont-patronize-me
    Explore at:
    Available download formats: zip (150722 bytes)
    Dataset updated
    Jul 2, 2021
    Authors
    Carla Perez-Almendros
    Description

    Don't Patronize Me! -- Data Statement

    A. CURATION RATIONALE

    We present Don’t Patronize Me!, an annotated dataset with Patronizing and Condescending Language (PCL) towards vulnerable communities. This annotated data is especially aimed at the NLP community in order to help improve the modelling and detection of PCL when referring to vulnerable communities, with the ultimate goal of producing and consuming a more responsible and inclusive communication. The Don’t Patronize Me! dataset (v.1.0) consists of 7,738 paragraphs about vulnerable communities extracted from news stories from the News on Web (NoW) corpus (https://www.english-corpora.org/now/, used with permission). This original corpus contains more than 18 million articles crawled from online media in 20 English-speaking countries from 2010 until 2018. In order to create our own dataset, we automatically selected from the NoW corpus just those articles where at least one word from a list of selected keywords was present. The articles were then divided per country and keyword and split into paragraphs. With the objective of assuring a balanced representation of countries and keywords, we randomly selected 75 paragraphs per keyword and country using the SciKitLearn library [11]. The final dataset will be a collection of 15,000 paragraphs with PCL annotations referring to vulnerable communities (150 per keyword; 750 per country).

    Countries represented in the Don’t Patronize Me! dataset: Australia, Hong Kong, Sri Lanka, Pakistan, Bangladesh, Ireland, Malaysia, Singapore, Canada, India, Nigeria, Tanzania, United Kingdom, Jamaica, New Zealand, United States, Ghana, Kenya, Philippines, South Africa.

    The keywords include seven potentially vulnerable groups which are widely referred to in general media and are potential recipients of condescending treatment. The remaining three keywords are concepts usually used to describe the former communities or the situations they live in.

    Keywords: Disable, Homeless, Immigrant, Migrant, Poor families, Women, Hopeless, Vulnerable, In need, Refugee.

    B. LANGUAGE VARIETIES

    The paragraphs included in the Don’t Patronize Me! dataset are written in English. Twenty English speaking countries are represented in the dataset, thus all their varieties of English are expected to be present in the corpus. See below for the codes of the English varieties as recommended in BCP-47. It is not possible for us to know either the regional varieties of English in each country if any, or if English is the speaker’s first language.

    Language varieties in the dataset: en-AU, en-HK, en-LK, en-PK, en-BD, en-IE, en-MY, en-SG, en-CA, en-IN, en-NG, en-TZ, en-GB, en-JM, en-NZ, en-US, en-GH, en-KE, en-PH, en-ZA

    C. SPEAKER DEMOGRAPHIC

    As the paragraphs of our dataset are extracted from another corpus, we do not have the possibility to trace socio-demographic data of the speakers. Nevertheless, we can assume a) they are journalists, as they work in the media, so they are educated professionals; b) they speak English, although we do not know if this is their first language; and c) there is a wide representation of different races and ethnic origins, as we collect texts from 20 countries. In our dataset, each country contributes 750 articles, so this is the maximum number of different authors we could have per country. We have not observed any disorder of speech, as the texts are written, probably edited and reviewed before their publication.

    D. ANNOTATOR DEMOGRAPHIC

    The annotators who collaborated on this dataset are three white females aged between 25 and 35. Their first language is Spanish, but they are proficient in English. They all have graduate and postgraduate studies in communication, computer science and data science.

    E. SPEECH SITUATION

    The news stories from which the paragraphs of our dataset are extracted were published between 2010 and 2018 in 20 countries (see section A). The stories are asynchronous written communication, edited and probably reviewed before publishing. The texts are published news articles; thus they are likely intended to reach a general audience, although the characteristics of the audience might vary depending on the country of publication.

    F. TEXT CHARACTERISTICS

    The texts of the dataset belong to the journalism genre and the topics have been previously selected to cover the treatment of the media towards potentially vulnerable groups, as explained in section A.

  14. Census of Population and Housing, 2000 [United States]: Summary File 4, Iowa...

    • search.gesis.org
    Updated Feb 16, 2021
    Cite
    United States Department of Commerce. Bureau of the Census (2021). Census of Population and Housing, 2000 [United States]: Summary File 4, Iowa - Version 1 [Dataset]. http://doi.org/10.3886/ICPSR13527.v1
    Dataset updated
    Feb 16, 2021
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    GESIS search
    Authors
    United States Department of Commerce. Bureau of the Census
    License

    https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de457443

    Area covered
    Iowa, United States
    Description

    Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.
All persons in housing units in Iowa in 2000.
2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.
Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available.
The data are provided in 38 segments (files) per iteration. These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336; the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables. For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).
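
The counts quoted in this description can be cross-checked with a short script. The sketch below parses the segment labels copied from the text and confirms that they amount to 38 files covering tables PCT1 to PCT213 and HCT1 to HCT110, and that the listed population groups sum to 336. The helper name `parse_segment` and the variable names are assumptions made for illustration; they do not come from ICPSR or the Census Bureau.

```python
import re

# Segment labels copied from the description; "(partial)" marks the two files
# that split table PCT86 between them.
SEGMENTS = (
    "PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, "
    "PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, "
    "PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), "
    "PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, "
    "PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, "
    "PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, "
    "HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, "
    "HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, HCT87-HCT110"
)


def parse_segment(label):
    """Return (prefix, first_table, last_table) for a label like 'PCT5-PCT16'."""
    numbers = [int(n) for n in re.findall(r"(?:PCT|HCT)(\d+)", label)]
    prefix = "PCT" if label.startswith("PCT") else "HCT"
    return prefix, numbers[0], numbers[-1]


parsed = [parse_segment(label.strip()) for label in SEGMENTS.split(",")]

# Collect every table number that at least one segment covers.
covered = {"PCT": set(), "HCT": set()}
for prefix, first, last in parsed:
    covered[prefix].update(range(first, last + 1))

# Total population, race groups, tribe categories, Hispanic or Latino groups,
# and ancestry groups, as listed in the abstract.
population_groups = 1 + 132 + 78 + 39 + 86

print(len(parsed), len(covered["PCT"]), len(covered["HCT"]), population_groups)
```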

  15. PISA Test Scores

    • kaggle.com
    zip
    Updated Dec 27, 2019
    Cite
    piAI (2019). PISA Test Scores [Dataset]. https://www.kaggle.com/datasets/econdata/pisa-test-scores/code
    Explore at:
    zip (74778 bytes)
    Dataset updated
    Dec 27, 2019
    Authors
    piAI
    Description

    Context

    The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

    The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

    Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

    Content

    grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

    male: Whether the student is male (1/0)

    raceeth: The race/ethnicity composite of the student

    preschool: Whether the student attended preschool (1/0)

    expectBachelors: Whether the student expects to obtain a bachelor's degree (1/0)

    motherHS: Whether the student's mother completed high school (1/0)

    motherBachelors: Whether the student's mother obtained a bachelor's degree (1/0)

    motherWork: Whether the student's mother has part-time or full-time work (1/0)

    fatherHS: Whether the student's father completed high school (1/0)

    fatherBachelors: Whether the student's father obtained a bachelor's degree (1/0)

    fatherWork: Whether the student's father has part-time or full-time work (1/0)

    selfBornUS: Whether the student was born in the United States of America (1/0)

    motherBornUS: Whether the student's mother was born in the United States of America (1/0)

    fatherBornUS: Whether the student's father was born in the United States of America (1/0)

    englishAtHome: Whether the student speaks English at home (1/0)

    computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)

    read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)

    minutesPerWeekEnglish: The number of minutes per week the student spends in English class

    studentsInEnglish: The number of students in this student's English class at school

    schoolHasLibrary: Whether this student's school has a library (1/0)

    publicSchool: Whether this student attends a public school (1/0)

    urban: Whether this student's school is in an urban area (1/0)

    schoolSize: The number of students in this student's school

    readingScore: The student's reading score, on a 1000-point scale
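
Before fitting any model to predict `readingScore`, it can help to have a reference point. The sketch below computes the test-set RMSE of a baseline that always predicts the mean training score. It is not the assignment's intended solution; the function names `load_scores` and `baseline_rmse` are assumptions, and the sample rows are synthetic placeholders rather than values from the dataset.

```python
import csv
import io
import math


def load_scores(handle):
    """Read readingScore values from a CSV file object that has a header row."""
    return [float(row["readingScore"]) for row in csv.DictReader(handle)]


def baseline_rmse(train_scores, test_scores):
    """RMSE on the test scores when every prediction is the training mean."""
    mean = sum(train_scores) / len(train_scores)
    squared_errors = [(score - mean) ** 2 for score in test_scores]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic rows for illustration only; they are NOT taken from the dataset.
train_csv = io.StringIO("grade,male,readingScore\n10,1,500\n10,0,520\n11,1,480\n")
test_csv = io.StringIO("grade,male,readingScore\n10,0,510\n10,1,490\n")

rmse = baseline_rmse(load_scores(train_csv), load_scores(test_csv))
print(round(rmse, 2))
```

With the real files, the `io.StringIO` objects would be replaced by `open("pisa2009train.csv", newline="")` and `open("pisa2009test.csv", newline="")`. Any model that uses the other variables should be compared against this baseline.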

    Acknowledgements

    MITx ANALYTIX

  16. ACS 5YR Demographic Estimate Data by Tract

    • data.lojic.org
    • opendata.atlantaregional.com
    • +2more
    Updated Aug 21, 2023
    Cite
    Department of Housing and Urban Development (2023). ACS 5YR Demographic Estimate Data by Tract [Dataset]. https://data.lojic.org/datasets/b0d9506fcfa04762a882c3b6453e0144
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Authors
    Department of Housing and Urban Development
    Area covered
    Description

    2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the tract level. The American Community Survey (ACS) 5 Year 2016-2020 demographic information is a subset of information available for download from the U.S. Census. Tables used in the development of this dataset include: B01001 - Sex By Age; B03002 - Hispanic Or Latino Origin By Race; B11001 - Household Type (Including Living Alone); B11005 - Households By Presence Of People Under 18 Years By Household Type; B11006 - Households By Presence Of People 60 Years And Over By Household Type; B16005 - Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over; B25010 - Average Household Size Of Occupied Housing Units By Tenure; and B15001 - Sex by Educational Attainment for the Population 18 Years and Over. To learn more about the American Community Survey (ACS) and associated datasets, visit https://www.census.gov/programs-surveys/acs. For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov. Data Dictionary: DD_ACS 5-Year Demographic Estimate Data by Tract. Date of Coverage: 2016-2020

  17. Decennial Census: State Legislative District Summary File (Sample)

    • catalog.data.gov
    Updated Jul 19, 2023
    Cite
    U.S. Census Bureau (2023). Decennial Census: State Legislative District Summary File (Sample) [Dataset]. https://catalog.data.gov/dataset/decennial-census-state-legislative-district-summary-file-sample
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Description

    The State Legislative District Summary File (Sample) (SLDSAMPLE) contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals; urban and rural; households and families; marital status; grandparents as caregivers; language and ability to speak English; ancestry; place of birth, citizenship status, and year of entry; migration; place of work; journey to work (commuting); school enrollment and educational attainment; veteran status; disability; employment status; industry, occupation, and class of worker; income; and poverty status. Housing items include basic housing totals; urban and rural; number of rooms; number of bedrooms; year moved into unit; household size and occupants per room; units in structure; year structure built; heating fuel; telephone service; plumbing and kitchen facilities; vehicles available; value of home; monthly rent; and shelter costs. The file contains subject content identical to that shown in Summary File 3 (SF 3).

  18. Census of Population and Housing, 2000 [United States]: Summary File 4,...

    • search.gesis.org
    Updated Feb 26, 2021
    Cite
    United States Department of Commerce. Bureau of the Census (2021). Census of Population and Housing, 2000 [United States]: Summary File 4, District of Columbia - Version 1 [Dataset]. http://doi.org/10.3886/ICPSR13520.v1
    Dataset updated
    Feb 26, 2021
    Dataset provided by
    ICPSR - Interuniversity Consortium for Political and Social Research
    GESIS search
    Authors
    United States Department of Commerce. Bureau of the Census
    License

    https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de457436

    Area covered
    Washington, United States
    Description

    Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.
All persons in housing units in the District of Columbia in 2000.
2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.
Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available.
The data are provided in 38 segments (files) per iteration. These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336; the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables. For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).

  19. Census of Population and Housing, 2000 [United States]: Summary File 4, West...

    • search.gesis.org
    Updated Feb 8, 2021
    Cite
    United States Department of Commerce. Bureau of the Census (2021). Census of Population and Housing, 2000 [United States]: Summary File 4, West Virginia - Archival Version [Dataset]. http://doi.org/10.3886/ICPSR13560
    Dataset updated
    Feb 8, 2021
    Dataset provided by
    ICPSR - Interuniversity Consortium for Political and Social Research
    GESIS search
    Authors
    United States Department of Commerce. Bureau of the Census
    License

    https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de446517

    Area covered
    West Virginia, United States
    Description

    Abstract (en): Summary File 4 (SF 4) from the United States 2000 Census contains the sample data, which is the information compiled from the questions asked of a sample of all people and housing units. Population items include basic population totals: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals: urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, monthly rent, and shelter costs. In Summary File 4, the sample data are presented in 213 population tables (matrices) and 110 housing tables, identified with "PCT" and "HCT" respectively. Each table is iterated for 336 population groups: the total population, 132 race groups, 78 American Indian and Alaska Native tribe categories (reflecting 39 individual tribes), 39 Hispanic or Latino groups, and 86 ancestry groups. The presentation of SF4 tables for any of the 336 population groups is subject to a population threshold. That is, if there are fewer than 100 people (100-percent count) in a specific population group in a specific geographic area, and there are fewer than 50 unweighted cases, their population and housing characteristics data are not available for that geographic area in SF4. For the ancestry iterations, only the 50 unweighted cases test can be performed. See Appendix H: Characteristic Iterations, for a complete list of characteristic iterations. 
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: Created variable labels and/or value labels.
All persons in housing units in West Virginia in 2000.
2013-05-25 Multiple Census data file segments were repackaged for distribution into a single zip archive per dataset. No changes were made to the data or documentation.
2006-01-12 All files were removed from dataset 342 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 341 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 340 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 339 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 338 and flagged as study-level files, so that they will accompany all downloads.
Because of the number of files per state in Summary File 4, ICPSR has given each state its own ICPSR study number in the range ICPSR 13512-13563. The study number for the national file is 13570. Data for each state are being released as they become available.
The data are provided in 38 segments (files) per iteration. These segments are PCT1-PCT4, PCT5-PCT16, PCT17-PCT34, PCT35-PCT37, PCT38-PCT45, PCT46-PCT49, PCT50-PCT61, PCT62-PCT67, PCT68-PCT71, PCT72-PCT76, PCT77-PCT78, PCT79-PCT81, PCT82-PCT84, PCT85-PCT86 (partial), PCT86 (partial), PCT87-PCT103, PCT104-PCT120, PCT121-PCT131, PCT132-PCT137, PCT138-PCT143, PCT144, PCT145-PCT150, PCT151-PCT156, PCT157-PCT162, PCT163-PCT208, PCT209-PCT213, HCT1-HCT9, HCT10-HCT18, HCT19-HCT22, HCT23-HCT25, HCT26-HCT29, HCT30-HCT39, HCT40-HCT55, HCT56-HCT61, HCT62-HCT70, HCT71-HCT81, HCT82-HCT86, and HCT87-HCT110. The iterations are Parts 1-336; the Geographic Header File is Part 337. The Geographic Header File is in fixed-format ASCII and the table files are in comma-delimited ASCII format. A merged iteration will have 7,963 variables. For Parts 251-336, the part names contain numbers within parentheses that refer to the Ancestry Code List (page G1 of the codebook).

  20. Census of Population and Housing, 2000: Summary File 3, Alaska

    • archive.ciser.cornell.edu
    Updated Jun 10, 2024
    Cite
    Bureau of the Census (2024). Census of Population and Housing, 2000: Summary File 3, Alaska [Dataset]. http://doi.org/10.6077/czxz-3621
    Dataset updated
    Jun 10, 2024
    Dataset authored and provided by
    Bureau of the Census
    Variables measured
    HousingUnit, Individual
    Description

    Summary File 3 contains sample data, which is the information compiled from the questions asked of a sample of all people and housing units in the United States. Population items include basic population totals as well as counts for the following characteristics: urban and rural, households and families, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status, year of entry, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, industry, occupation, class of worker, income, and poverty status. Housing items include basic housing totals and counts for urban and rural, number of rooms, number of bedrooms, year moved into unit, household size and occupants per room, units in structure, year structure built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, and monthly rent and shelter costs. The Summary File 3 population tables are identified with a "P" prefix and the housing tables are identified with an "H," followed by a sequential number. The "P" and "H" tables are shown for the block group and higher level geography, while the "PCT" and "HCT" tables are shown for the census tract and higher level geography. There are 16 "P" tables, 15 "PCT" tables, and 20 "HCT" tables that bear an alphabetic suffix on the table number, indicating that they are repeated for nine major race and Hispanic or Latino groups. There are 484 population tables and 329 housing tables for a total of 813 unique tables. (Source: downloaded from ICPSR 7/13/10)

    Please Note: This dataset is part of the historical CISER Data Archive Collection and is also available at ICPSR at https://doi.org/10.3886/ICPSR13343.v1. We highly recommend using the ICPSR version as they may make this dataset available in multiple data formats in the future.

Cite
The Devastator (2022). Top Languages Spoken in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/top-languages-spoken-in-the-united-states

Top Languages Spoken in the United States

The Impact of linguistics on Community and Business in America

Explore at:
zip (356420 bytes)
Dataset updated
Oct 22, 2022
Authors
The Devastator
Area covered
United States
Description

About this dataset

Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).

Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:

Columns:

  - index: Index column for dataframe
  - Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
  - Unnamed: 1: Rank of CBSA by total number of speakers of all languages
  - Unnamed: 2: Name of CBSA
  - Unnamed: 3: Population of CBSA
  - Unnamed: 4: Percent of population that speaks English very well
  - Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages

How to use the dataset

  1. This dataset can be used to understand the linguistic diversity of the United States, and to compare languages spoken across different states and cities.
  2. This data can also be used to explore trends in language usage over time.
  3. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.
  4. Schools could use this dataset to plan language-learning programs based on the needs of their community.
  5. Policymakers could use this data to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

Research Ideas

  1. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.
  2. Schools could use this data to plan language-learning programs based on the needs of their community.
  3. Policymakers could use this dataset to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

Acknowledgements

This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages

License

Unknown License - Please check the dataset description for more information.

Columns

File: Languages Spoken at Home by Urban Area = CBSA.csv

File: US Languages Spoken at Home 2014.csv

| Column name | Description |
|:-------------------------------------------------------------------|:--------------|
| Table with column headers in row 5 and row headers in column A | |
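
The column name "Table with column headers in row 5 and row headers in column A" suggests that the real header sits on the fifth row of the file, which would explain the `Unnamed: N` columns. A minimal sketch of reading such a layout with the standard library follows, assuming the first four rows are preamble. The function name `read_table_with_late_header` is hypothetical, and the sample text is a synthetic stand-in with placeholder values, not data from the files.

```python
import csv
import io


def read_table_with_late_header(handle, header_row=5):
    """Skip preamble rows, then use row `header_row` (1-based) as the headers."""
    reader = csv.reader(handle)
    for _ in range(header_row - 1):
        next(reader)  # discard the preamble lines above the header
    headers = next(reader)
    # Ignore blank lines; map every remaining row onto the header names.
    return [dict(zip(headers, row)) for row in reader if any(row)]


# Synthetic stand-in for the CSV layout; values are placeholders, not real data.
sample = io.StringIO(
    "title line\n"
    "\n"
    "note line\n"
    "\n"
    "Rank,CBSA,Population\n"
    "1,Example CBSA A,1000\n"
    "2,Example CBSA B,500\n"
)

rows = read_table_with_late_header(sample)
print(rows[0]["CBSA"])
```

If the actual files use a different number of preamble rows, the `header_row` argument would need to be adjusted after inspecting the first few lines.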
