This is a dataset I found online through the Google Dataset Search portal.
The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.
The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.
The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.
These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.
Sources:
Google Dataset Search: https://toolbox.google.com/datasetsearch
2009-2013 American Community Survey
Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html
Downloaded From: https://data.world/kvaughn/languages-county
Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study tests the malleability of thinking styles across cultures. Participants from China, Hong Kong, and the United States were randomly assigned to one of four conditions to manipulate thinking styles (analytical, holistic, intuitive, and control). Participants first responded to a scale measuring four thinking styles (analytical, holistic, intuitive, and normative). They then read a message to induce one of these thinking styles and responded to four scenarios regarding their decision related to the scenario, the difficulty in making the decision, their confidence in their decision, and the perceived realism of the scenario. Participants then responded to the same scale measuring the four thinking styles. Results supported the expectation that people in different cultures use predominantly different thinking styles to make decisions. The manipulation of thinking styles, however, changed people’s thinking in complex rather than direct ways. Analytical thinking, which is the predominant style used by U.S. Americans, was not as malleable as the other styles, and the American participants were less changeable in their style than participants from China and Hong Kong. In other words, and in summary, some thinking styles and some cultures seem to be more malleable than others. Implications of these results for understanding culture and cognition are discussed.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cultural Items Dataset for HW1 of the NLP course (2025)
This is the dataset for the first homework of the 2025 edition of the NLP course at Sapienza University. The dataset is a collection of Wikidata Items classified as:
Cultural Agnostic: the item is commonly known/used worldwide and no culture claims the item. Cultural Representative: the item is originated in a culture and/or claimed by a culture as their own, but other cultures know/use it or have similar items. Cultural… See the full description on the dataset page: https://huggingface.co/datasets/sapienzanlp/nlp2025_hw1_cultural_dataset.
https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Female names by a wide margin?
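The questions above reduce to simple aggregations. The sketch below is illustrative only: it assumes rows shaped like (state, gender, year, name, number), which is an assumed column layout rather than one stated in this description, and it uses a handful of invented rows in place of the real table. The helper names `top_names` and `distinct_names_by_gender` are hypothetical.

```python
from collections import Counter

# Hypothetical rows (state, gender, year, name, number).
# The counts are invented for illustration and are not real SSA figures.
rows = [
    ("CA", "F", 1990, "Jessica", 120),
    ("CA", "M", 1990, "Michael", 150),
    ("NY", "F", 1990, "Jessica", 80),
    ("NY", "F", 1990, "Ashley", 90),
    ("NY", "M", 1990, "Michael", 60),
]


def top_names(rows, gender=None, k=3):
    """Return the k names with the largest total count, optionally for one gender."""
    totals = Counter()
    for _state, g, _year, name, number in rows:
        if gender is None or g == gender:
            totals[name] += number
    return totals.most_common(k)


def distinct_names_by_gender(rows):
    """Count how many distinct names appear for each gender."""
    seen = {}
    for _state, g, _year, name, _number in rows:
        seen.setdefault(g, set()).add(name)
    return {g: len(names) for g, names in seen.items()}


print(top_names(rows))               # most common names overall
print(top_names(rows, gender="F"))   # most common female names
print(distinct_names_by_gender(rows))
```

Because names with fewer than 5 occurrences are withheld, any count of distinct names computed this way would be a lower bound on the true number.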
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Even people from frequently studied cultural contexts differ in how they conceptualize compassion, partly because of differences in how much they want to avoid feeling negative. To broaden this past work, we include participants from an understudied cultural context and start to examine the process through which culture shapes compassion. Based on ethnographic and empirical studies that include Ecuadorians, we predicted that Ecuadorians would want to avoid feeling negative less compared to U.S. Americans. Furthermore, we hypothesized that because of these differences in avoided negative affect, compared to U.S. Americans, for Ecuadorians, a compassionate response would contain more emotion sharing, which in turn would be associated with conceptualizing a compassionate face as one that mirrors sadness more and expresses happiness (e.g., a kind smile) less. Using a reverse correlation task, participants in the U.S. and Ecuador selected the stimuli that most resembled a compassionate face. They also reported how much they wanted to avoid feeling negative and described what a compassionate response would entail. As predicted, compared to U.S. Americans, Ecuadorians wanted to avoid feeling negative less, they conceptualized a compassionate response as one that focused more on emotion sharing, and visualized a compassionate face as one that contained more sadness and less happiness. Furthermore, exploratory analyses suggest that wanting to avoid feeling negative and conceptualizations of a compassionate response as emotion sharing partly sequentially explained the cultural differences in conceptualizations of a compassionate face. What people regard as compassionate differs across cultures, which has important implications for cross-cultural counseling.
https://www.icpsr.umich.edu/web/ICPSR/studies/36801/terms
The 2015 American Housing Survey marks the first release of a newly integrated national sample and independent metropolitan area samples. The 2015 release features many variable name revisions, as well as the integration of an AHS Codebook Interactive Tool available on the U.S. Census Bureau Web site. This data collection provides information on the characteristics of a national sample of housing units in 2015, including apartments, single-family homes, mobile homes, and vacant housing units. Data from the 15 largest metropolitan areas in the United States are included in the national sample survey (the AHS 2015 Metropolitan Data are also available as ICPSR 36805). The data are presented in three separate parts: Part 1, Household Record (Main Record), Part 2, Person Record, and Part 3, Project Record. Household Record data includes questions about household occupancy and tenure, household exterior and interior structural features, household equipment and appliances, housing problems, housing costs, home improvement, neighborhood features, recent moving information, income, and basic demographic information. The household record data also features four rotating topical modules: Arts and Culture, Food Security, Housing Counseling, and Healthy Homes. Person Record data includes questions about personal disabilities, income, and basic demographic information. Finally, the Project Record data includes questions about home improvement projects. Specific questions were asked about the types of projects, costs, funding sources, and year of completion.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Previous research indicates that cultural variations exist in conceptualizations of compassion, potentially attributable to the extent to which individuals in diverse cultural settings want to avoid (versus accept) feeling negative emotions and the significance they place on emotional sharing as a component of compassion. The present study investigates the conceptualization of compassion among individuals in Mexico and the United States, aiming to understand why these cultural differences occur. We hypothesized that Mexicans (1) would want to avoid feeling negative less, (2) would consequently regard emotion sharing as a more critical element of a compassionate response, and (3) would therefore conceptualize a compassionate face as one that mirrors sadness more and expresses happiness less compared to U.S. Americans. Participants from Mexico and the United States engaged in a reverse correlation task, selecting stimuli that most closely resembled a compassionate face. The selected images were aggregated and coded for the extent of sadness and happiness depicted. Additionally, participants indicated how much they wanted to avoid feeling negative and, by using an open-ended format, described what a compassionate response would entail in their view. These responses were coded for whether or not they focused on emotion sharing. Consistent with our hypotheses, Mexicans, who want to avoid feeling negative less compared to U.S. Americans, place greater importance on emotion sharing in a compassionate response. This variation is associated with Mexicans conceptualizing a compassionate face as one that portrays more sadness and less happiness compared to U.S. Americans. People in different cultural contexts have different views about what compassion might entail. 
Understanding and embracing these cultural differences in compassion can help us navigate our increasingly multicultural world, fostering more meaningful connections and guiding our actions with more humility and sensitivity.
https://www.icpsr.umich.edu/web/ICPSR/studies/36357/terms
The Arts and Cultural Production Satellite Account (ACPSA) is produced through the partnership between the United States Bureau of Economic Analysis (BEA) and the National Endowment for the Arts (NEA). Built with the BEA's input-output (I-O) accounts, the ACPSA provides detailed statistics that illustrate the impact of arts and cultural production on the United States economy. Specifically, this account provides an assessment of the arts and cultural sector's contributions to gross domestic product (GDP). For years 1998 to 2020, the ACPSA presents annual statistics about the following items: (1) Output of detailed arts and cultural commodities and the industries producing these commodities; (2) employment and compensation within these industries; (3) arts and cultural value added by industry; and (4) commodity-flow details for arts and cultural production products. In the data tables provided, the statistics fall under two broad categories: (1) core arts and cultural production and (2) supporting arts and cultural production. The core category contains the commodities in which the output primarily contributes to arts and culture. Performing arts, museums, design services, and arts education are included in the core category. The supporting category consists of commodities that support the core category through publication, dissemination of the creative process, or other supportive functions. This category contains event promotion, printing, and broadcasting. The seven national-level data tables provided for each year from 1998 to 2020 include:
Table 1. Production of Commodities by Industry
Table 2. Output and Value Added by Industry
Table 3. Supply and Consumption of Commodities
Table 4. Employment and Compensation of Employees by Industry
Table 5. Total ACPSA-related Employment by Industry
Table 6. Output by ACPSA Commodity
Table 7. Real Output by Commodity
For years 2001-2020 a state-level value added and employment data table is included.
It contains value added by industry by state, estimates for each state annually of employment and compensation by industry, and comparisons with ACPSA employment and compensation by industry the same year. It also includes the annual total of employment in each state across the arts and cultural commodities industries. In addition, estimates of real value added by industry and estimates of real gross output and prices indexes by ACPSA commodity are provided in separate Excel files. The industries and commodities presented in the data are based on the 2007 North American Industrial Classification System (NAICS). Users are encouraged to review the Table Guide as it gives important information for all data tables. Also, users should review the NEA Guide to the U.S. Arts and Cultural Production Satellite Account and the latest Arts Data Profile Series reports dedicated to the ACPSA: The U.S. Arts and Cultural Production Satellite Account (1998-2020) and State-Level Estimates of the Arts' Economic Value and Employment (2001-2020).
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Cultural Nuances Dataset V1: Understanding Cross-Cultural Differences with Chain of Thought Reasoning
Description: Dive into the intricate world of cultural differences with the "Cultural Nuances Dataset V1." This open-source resource (MIT licensed) presents a carefully curated collection of question-and-answer pairs designed to train AI models in understanding the subtle yet significant variations in language, behavior, decision-making, and social norms across diverse cultures.… See the full description on the dataset page: https://huggingface.co/datasets/moremilk/CoT-Reasoning_Cultural_Nuances.
According to a report published by UNESCO in February 2022, cultural and creative industries accounted for 3.1 percent of the global gross domestic product and 6.2 percent of global employment in 2020. That year, the coronavirus (COVID-19) pandemic set unprecedented challenges for this market. Overall, due to the pandemic, it was estimated that cultural and creative industries worldwide lost around 750 billion U.S. dollars in gross value added (GVA) and 10 million jobs in 2020.
In order to develop various methods of comparable data collection on health and health system responsiveness, WHO started a scientific survey study in 2000-2001. This study has used a common survey instrument in nationally representative populations with modular structure for assessing health of individuals in various domains, health system responsiveness, household health care expenditures, and additional modules in other areas such as adult mortality and health state valuations.
The health module of the survey instrument was based on selected domains of the International Classification of Functioning, Disability and Health (ICF) and was developed after a rigorous scientific review of various existing assessment instruments. The responsiveness module has been the result of ongoing work over the last 2 years that has involved international consultations with experts and key informants and has been informed by the scientific literature and pilot studies.
Questions on household expenditure and proportionate expenditure on health have been borrowed from existing surveys. The survey instrument has been developed in multiple languages using cognitive interviews and cultural applicability tests, stringent psychometric tests for reliability (i.e. test-retest reliability to demonstrate the stability of application) and most importantly, utilizing novel psychometric techniques for cross-population comparability.
The study was carried out in 61 countries completing 71 surveys because two different modes were intentionally used for comparison purposes in 10 countries. Surveys were conducted in different modes: 90-minute in-person household interviews in 14 countries; brief face-to-face interviews in 27 countries; computerized telephone interviews in 2 countries; and postal surveys in 28 countries. All samples were selected from nationally representative sampling frames with a known probability so as to make estimates based on general population parameters.
The survey study tested novel techniques to control the reporting bias between different groups of people in different cultures or demographic groups (i.e. differential item functioning) so as to produce comparable estimates across cultures and groups. To achieve comparability, the self-reports of individuals of their own health were calibrated against well-known performance tests (i.e. self-report vision was measured against standard Snellen's visual acuity test) or against short descriptions in vignettes that marked known anchor points of difficulty (e.g. people with different levels of mobility such as a paraplegic person or an athlete who runs 4 km each day) so as to adjust the responses for comparability. The same method was also used for self-reports of individuals assessing responsiveness of their health systems where vignettes on different responsiveness domains describing different levels of responsiveness were used to calibrate the individual responses.
These data are useful in their own right to standardize indicators for different domains of health (such as cognition, mobility, self care, affect, usual activities, pain, social participation, etc.) but also provide a better measurement basis for assessing health of the populations in a comparable manner. The data from the surveys can be fed into composite measures such as "Healthy Life Expectancy" and improve the empirical data input for health information systems in different regions of the world. Data from the surveys were also useful to improve the measurement of the responsiveness of different health systems to the legitimate expectations of the population.
Sample survey data [ssd]
A sample of 5,000 households across the US was purchased from Survey Sampling, Inc. located in Connecticut. This sample is based on Random Digit samples.
This sample was stratified by state to match the percentage of U.S. residents living in each of the fifty states.
The 5,000 sampled households were randomly assigned to one of three different experimental treatments (normal, personalized, and personalized plus $2 incentive).
The experiment was done for purposes of evaluating response rate effects of alternative means of contacting US residents.
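The random assignment described above can be sketched as follows. This is an illustration, not the procedure the study actually used: the description does not say whether the three groups were of equal size, so the near-equal split, the fixed seed, the integer household IDs, and the function name `assign_treatments` are all assumptions.

```python
import random
from collections import Counter

# Treatment labels taken from the description; the identifiers are our own.
TREATMENTS = ["normal", "personalized", "personalized_plus_incentive"]


def assign_treatments(household_ids, seed=0):
    """Shuffle the households, then deal them out to treatments in turn.

    Dealing in turn after a shuffle gives a random assignment with group
    sizes that differ by at most one (an assumption, see lead-in).
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(household_ids)
    rng.shuffle(shuffled)
    return {hid: TREATMENTS[i % len(TREATMENTS)] for i, hid in enumerate(shuffled)}


assignment = assign_treatments(range(5000))
print(Counter(assignment.values()))
```

Response rates could then be compared across the three groups to evaluate the contact methods.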
Mail Questionnaire [mail]
Data Coding
At each site the data was coded by investigators to indicate the respondent status and the selection of the modules for each respondent within the survey design. After the interview was edited by the supervisor and considered adequate it was entered locally.
Data Entry Program
A data entry program was developed in WHO specifically for the survey study and provided to the sites. It was developed using a database program called the I-Shell (short for Interview Shell), a tool designed for easy development of computerized questionnaires and data entry (34). This program allows for easy data cleaning and processing.
The data entry program checked for inconsistencies and validated the entries in each field by checking for valid response categories and range checks. For example, the program didn’t accept an age greater than 120. For almost all of the variables there existed a range or a list of possible values that the program checked for.
In addition, the data was entered twice to capture other data entry errors. The data entry program was able to warn the user whenever a value that did not match the first entry was entered at the second data entry. In this case the program asked the user to resolve the conflict by choosing either the 1st or the 2nd data entry value to be able to continue. After the second data entry was completed successfully, the data entry program placed a mark in the database in order to enable the checking of whether this process had been completed for each and every case.
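The range checks and double-entry comparison described above can be sketched in a few lines. This is not the I-Shell program, whose internals are not described here. Only the age limit of 120 comes from the text; the `sex` codes, the field names, and the function names are hypothetical.

```python
# Hypothetical per-field rules. Only the age ceiling of 120 is from the text.
FIELD_RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "sex": lambda v: v in {"M", "F"},
}


def validate(record):
    """Return the names of fields that are missing or fail their rule."""
    return [
        field
        for field, is_valid in FIELD_RULES.items()
        if field not in record or not is_valid(record[field])
    ]


def double_entry_conflicts(first, second):
    """Return {field: (first_value, second_value)} where the two entries differ."""
    fields = sorted(set(first) | set(second))
    return {
        f: (first.get(f), second.get(f))
        for f in fields
        if first.get(f) != second.get(f)
    }


print(validate({"age": 130, "sex": "F"}))
print(double_entry_conflicts({"age": 45, "sex": "F"}, {"age": 54, "sex": "F"}))
```

In a workflow like the one described, each conflict returned by `double_entry_conflicts` would be shown to the operator, who picks the first or second value before the case is marked as complete.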
Data Transfer
The data entry program was capable of exporting the data that was entered into one compressed database file which could be easily sent to WHO using email attachments or a file transfer program onto a secure server no matter how many cases were in the file. The sites were allowed the use of as many computers and as many data entry personnel as they wanted. Each computer used for this purpose produced one file and they were merged once they were delivered to WHO with the help of other programs that were built for automating the process. The sites sent the data periodically as they collected it enabling the checking procedures and preliminary analyses in the early stages of the data collection.
Data quality checks
Once the data was received it was analyzed for missing information, invalid responses and representativeness. Inconsistencies were also noted and reported back to sites.
Data Cleaning and Feedback
After receipt of cleaned data from sites, another program was run to check for missing information, incorrect information (e.g. wrong use of center codes), duplicated data, etc. The output of this program was fed back to sites regularly. Mainly, this consisted of cases with duplicate IDs, duplicate cases (where the data for two respondents with different IDs were identical), wrong country codes, missing age, sex, education and some other important variables.
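The cleaning checks listed above (duplicate IDs, duplicate cases under different IDs, and missing key variables) can be sketched as follows. The actual WHO cleaning program is not described in detail, so the record layout, the field names, and the function names here are hypothetical, and the sample cases are invented.

```python
from collections import defaultdict


def find_duplicate_ids(cases):
    """Return IDs that appear on more than one case."""
    counts = defaultdict(int)
    for case in cases:
        counts[case["id"]] += 1
    return sorted(i for i, n in counts.items() if n > 1)


def find_duplicate_cases(cases):
    """Return groups of different IDs whose remaining data are identical."""
    groups = defaultdict(set)
    for case in cases:
        # Key on every field except the ID, so identical answers collide.
        key = tuple(sorted((k, v) for k, v in case.items() if k != "id"))
        groups[key].add(case["id"])
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


def missing_key_variables(cases, required=("age", "sex", "education")):
    """Return (id, [missing fields]) for each case lacking a required value."""
    report = []
    for case in cases:
        missing = [v for v in required if case.get(v) is None]
        if missing:
            report.append((case["id"], missing))
    return report


# Invented example cases.
cases = [
    {"id": 1, "age": 30, "sex": "F", "education": 12},
    {"id": 2, "age": 30, "sex": "F", "education": 12},    # same data as ID 1
    {"id": 3, "age": None, "sex": "M", "education": 16},  # missing age
    {"id": 3, "age": 41, "sex": "M", "education": 16},    # repeated ID
]
print(find_duplicate_ids(cases))
print(find_duplicate_cases(cases))
print(missing_key_variables(cases))
```

A report built from these three lists is the kind of output that could be fed back to the sites for correction.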
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Past research has shown that culture can form and shape our temporal orientation–the relative emphasis on the past, present, or future. However, there are mixed findings on how temporal orientations vary between North American and East Asian cultures due to the limitations of survey methodology and sampling. In this study, we applied an inductive approach and leveraged big data and natural language processing between two popular social media platforms–Twitter and Weibo–to assess the similarities and differences in temporal orientation in the United States of America and China, respectively. We first established predictive models from annotation data and used them to classify a larger set of English Twitter sentences (NTW = 1,549,136) and a larger set of Chinese Weibo sentences (NWB = 95,181) into four temporal categories–past, future, atemporal present, and temporal present. Results show that there is no significant difference between Twitter and Weibo on past or future orientations; the large temporal orientation difference between North Americans and Chinese derives from their different prevailing focus on atemporal (e.g., facts, ideas) present (Twitter) or temporal present (e.g., the “here” and “now”) (Weibo). Our findings contribute to the debate on cultural differences in temporal orientations with new perspectives following a new methodological approach. The study’s implications call for a reevaluation of how temporal orientation is measured in cross-cultural studies, emphasizing the use of large-scale language data and acknowledging the atemporal present category. Understanding temporal orientations can guide effective cross-cultural communication strategies to tailor approaches for different audiences based on temporal orientations, enhancing intercultural understanding and engagement.
The Documentary Policy Mission of the Ministry of Culture publishes the results of its international watch each month in open data.
Find in map form a selection of articles from cultural actors and the foreign press concerning the cultural policies of the different countries of the world.
Corpus:
* local and international sources
* free or partial access sources
* three languages: French, English, Spanish
The filter data is sorted in alphabetical order, so don't hesitate to scroll through the lists. Since the display of the number of data is limited, some lists may be truncated. Initiated in April 2023, this watch is likely to change: do not hesitate to contact us with any suggestions for improvement at diane.gaudron@culture.gouv.fr
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
National cultures and cultural differences provide a crucial component of the international business (IB) research context. We conducted a bibliometric study of articles published in seven leading IB journals over a period of three decades to analyze how national culture has been impacting IB research. Co-citation mappings permit us to identify the ties binding works dealing with culture and cultural issues in IB. We identify two main clusters of research, each comprising two sub-clusters, with Hofstede’s (1980) work delineating much of the conceptual and empirical approach to culture-related studies. One main cluster entails works on the conceptualization of culture and its dimensions and the other cluster focuses on cultural distance. This conceptual framework captures the extant IB research incorporating culture-related concepts and influences.
This data collection consists of pilot data measuring task equivalence for measures of attention and interpretation bias. Congruent Mandarin and English emotional Stroop, attention probe (both measuring attention bias) and similarity ratings task and scrambled sentence task (both measuring interpretation bias) were developed using back-translation and decentering procedures. Tasks were then completed by 47 bilingual Mandarin-English speakers. Presented are data detailing personal characteristics, task scores and bias scores.
The way in which we process information in the world around us has a significant effect on our health and well being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past and assume the worst, when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses.
In contrast, those with positive cognitive biases tend to function well and remain healthy. To date most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents.
Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.
https://spdx.org/licenses/CC0-1.0.html
Institutions of higher education (IHE) throughout the United States have a long history of acting out various levels of commitment to diversity advancement, equity, and inclusion (DEI). Despite decades of DEI “efforts,” the academy is fraught with legacies of racism that uphold white supremacy and prevent marginalized populations from full participation. Furthermore, politicians have not only weaponized education but passed legislation to actively ban DEI programs and censor general education curricula (https://tinyurl.com/antiDEI). Ironically, systems of oppression are particularly apparent in the fields of Ecology, Evolution, and Conservation Biology (EECB)–which recognize biological diversity as essential for ecological integrity and resilience. Yet, amongst EECB faculty, people who do not identify as cis-heterosexual, non-disabled, affluent white males are poorly represented. Furthermore, IHE lack metrics to quantify DEI as a priority. Here we show that only 30.3% of US-faculty positions advertised in EECB from Jan 2019-May 2020 required a diversity statement; diversity statement requirements did not correspond with state-level diversity metrics. Though many announcements “encourage women and minorities to apply,” empirical evidence demonstrates that hiring committees at most institutions did not prioritize an applicant’s DEI advancement potential. We suggest a model for change and call on administrators and faculty to implement SMART (i.e., Specific, Measurable, Achievable, Realistic, and Timely) strategies for DEI advancement across IHE throughout the United States. We anticipate our quantification of diversity statement requirements relative to other application materials will motivate institutional change in both policy and practice when evaluating a candidate’s potential “fit”. IHE must embrace a leadership role to not only shift the academic culture to one that upholds DEI, but to educate and include people who represent the full diversity of our society. 
In the current context of political censure of education including book banning and backlash aimed at Critical Race Theory, which further reinforce systemic white supremacy, academic integrity and justice are more critical than ever.
Methods
Here we investigated the (lack of) process in faculty searches at IHE for evaluating candidates’ ability to advance DEI objectives. We quantified the prevalence of required diversity statements relative to research and/or teaching statements for all faculty positions posted to the Eco-Evo Jobs Board (http://ecoevojobs.net) from January 2019 - May 2020 as a proxy for institutional DEI prioritization (Supplement). We also mapped the job posts that required diversity statements geographically to gauge whether and where diversity is valued in higher education across the US.
Data analysis
We pulled all faculty jobs posted on Eco-Evo jobs board (http://ecoevojobs.net) from Jan 1, 2019, to May 31, 2020. For each position, we recorded the Location (i.e., state), Subject Area, Closing Date, Rank, whether or not the position is Tenure Track, and individual application materials (i.e., Research statement, Teaching statement, combined Teaching and Research statement, Diversity statement, Mentorship statement). Of the 543 faculty positions posted during this time, we eliminated 299 posts because the web links were broken or application information was no longer available (i.e., “NA”), leaving 244 faculty job posts. For each of the retained posts, we coded the requirement of teaching, research, diversity, and/or mentorship statements as follows:
“Yes” = statement required
“No” = statement not required
“Other” = application materials did not explicitly require a Diversity Statement (i.e., it was optional or suggested that applicants include a statement on diversity and inclusion as a component of their teaching and/or research statement or in their cover letter)
Data visualization
We created a Sankey diagram using Sankey Flow Show (THORTEC Software GmbH: www.sankeyflowshow.com) to compare diversity and representation from the general population through Science, Technology, Engineering, and Mathematics (STEM) academia (a career hierarchy often referred to as the “leaky pipeline”). We procured population data from the US Census Bureau (US Department of Commerce: https://www.census.gov/quickfacts/fact/table/US/PST045219) and quantified the diversity/representation in Ecology (https://datausa.io/profile/cip/ecology-evolution-systematics-population-biology#demographics) and Conservation Biology (https://datausa.io/profile/cip/conservation-biology) using Data USA (developed by Deloitte Touche Tohmatsu Limited and Datawheel). We used the 2015 Diversity Index (produced by PolicyLink and the USC Program for Environmental and Regional Equity: https://nationalequityatlas.org/indicators/Diversity_index/Ranking:33271/United_States/false/Year(s):2015/) to quantify relative ethnic diversity per state, and graphed Figure 2B using the tidyverse, rgdal, broom, and rgeos packages in R (see Base code used to produce Figure 2 in R, below). The Diversity Index measures the representation of White, Black, Latino, Asian/Pacific Islander, Native American, and Mixed/other race in a given population. A maximum possible diversity score (1.79) would indicate even representation of all six ethnic/racial groups. We checked all figures using the Color Blindness Simulator (Colblindor: https://www.color-blindness.com/coblis-color-blindness-simulator/) to maintain inclusivity.
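The stated maximum of 1.79 is consistent with an entropy-style index over six groups, since ln(6) ≈ 1.79. The sketch below assumes the index is computed as -Σ p·ln(p) over the group shares; that formula is an assumption inferred from the stated maximum, not something this description specifies, and the skewed shares are made up for illustration.

```python
import math

def diversity_index(shares):
    """Entropy-style index: -sum(p * ln p) over groups with a nonzero share.

    Assumed formula; shares are normalized so they need not sum to 1.
    """
    total = sum(shares)
    return -sum((s / total) * math.log(s / total) for s in shares if s > 0)

# Even representation of six groups gives the maximum possible score, ln(6).
max_score = diversity_index([1 / 6] * 6)

# Hypothetical, illustrative shares for a less evenly represented population.
skewed_score = diversity_index([0.75, 0.10, 0.08, 0.04, 0.02, 0.01])

print(f"maximum (even) score: {max_score:.2f}")
print(f"skewed example score: {skewed_score:.2f}")
```

Under this assumption, a population made up of a single group scores 0, and scores rise toward ln(6) as the six groups approach equal shares.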
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip file contains fourteen Parquet [1] files of two kinds, one of each kind for each of the seven years from 2015 to 2021 inclusive:
- region_counts: for every word found, gives how many times it appeared, regardless of capitalization ("count" column), how many times it appeared with at least one capitalized letter ("count_upper"), in how many different counties it appeared ("nr_cells"), and whether we considered it to be a proper noun ("is_proper")
- raw_cell_counts: gives the count for every word by county, regardless of capitalization.
These counts were obtained from geo-tagged Tweets posted in those years within the contiguous US, which were collected through the streaming API of Twitter, more specifically using the “statuses/filter” end-point [2]. See the project's paper for more details on methodology, and the code repository to reproduce the analysis.
The two text files are our lists of excluded word forms.
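To make the relationship between the two kinds of table concrete, here is a minimal sketch that rolls per-county rows up into a region-level total and a county count per word. The rows, the key names "word" and "cell", and the county identifiers are hypothetical stand-ins, and whether the published region_counts files were derived exactly this way is an assumption.

```python
from collections import defaultdict

# Hypothetical per-county rows mimicking raw_cell_counts; the key names
# "word" and "cell" and all values are assumptions for illustration only.
raw_cell_counts = [
    {"word": "creek", "cell": "01001", "count": 3},
    {"word": "creek", "cell": "01003", "count": 5},
    {"word": "soda", "cell": "01001", "count": 2},
]

def aggregate_region(rows):
    """Sum each word's count over counties and count the counties it appears in."""
    totals = defaultdict(lambda: {"count": 0, "cells": set()})
    for row in rows:
        if row["count"] > 0:
            entry = totals[row["word"]]
            entry["count"] += row["count"]
            entry["cells"].add(row["cell"])
    return {
        word: {"count": entry["count"], "nr_cells": len(entry["cells"])}
        for word, entry in totals.items()
    }

region = aggregate_region(raw_cell_counts)
for word, stats in sorted(region.items()):
    print(word, stats["count"], stats["nr_cells"])
```

The "count_upper" and "is_proper" columns cannot be rebuilt from case-insensitive per-county counts, so they are left out of this sketch.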
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
BLEnD
This is the official repository of BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages (Submitted to NeurIPS 2024 Datasets and Benchmarks Track).
24/12/05: Updated translation errors
25/05/02: Updated multiple choice questions file (v1.1)
About
Large language models (LLMs) often lack culture-specific everyday knowledge, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural… See the full description on the dataset page: https://huggingface.co/datasets/nayeon212/BLEnD.
alielfilali01/MA-Culture-Vision-v0.2 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de653307
Abstract (en): The Study of Women's Health Across the Nation (SWAN) is a multi-site longitudinal, epidemiologic study designed to examine the health of women during their middle years. The study examines the physical, biological, psychological, and social changes during this transitional period. The goal of SWAN's research is to help scientists, health care providers, and women learn how mid-life experiences affect health and quality of life during aging. The data include questions about doctor visits, medical conditions, medications, treatments, medical procedures, relationships, smoking, and menopause related information. The study is co-sponsored by the National Institute on Aging (NIA), the National Institute of Nursing Research (NINR), the National Institutes of Health (NIH), and the NIH Office of Research on Women's Health. The study began in 1994. Between 1999 and 2001, 2,710 of the 3,302 women that joined SWAN were seen for their third follow-up visit. The research centers are located in the following communities: Detroit, Michigan; Boston, Massachusetts; Chicago, Illinois; Oakland and Los Angeles, California; Newark, New Jersey; and Pittsburgh, Pennsylvania. SWAN participants represent five racial/ethnic groups and a variety of backgrounds and cultures.
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: created variable labels and/or value labels; created an online analysis version with question text; checked for undocumented or out-of-range codes.
Presence of Common Scales: Raw data can be used to create CES-D and SF-36 scores.
Response Rates: 16,065 completed the screening interview. 3,302 were enrolled in the longitudinal study. 2,881 completed the first follow-up visit. 2,748 completed the second follow-up visit. 2,710 completed the third follow-up visit.
Datasets: DS1: Study of Women's Health Across the Nation (SWAN): Visit 03 Dataset, [United States], 1999-2001
Women age 40 through 55, living in designated geographic areas, with the ability to speak English or other designated languages (Japanese, Cantonese, or Spanish), who had the cognitive ability to provide verbal informed consent, and had membership in a specific site's targeted ethnic group.
Smallest Geographic Unit: None
Site-specific sampling frames were used and encompassed a range of types, including lists of households, telephone numbers, and individual names of women.
2019-05-29: This data collection has been enhanced in the following ways. The title of the study was updated to match current ICPSR standards. Variable labels have been revised to spell out abbreviations and acronyms, and to correct prior misspellings. The variables in the dataset have also been reordered to match the documentation provided by the Principal Investigator. A fuller version of the question text pertaining to individual variables was completed, and is now available in the ICPSR codebook. An additional document was included in this release that lists all the publications based off of the SWAN data series. Lastly, the study is now available for online analysis.
2018-08-22: The data were updated to adjust missing values.
2014-02-12: This data collection is now publicly available.
Funding institution(s): United States Department of Health and Human Services. National Institutes of Health (NR004061). United States Department of Health and Human Services. National Institutes of Health. National Institute on Aging (AG012495, AG012505, AG012539, AG012546, AG012553, AG012554). United States Department of Health and Human Services. National Institutes of Health. National Institute of Nursing Research (AG012535). United States Department of Health and Human Services. National Institutes of Health. Office of Research on Women's Health (AG012531).
face-to-face interview; self-enumerated questionnaire
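As a quick arithmetic check on the SWAN response counts listed above, the sketch below expresses each follow-up visit as a share of the 3,302 enrolled women. Using enrollment as the denominator is a choice made here for illustration; the description itself reports only the raw counts, and the variable names are hypothetical.

```python
# Counts taken from the "Response Rates" figures in the description.
screened = 16065
enrolled = 3302
follow_up_completed = {"visit_01": 2881, "visit_02": 2748, "visit_03": 2710}

# Share of screened women who enrolled in the longitudinal study.
enrollment_share = enrolled / screened

# Share of enrolled women who completed each follow-up visit.
retention = {visit: n / enrolled for visit, n in follow_up_completed.items()}

print(f"enrolled / screened: {enrollment_share:.1%}")
for visit, share in retention.items():
    print(f"{visit}: {share:.1%} of enrolled")
```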
Banner and thumbnail photo by Farzad Mohsenvand on Unsplash