26 datasets found
  1. Income of individuals by age group, sex and income source, Canada, provinces...

    • www150.statcan.gc.ca
    • ouvert.canada.ca
    Updated May 1, 2025
    Cite
    Government of Canada, Statistics Canada (2025). Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas [Dataset]. http://doi.org/10.25318/1110023901-eng
    Explore at:
    Dataset updated
    May 1, 2025
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    Area covered
    Canada
    Description

    Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas, annual.

  2. Supporting data for “Family and Work of Middle-Class Women with Two Children...

    • datahub.hku.hk
    Updated Sep 7, 2022
    Cite
    Yixi Chen (2022). Supporting data for “Family and Work of Middle-Class Women with Two Children under the Universal Two-Child Policy in Urban China” [Dataset]. http://doi.org/10.25442/hku.20579436.v1
    Explore at:
    Dataset updated
    Sep 7, 2022
    Dataset provided by
    HKU Data Repository
    Authors
    Yixi Chen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset is a file of the raw interview transcripts from my interviews, recorded during fieldwork conducted between June 2021 and February 2022.

    This thesis investigates how urban middle-class working women with two children make sense of work, childcare, and self under China's universal two-child policy. It also explores how the ideas of individual and family interact in these women's construction of a sense of self. On January 1st, 2016, the one-child policy was replaced by the universal two-child policy, under which all married couples in China are allowed to have two children. Scholarship on motherhood widely documents, across cultures, that motherhood is a site of patriarchal oppression: women are expected to meet the unrealistic ideal of intensive mothering in order to be good mothers, suffer from the motherhood wage penalty, and face more work-family conflict than fathers. Empirical studies of China have reached similar conclusions, and such findings are not only widely recognized in scholarship but also widespread in popular discourse in China. Although marriage and having children are still universal for the generation under study, women born in the 1970s and 1980s, the compounding influence of the one-child policy, the increasing financial burden of raising a child, and other factors have made having only one child widely acceptable and normal. Given this context, this study intends to investigate how these middle-class women, who are relatively empowered and resourceful, come to a decision that is seemingly against their own interest. Moreover, unlike in the West, where childbearing and childcare are mainly a matter for the conjugal couple and gender relations are at the center of the discussion, in China the extended family, especially grandparents, also plays a role in both the decision-making process and the subsequent childcare arrangement. Therefore, to study second-time mothers’ childcare and work experiences in contemporary urban China, we also need to situate them, as individuals, in their families.

    To investigate how they make sense of childcare and work is also to understand the tension between individual and family. By interviewing twenty-one parents from middle-class families in Guangzhou with a second child under six years old, this study finds that these urban working women with two children consider themselves an individual unit, and that full-time paid employment cannot be given up since it is the means of securing that independent self. However, they did not prioritize their personal interest over that of other family members, especially the elder child, and thus the decision to have a second child was made mainly for the sake of the elder child. Moreover, grandparents played an essential role in providing a childcare safety net, without which these urban working women would not be able to work full-time and maintain the independent self as they defined it. The portrayal of these women’s experiences reflects the individualization process in China, in which people are individualized without individualism and family is evoked as a strategy to achieve personal as well as family goals. The findings of this study contribute to theories of motherhood by adding an intergenerational perspective to the existing gender perspective, and also contribute to studies of the family by clarifying the relation and interaction between individual and family in these women’s construction of a sense of self in the context of contemporary China.

  3. Households by annual income India FY 2021

    • statista.com
    Updated Jun 23, 2025
    Cite
    Statista (2025). Households by annual income India FY 2021 [Dataset]. https://www.statista.com/statistics/482584/india-households-by-annual-income/
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    India
    Description

    In the financial year 2021, a majority of Indian households fell under the aspirers category, earning between ******* and ******* Indian rupees a year. On the other hand, about ***** percent of households that same year accounted for the rich, earning over * million rupees annually. The middle class more than doubled that year compared to ** percent in financial year 2005.

    Middle-class income group and the COVID-19 pandemic

    During the COVID-19 pandemic, specifically during the lockdown in March 2020, loss of income hit the entire household income spectrum. However, research showed that the most severely affected groups were the upper-middle- and middle-class income brackets. In addition, unemployment was rampant nationwide, which further led to a dismally low GDP. Despite job recoveries over the last few months, the improvement in incomes was insignificant.

    Economic inequality

    While India may be one of the fastest growing economies in the world, it is also one of the most vulnerable and severely afflicted economies in terms of economic inequality. The vast discrepancy between the rich and poor has been prominent for the last ***** decades. The rich continue to grow richer at a faster pace while the impoverished struggle more than ever before to earn a minimum wage. The widening gaps in the economic structure affect women and children the most. This is a call for reinforcement in the country’s social structure that emphasizes access to quality education and universal healthcare services.

  4. Money, Morals, and Manners: The Culture of the French and the American...

    • dataverse.harvard.edu
    Updated Feb 19, 2022
    Cite
    Michèle Lamont (2022). Money, Morals, and Manners: The Culture of the French and the American Upper-Middle Class, 1986-1988 [Dataset]. http://doi.org/10.7910/DVN/1AVX7P
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Michèle Lamont
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/5.4/customlicense?persistentId=doi:10.7910/DVN/1AVX7P

    Time period covered
    1986 - 1988
    Area covered
    United States
    Description

    The purpose of this study was to compare how members of the French and American upper-middle class define being a "worthy person," and to explain the important cross-national differences in these definitions by examining broad cultural and structural features of French and American society. Subjects were 160 college-educated, white male professionals, managers, and businessmen who lived in and around Indianapolis, New York, Paris, and Clermont-Ferrand. Respondents were randomly chosen from the phone directories of middle- and upper-middle-class suburbs and neighborhoods. Brief phone interviews were conducted to determine availability and eligibility. The final participants were matched as closely as possible by level of education and occupation. Data collection centered on 2-hour semi-directed interviews. Variables assessed include the labels participants used to describe people whom they placed above and below themselves, descriptions of people with whom participants chose to associate, those they felt superior and inferior to, and those who evoked hostility, indifference, and sympathy. Negative and positive traits of coworkers, perceptions of the cultural traits most valued in their workplace, and child-rearing values were also assessed. Audio Data Availability Note: This study contains audio data that have been digitized. There are 452 audio files available.

  5. Most populated cities in the U.S. - median household income 2022

    • statista.com
    Updated Aug 30, 2024
    Cite
    Statista (2024). Most populated cities in the U.S. - median household income 2022 [Dataset]. https://www.statista.com/statistics/205609/median-household-income-in-the-top-20-most-populated-cities-in-the-us/
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    United States
    Description

    In 2022, San Francisco had the highest median household income of cities ranking within the top 25 in terms of population, with a median household income of 136,692 U.S. dollars. In that year, San Jose in California was ranked second, and Seattle, Washington third.

    Following a fall after the Great Recession, median household income in the United States has been increasing in recent years. As of 2022, median household income by state was highest in Maryland, Washington, D.C., Utah, and Massachusetts. It was lowest in Mississippi, West Virginia, and Arkansas. Families with an annual income of between 25,000 and 49,999 U.S. dollars made up the largest income bracket in America, with about 25.26 million households.

    Data on median household income can be compared to statistics on personal income in the U.S. released by the Bureau of Economic Analysis. Personal income rose to around 21.8 trillion U.S. dollars in 2022, the highest value recorded. Personal income is a measure of the total income received by persons from all sources, while median household income is “the amount which divides the income distribution into two equal groups,” according to the U.S. Census Bureau. Half of the population in question lives above median income and half lives below. Though total personal income has increased in recent years, this wealth is not distributed evenly throughout the population. In practical terms, income of most households has decreased. One additional statistic illustrates this disparity: for the lowest quintile of workers, mean household income has remained more or less steady for the past decade at about 13 to 16 thousand constant U.S. dollars annually. Meanwhile, income for the top five percent of workers has actually risen from about 285,000 U.S. dollars in 1990 to about 499,900 U.S. dollars in 2020.
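
    The difference between a median and a mean, which underlies the comparison above, can be illustrated with a small sketch. The five household incomes below are hypothetical values chosen for illustration, not figures from the dataset; the point is only that one very high income pulls the mean up while the median stays at the middle household.

```python
from statistics import mean, median

# Hypothetical household incomes in U.S. dollars (illustrative only).
incomes = [13_000, 25_000, 49_999, 70_000, 499_900]

# The median splits the sorted list into two equal halves.
median_income = median(incomes)

# The mean is pulled upward by the single very high income.
mean_income = mean(incomes)

print(f"median: {median_income}, mean: {mean_income}")
```

    Under these assumed values the mean is well above the median, which is why a rising total (or mean) income does not by itself show that the typical household is better off.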

  6. Limited Resources Sub-Index: TEPI Citywide Census Tracts

    • cotgis.hub.arcgis.com
    Updated Jul 2, 2024
    Cite
    City of Tucson (2024). Limited Resources Sub-Index: TEPI Citywide Census Tracts [Dataset]. https://cotgis.hub.arcgis.com/maps/cotgis::limited-resources-sub-index-tepi-citywide-census-tracts
    Explore at:
    Dataset updated
    Jul 2, 2024
    Dataset authored and provided by
    City of Tucson
    Area covered
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the layer's data dictionary.

    Note: This layer is symbolized to display the percentile distribution of the Limited Resources Sub-Index. However, it includes all data for each indicator and sub-index within the citywide census tracts TEPI.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset’s features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The Tucson Equity Priority Index (TEPI) examines the proportion of vulnerability per feature using 11 demographic indicators:

    • Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    • Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    • Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    • Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    • No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    • No Vehicle Access: Households without automobile, van, or truck access
    • High School Education or Less: Those whose highest level of educational attainment is a High School diploma, equivalency, or less
    • Limited English Ability: Those whose ability to speak English is "Less Than Well."
    • People of Color: Those who identify as anything other than Non-Hispanic White
    • Disability: Households with one or more physical or cognitive disabilities
    • Age: Groups that tend to have higher levels of vulnerability, including children (those below 18), and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 and 100.

    The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:

    • High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
    • Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    • Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
    • Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    • Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
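
    The combination steps described above (percentile normalization, equal-weight mean aggregation, five classes) can be sketched roughly as follows. This is an illustration only, not the City of Tucson's implementation: the exact percentile formula, the handling of class boundaries, the indicator names, and the four sample tracts are all assumptions made for the example.

```python
from statistics import mean

def percentile_normalize(values):
    """Rescale raw values to 0-100 by rank.

    Assumption: a value's score is the share of the *other* features
    with a strictly lower value. The source does not give the formula.
    """
    n = len(values)
    if n < 2:
        return [0.0 for _ in values]
    return [100.0 * sum(other < v for other in values) / (n - 1) for v in values]

def classify(score):
    """Map a 0-100 score to one of the five classes.

    Assumption: each boundary value belongs to the higher class,
    and 100 stays in "High".
    """
    labels = ["Low", "Low-Moderate", "Moderate", "Moderate-High", "High"]
    return labels[min(int(score // 20), 4)]

# Hypothetical raw indicator values for four census tracts.
raw = {
    "income_below_poverty": [5.0, 12.0, 20.0, 31.0],
    "unemployment": [3.0, 4.0, 9.0, 7.0],
}

# Step 1: put every indicator on the same 0-100 scale.
normalized = {name: percentile_normalize(vals) for name, vals in raw.items()}

# Step 2: equal-weight mean across indicators, one score per tract.
sub_index = [mean(tract_scores) for tract_scores in zip(*normalized.values())]

# Step 3: assign each tract to a class.
classes = [classify(score) for score in sub_index]
print(classes)
```

    Only one level of aggregation is shown here. Under the description above, the same equal-weight mean would be applied again to roll sub-indices up into the thematic indices and then into the overall TEPI.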

  7. Single-earner and dual-earner census families by number of children

    • www150.statcan.gc.ca
    • ouvert.canada.ca
    Updated Jun 27, 2024
    Cite
    Government of Canada, Statistics Canada (2024). Single-earner and dual-earner census families by number of children [Dataset]. http://doi.org/10.25318/1110002801-eng
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    Area covered
    Canada
    Description

    Families of tax filers; Single-earner and dual-earner census families by number of children (final T1 Family File; T1FF).

  8. Alibaba and China outlook

    • datahub.hku.hk
    txt
    Updated Jul 12, 2022
    Cite
    Pui Hei Un (2022). Alibaba and China outlook [Dataset]. http://doi.org/10.25442/hku.20277909.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    HKU Data Repository
    Authors
    Pui Hei Un
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    China boasts the fastest growing GDP of all developed nations. Neighboring regions will have the largest middle class in history. China is building transport infrastructure to take advantage. Companies that capture market share in this region will be the largest and best performing over the next decade.

    Macro Tailwinds

    1) China's GDP is the fastest growing of any major country, with expected growth of 5-6% over the next decade. If businesses (Alibaba, Tencent, etc.) maintain flat market share, that alone will drive 5-6% growth over the next decade. This is already higher than JP Morgan's expectation (from their 13F filings) that the US market will perform between -5% and +5% over this coming decade.

    2) The Southeast Asia Region contains about 5 billion people. China is constructing the One Belt One Road, which will be completed by 2030. This will grant its businesses access to the fastest and largest growing middle class in human history. Over the next 10+ years this region will be home to the largest middle class in history, potentially over 10x that of North America and Europe, based on stock price in Google Sheets.

    Increasing average Chinese income.

    Chinese average income has more than doubled over the last decade. Having sustained the least economic damage from the virus, this trend is expected to continue. At this pace the average Chinese citizen salary will be at 50% of the average US by 2030 (with stock price in Excel provided by Finsheet via Finnhub Stock Api), with the difference being there are 4x more Chinese. Thus a market potential of almost 2x the US over the next decade.

    The Southeast Asia Region now contains the largest total number of billionaires, and this number is expected to increase at an increasing rate as the region continues to develop.

    In 2013, North America was home to the largest number of billionaires. This reversed with Asia over the following 5 years. This separation is expected to continue at an increasing rate. Why does this matter? Over the next 10 years the largest trading route ever assembled will be completed, and China will be the primary provider of goods to 5b+ people.

    Companies that can easily access all customers in the world will perform best. This is good news for Apple, Microsoft, and Disney. Disney stock price in Excel right now is $70. But not for Amazon or Google which at first may sound contrary as the expectation is that Amazon "will take over the world". However one cannot do that without first conquering China. Firms like Alibaba and Tencent will have easy access to the global infrastructure being built by China in an attempt to speed up and ease trade in that region. The following guide shows how to get stock price in Excel.

    We will explore companies using a:

    1) Past

    2) Present (including financial statements)

    3) Future

    4) Story/Tailwind

    Method to find investing ideas in these regions. The tailwind is currently largest in the Asia region with 6%+ GDP growth according to the latest SEC form 4 from Edgar Company Search. This is relevant as investments in this region have a greater margin of safety; an investment in a company that maintains flat market share should increase about 6% per year because the market growth is so significant. In the next article I will explore Alibaba (NYSE: BABA), and why I recently purchased a large position during the recent Ant Financial Crisis.

  9. Data Confrontation Seminar, 1969: Comparative Socio-Political Data

    • icpsr.umich.edu
    ascii
    Updated Jan 12, 2006
    Cite
    Inter-university Consortium for Political and Social Research (2006). Data Confrontation Seminar, 1969: Comparative Socio-Political Data [Dataset]. http://doi.org/10.3886/ICPSR00038.v1
    Explore at:
    Available download formats: ascii
    Dataset updated
    Jan 12, 2006
    Dataset authored and provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/38/terms

    Time period covered
    1969
    Area covered
    Norway, India, Poland, Germany, Netherlands, Global, Sweden, Japan, France, Denmark
    Description

    This study contains selected electoral and demographic national data for nine nations in the 1950s and 1960s. The data were prepared for the Data Confrontation Seminar on the Use of Ecological Data in Comparative Cross-National Research held under the auspices of the Inter-university Consortium for Political and Social Research on April 1-18, 1969. One of the primary concerns of this international seminar was the need for cooperation in the development of data resources in order to facilitate exchange of data among individual scholars and research groups. Election returns for two or more national and/or local elections are provided for each of the nine nations, as well as ecological materials for at least two time points in the general period of the 1950s and 1960s. While each dataset was received at a single level of aggregation, the data have been further aggregated to at least a second level of aggregation. In most cases, the data can be supplied at the commune or municipality level and at the province or district level as well. Part 1 (Germany, Regierungsbezirke), Part 2 (Germany, Kreise), Part 3 (Germany, Lander), and Part 4 (Germany, Wahlkreise) contain data for all kreise, laender (states), administrative districts, and electoral districts for national elections in the period 1957-1969, and for state elections in the period 1946-1969, and ecological data from 1951 and 1961. Part 5 (France, Canton), and Part 6 (France, Departemente) contain data for the cantons and departements of two regions of France (West and Central) for the national elections of 1956, 1962, and 1967, and ecological data for the years 1954 and 1962. Data are provided for election returns for selected parties: Communist, Socialist, Radical, Federation de Gauche, and the Fifth Republic. Included are raw votes and percentage of total votes for each party. 
Ecological data provide information on total population, proportion of total population in rural areas, agriculture, industry, labor force, and middle class in 1954, as well as urbanization, crime rates, vital statistics, migration, housing, and the index of "comforts." Part 7 (Japan, Kanagawa Prefecture), Part 8 (Japan, House of Representatives Time Series), Part 9 (Japan, House of Councilors Time Series), and Part 10 (Japan, Prefecture) contain data for the 46 prefectures for 15 national elections between 1949 and 1968, including data for all communities in the prefecture of Kanagawa for 13 national elections, returns for 8 House of Representatives' elections, 7 House of Councilors' elections, descriptive data from 4 national censuses, and ecological data for 1950, 1955, 1960, and 1965. Data are provided for total number of electorate, voters, valid votes, and votes cast by such groups as the Jiyu, Minshu, Kokkyo, Minji, Shakai, Kyosan, and Mushozoku for the Communist, Socialist, Conservative, Komei, and Independent parties for all the 46 prefectures. Population characteristics include age, sex, employment, marriage and divorce rates, total number of live births, deaths, households, suicides, Shintoists, Buddhists, and Christians, and labor union members, news media subscriptions, savings rate, and population density. Part 11 (India, Administrative Districts) and Part 12 (India, State) contain data for all administrative districts and all states and union territories for the national and state elections in 1952, 1957, 1962, 1965, and 1967, the 1958 legislative election, and ecological data from the national censuses of 1951 and 1961. Data are provided for total number of votes cast for the Congress, Communist, Jan Sangh, Kisan Mazdoor Praja, Socialist, Republican, Regional, and other parties, contesting candidates, electorate, valid votes, and the percentage of valid votes cast. 
Also included are votes cast for the Rightist, Christian Democratic, Center, Socialist, and Communist parties in the 1958 legislative election. Ecological data include total population, urban population, sex distribution, occupation, economically active population, education, literate population, and number of Buddhists, Christians, Hindus, Jainis, Moslems, Sikhs, and other religious groups. Part 13 (Norway, Province), and Part 14 (Norway, Commune) consist of the returns for four national elections in 1949, 1953, 1957, and 1961, and descriptive data from two national censuses. Data are provided for the total number

  10. Source Code Tagger Training Set Dataset

    • paperswithcode.com
    Updated Aug 31, 2021
    Cite
    Christian D. Newman; Michael J. Decker; Reem S. AlSuhaibani; Anthony Peruma; Satyajit Mohapatra; Tejal Vishnoi; Marcos Zampieri; Mohamed W. Mkaouer; Timothy J. Sheldon; Emily Hill (2021). Source Code Tagger Training Set Dataset [Dataset]. https://paperswithcode.com/dataset/source-code-tagger-training-set
    Explore at:
    Dataset updated
    Aug 31, 2021
    Authors
    Christian D. Newman; Michael J. Decker; Reem S. AlSuhaibani; Anthony Peruma; Satyajit Mohapatra; Tejal Vishnoi; Marcos Zampieri; Mohamed W. Mkaouer; Timothy J. Sheldon; Emily Hill
    Description

    Ensemble Tagger Training and Testing Set

    This data includes two files: the training set used to create the SCANL Ensemble tagger [1] and the "unseen" testing set that includes words from systems that are not available in the training set. These are derived from a prior dataset of Grammar Patterns, described in a different paper [2]. Within each of these csv files, you'll find several columns. We explain these columns below:

    Type (only in training set) - Type (or return type) of the identifier to which current word belongs.

    Identifier - The full identifier from which the current word was split.

    Grammar Pattern - The sequence of part-of-speech tags generated by splitting the identifier into words and annotating with part-of-speech tags.

    Word - The current word; derived by splitting the corresponding identifier.

    SWUM annotation - The annotation that the SWUM POS tagger applied to a given word.

    POSSE annotation - The annotation that the POSSE POS tagger applied to a given word.

    Stanford annotation - The annotation that the Stanford POS tagger applied to a given word.

    Flair annotation - The annotation that the FLAIR POS tagger applied to a given word.

    Position - The position of a given word within its original identifier. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 3 and Handler is in position 4.

    Identifier size (max position) - The length, in words, of the identifier of which the word was originally part.

    Normalized position - We normalized the position metric described above such that the first word in the identifier is in position 1, all middle words are in position 2, and the last word is in position 3. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 2 and Handler is in position 3. The reason for this feature is to mitigate the sometimes-negative effect of very long identifiers [2].

    Context - The dataset contains five categories of identifier name: function, parameter, attribute, declaration, and class. We provide the category to which the given identifier belongs as one of the features to allow the ensemble to learn patterns that are more pervasive for certain identifier types versus others. For example, function identifiers contain verbs at a higher rate than other types of identifiers [2].

    Correct - The correct part-of-speech tag for the current word.

    System - System in which the current word was found.

    Identifier Code - Each identifier has a unique number. Each word that has the same number is a part of the same identifier. For example, you can concatenate each word with a code of 0 to recreate the original identifier.
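
    The Normalized position and Identifier Code columns described above can be illustrated with a short sketch. It assumes the identifier has already been split into words and that each row is available as a (identifier code, position, word) tuple; the function names and the row layout are hypothetical, and the treatment of a one-word identifier (position 1) is an assumption the description does not cover.

```python
from collections import defaultdict

def normalized_positions(words):
    """Return 1 for the first word, 2 for middle words, 3 for the last word."""
    last = len(words) - 1
    positions = []
    for index in range(len(words)):
        if index == 0:
            positions.append(1)  # first word (also used for one-word identifiers)
        elif index == last:
            positions.append(3)  # last word
        else:
            positions.append(2)  # every middle word
    return positions

def rebuild_identifiers(rows):
    """Concatenate the words that share an identifier code, ordered by position."""
    grouped = defaultdict(list)
    for code, position, word in rows:
        grouped[code].append((position, word))
    return {
        code: "".join(word for _, word in sorted(parts))
        for code, parts in grouped.items()
    }

# The example identifier from the column description, already split.
words = ["Get", "XML", "Reader", "Handler"]
print(normalized_positions(words))  # [1, 2, 2, 3]

# Rows may arrive in any order; the position column restores word order.
rows = [(0, 2, "XML"), (0, 1, "Get"), (0, 4, "Handler"), (0, 3, "Reader")]
print(rebuild_identifiers(rows))  # {0: 'GetXMLReaderHandler'}
```

    Plain concatenation would not restore separators such as underscores if the original identifier used them, so the rebuilt string should be treated as an approximation in those cases.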

    Context

    The numbers under the context feature represent the following categories (number -> category):

    1. attribute
    2. class
    3. declaration
    4. function
    5. parameter
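
    When loading the csv files, the numeric context codes can be decoded with a simple lookup. This is a minimal sketch; the names CONTEXT_CATEGORIES and context_name are hypothetical, and only the number-to-category pairs come from the description.

```python
# Number -> category mapping as listed in the dataset description.
CONTEXT_CATEGORIES = {
    1: "attribute",
    2: "class",
    3: "declaration",
    4: "function",
    5: "parameter",
}

def context_name(code):
    """Decode a context code; accepts ints or numeric strings read from a csv."""
    return CONTEXT_CATEGORIES[int(code)]

print(context_name(4))    # function
print(context_name("2"))  # class
```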

    Best Features

    We found [1] that the best features, of the features described above, were:

    1. SWUM
    2. POSSE
    3. Stanford
    4. Normalized position
    5. Context

    Tagset

    The tagset that we use is a subset of the Penn Treebank tagset. Each of our annotations and an example can be found below. Further examples and definitions can be found in the paper [1].

    Abbreviation  Expanded Form                            Examples
    N             noun                                     Disneyland, shoe, faucet, mother, bedroom
    DT            determiner                               the, this, that, these, those, which
    CJ            conjunction                              and, for, nor, but, or, yet, so
    P             preposition                              behind, in front of, at, under, beside, above, beneath, despite
    NPL           noun plural                              streets, cities, cars, people, lists, items, elements
    NM            noun modifier (adjective)                red, cold, hot, scary, beautiful, happy, faster, small
    NM            noun modifier (noun-adjunct italicized)  employeeName, filePath, fontSize, userId
    V             verb                                     run, jump, drive, spin
    VM            verb modifier (adverb)                   very, loudly, seriously, impatiently, badly
    PR            pronoun                                  she, he, her, him, it, we, us, they, them, I, me, you
    D             digit                                    1, 2, 10, 4.12, 0xAF
    PRE           preamble (e.g., Hungarian)               Gimp, GLEW, GL, G, p_, m_, b_
    Word of Caution: Flair and Stanford recognize a larger number of verb conjugations (e.g., VBZ, VBD) than the ensemble, POSSE, and SWUM. We left these conjugations in, in case someone wants to use them. If you are not interested in using these conjugations, you should normalize them to just V, in line with our tagset.
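    One way to collapse the conjugation tags is sketched below. The set lists the six Penn Treebank verb tags; whether all of them occur in the Flair and Stanford columns is an assumption, and the function name is hypothetical.

```python
# The six verb tags defined by the Penn Treebank tagset.
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def normalize_tag(tag):
    """Collapse any Penn Treebank verb conjugation tag to the plain V tag."""
    return "V" if tag in PENN_VERB_TAGS else tag


flair_tags = ["VBZ", "NM", "N", "VBD"]
normalized_tags = [normalize_tag(tag) for tag in flair_tags]
```

    Tags outside the verb set pass through unchanged, so the function can be applied to a whole annotation column.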

    Identifier Naming Structure Catalogue: We have put together a catalogue of identifier naming structures in source code. This catalogue explains a lot more about why this work is important, how we are using the ensemble tagger and why the tagset looks the way it does.

    The actual tagger implementation: You can find the tagger that was trained using this data here: https://github.com/SCANL/ensemble_tagger

    Please cite the paper!

    C. D. Newman, M. J. Decker, R. S. AlSuhaibani, A. Peruma, S. Mohapatra, T. Vishoi, M. Zampieri, M. W. Mkaouer, T. J. Sheldon, and E. Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.

    Christian D. Newman, Reem S. Alsuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill, On the generation, structure, and semantics of grammar patterns in source code identifiers, Journal of Systems and Software, 2020, 110740, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110740. (http://www.sciencedirect.com/science/article/pii/S0164121220301680)

    Interested in our research? Check out https://scanl.org/

  11. A Gold Standard Corpus for Activity Information (GoSCAI)

    • zenodo.org
    Updated May 30, 2025
    Cite
    Zenodo (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Dataset]. http://doi.org/10.5281/zenodo.15528545
    Dataset updated
    May 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Description

    A Gold Standard Corpus for Activity Information

    Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)

    Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department

    Dataset Version: 1.0 (May 16, 2025)

    Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545

    EXECUTIVE SUMMARY

    This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.

    CURATION RATIONALE

    This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.

    LANGUAGE VARIETIES

    Language Region: en-US

    Prose Description: English as written by native and bilingual English speakers in a clinical setting

    LANGUAGE USER DEMOGRAPHIC

    The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.

    ANNOTATOR DEMOGRAPHIC

    The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.

    LINGUISTIC SITUATION AND TEXT CHARACTERISTICS

    The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes was clinicians involved in the patients' care. The included notes come from nine disciplines: neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.

    PREPROCESSING AND DATA FORMATTING

    The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or UNICODE encoding, and therefore part of our pre-processing includes running encoding detection and transformation from encodings such as Windows-1252 or ISO-8859 format to our preferred format.
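    A very small sketch of such an encoding transformation is given below. It is not the curators' pipeline: the source does not name the detection tool, so this illustration simply tries UTF-8 first and falls back to Windows-1252, and the function name is hypothetical.

```python
def to_unicode_text(raw: bytes) -> str:
    """Decode raw note bytes, trying UTF-8 first and Windows-1252 second.

    Assumption: a try-and-fall-back strategy stands in for the real
    encoding detection step, which the source does not describe.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252; replace them rather than fail.
        return raw.decode("cp1252", errors="replace")


legacy_bytes = "caf\u00e9".encode("cp1252")   # bytes as a legacy system might store them
utf8_bytes = to_unicode_text(legacy_bytes).encode("utf-8")
```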

    On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.

    To promote density and diversity, we used five note characteristics as sampling criteria. The first was the text length, expressed in number of characters. The second was the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning, which provided the remaining three characteristics: a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).

    We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
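    The selection rules above can be sketched as follows. The dictionary keys and function names are hypothetical, and the way "highest annotation count and lowest text length" are combined into one ordering (count first, length as tie-breaker) is an assumption, since the source does not spell it out.

```python
MAX_NOTE_LENGTH = 10_000  # notes longer than this were excluded


def annotation_density(note):
    """Annotation count divided by text length in characters."""
    return note["annotation_count"] / note["text_length"]


def sample_group(notes, group):
    """Pick notes for one discipline group following the stated thresholds."""
    eligible = [
        n for n in notes
        if n["group"] == group and n["text_length"] <= MAX_NOTE_LENGTH
    ]
    if group == "SLP":
        # Relaxed criteria: at least 5 annotations in at least 2 domains,
        # ranked by annotation density; 50 notes.
        eligible = [n for n in eligible
                    if n["annotation_count"] >= 5 and n["domain_count"] >= 2]
        eligible.sort(key=annotation_density, reverse=True)
        return eligible[:50]
    # Stricter criteria: at least 15 annotations in at least 3 domains; 90 notes.
    eligible = [n for n in eligible
                if n["annotation_count"] >= 15 and n["domain_count"] >= 3]
    # Assumed ordering: highest count first, shortest text breaks ties.
    eligible.sort(key=lambda n: (-n["annotation_count"], n["text_length"]))
    return eligible[:90]


notes = [
    {"group": "SLP", "text_length": 1000, "annotation_count": 5, "domain_count": 2},
    {"group": "SLP", "text_length": 500, "annotation_count": 5, "domain_count": 2},
    {"group": "SLP", "text_length": 500, "annotation_count": 4, "domain_count": 2},
    {"group": "PT", "text_length": 2000, "annotation_count": 20, "domain_count": 3},
    {"group": "PT", "text_length": 12000, "annotation_count": 40, "domain_count": 4},
]
slp_sample = sample_group(notes, "SLP")
pt_sample = sample_group(notes, "PT")
```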

    The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content, preserving readability and enough context for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and to correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.

    All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.

    CAPTURE QUALITY

    As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.

    LIMITATIONS

    Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.

    METADATA

    Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:

    - Communication & Cognition (https://zenodo.org/records/13910167)

    - Mobility (https://zenodo.org/records/11074838)

    - Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)

    - Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)

    Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.

    The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.

    Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
    Communication & Cognition | 6033 | 17.2% |

  12. Tucson Equity Priority Index (TEPI): Ward 2 Census Block Groups

    • teds.tucsonaz.gov
    Updated Feb 4, 2025
    + more versions
    Cite
    City of Tucson (2025). Tucson Equity Priority Index (TEPI): Ward 2 Census Block Groups [Dataset]. https://teds.tucsonaz.gov/maps/cotgis::tucson-equity-priority-index-tepi-ward-2-census-block-groups
    Dataset updated
    Feb 4, 2025
    Dataset authored and provided by
    City of Tucson
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the Data Dictionary.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), and High (80-100). Each class represents 20% of the dataset’s features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The Tucson Equity Priority Index (TEPI) examines the proportion of vulnerability per feature using 11 demographic indicators:

    • Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    • Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    • Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    • Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    • No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    • No Vehicle Access: Households without automobile, van, or truck access
    • High School Education or Less: Those whose highest level of educational attainment is a High School diploma, equivalency, or less
    • Limited English Ability: Those whose ability to speak English is "Less Than Well"
    • People of Color: Those who identify as anything other than Non-Hispanic White
    • Disability: Households with one or more physical or cognitive disabilities
    • Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 and 100.

    The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:

    • High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
    • Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    • Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
    • Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    • Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
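    The combination steps described above (percentile normalization, equal-weight mean aggregation, five classes) can be illustrated with a small sketch. It is not the City of Tucson's implementation: the percentile formula, the handling of class boundaries, the indicator grouping and all names and values in the example are assumptions made for illustration.

```python
from statistics import mean


def percentile_scale(values):
    """Rescale values to 0-100 by rank.

    Assumed formula: 100 * (number of strictly smaller values) / (n - 1).
    """
    n = len(values)
    if n < 2:
        return [0.0 for _ in values]
    return [100.0 * sum(other < v for other in values) / (n - 1) for v in values]


def overall_index(indicators, structure):
    """Equal-weight means: indicators -> sub-indices -> thematic indices -> index.

    `indicators` maps an indicator name to one value per feature;
    `structure` maps thematic index -> sub-index -> list of indicator names.
    """
    scaled = {name: percentile_scale(column) for name, column in indicators.items()}
    feature_count = len(next(iter(scaled.values())))
    scores = []
    for i in range(feature_count):
        thematic_scores = []
        for sub_indices in structure.values():
            sub_scores = [mean(scaled[name][i] for name in names)
                          for names in sub_indices.values()]
            thematic_scores.append(mean(sub_scores))
        scores.append(mean(thematic_scores))
    return scores


def vulnerability_class(score):
    """Assign one of the five classes; boundary handling is an assumption."""
    if score < 20:
        return "Low"
    if score < 40:
        return "Low-Moderate"
    if score < 60:
        return "Moderate"
    if score < 80:
        return "Moderate-High"
    return "High"


# Hypothetical values for three block groups and a reduced set of indicators.
indicators = {
    "poverty": [10, 20, 30],
    "unemployment": [5, 5, 15],
    "disability": [8, 2, 4],
}
structure = {
    "Economic": {"income": ["poverty", "unemployment"]},
    "Personal Characteristics": {"health": ["disability"]},
}
scores = overall_index(indicators, structure)
classes = [vulnerability_class(s) for s in scores]
```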

  13. GSM8K Dataset

    • paperswithcode.com
    • tensorflow.org
    • +2more
    Updated Dec 31, 2024
    Cite
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
    Dataset updated
    Dec 31, 2024
    Authors
    Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
    Description

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem. The dataset can be used to study multi-step mathematical reasoning.

  14. Infrastructure Climate Resilience Assessment Data Starter Kit for Nepal

    • zenodo.org
    zip
    Updated Mar 8, 2024
    Cite
    Tom Russell; Diana Jaramillo; Chris Nicholas; Fred Thomas; Raghav Pant; Jim W. Hall (2024). Infrastructure Climate Resilience Assessment Data Starter Kit for Nepal [Dataset]. http://doi.org/10.5281/zenodo.10796765
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tom Russell; Diana Jaramillo; Chris Nicholas; Fred Thomas; Raghav Pant; Jim W. Hall
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This starter data kit collects extracts from global, open datasets relating to climate hazards and infrastructure systems.

    These extracts are derived from global datasets which have been clipped to the national scale (or subnational, in cases where national boundaries have been split, generally to separate outlying islands or non-contiguous regions), using Natural Earth (2023) boundaries. The clipping is not meant to express an opinion about borders, territory or sovereignty.

    Human-induced climate change is increasing the frequency and severity of climate and weather extremes. This is causing widespread, adverse impacts to societies, economies and infrastructures. Climate risk analysis is essential to inform policy decisions aimed at reducing risk. Yet, access to data is often a barrier, particularly in low and middle-income countries. Data are often scattered, hard to find, in formats that are difficult to use, or require considerable technical expertise. Nevertheless, there are global, open datasets which provide some information about climate hazards, society, infrastructure and the economy. This "data starter kit" aims to kickstart the process and act as a starting point for further model development and scenario analysis.

    Hazards:

    • coastal and river flooding (Ward et al, 2020)
    • extreme heat and drought (Russell et al 2023, derived from Lange et al, 2020)
    • tropical cyclone wind speeds (Russell 2022, derived from Bloemendaal et al 2020 and Bloemendaal et al 2022)

    Exposure:

    • population (Schiavina et al, 2023)
    • built-up area (Pesaresi et al, 2023)
    • roads (OpenStreetMap, 2023)
    • railways (OpenStreetMap, 2023)
    • power plants (Global Energy Observatory et al, 2018)
    • power transmission lines (Arderne et al, 2020)

    The spatial intersection of hazard and exposure datasets is a first step to analyse vulnerability and risk to infrastructure and people.
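    As a toy illustration of such an intersection, the sketch below overlays a flood-depth grid on a population grid and sums the people in cells at or above a depth threshold. The grids, the threshold and the function name are hypothetical, and both layers are assumed to be already aligned to the same cells; real analyses would read the packaged datasets with geospatial libraries instead.

```python
def exposed_population(flood_depth_m, population, threshold_m=0.5):
    """Sum the population in grid cells where flood depth meets the threshold.

    Both arguments are lists of rows covering the same cells (an assumption);
    a depth of None marks a cell with no hazard data.
    """
    total = 0.0
    for depth_row, population_row in zip(flood_depth_m, population):
        for depth, people in zip(depth_row, population_row):
            if depth is not None and depth >= threshold_m:
                total += people
    return total


# Hypothetical 2 x 2 grids: flood depth in metres and people per cell.
flood_depth_m = [[0.0, 0.6],
                 [1.2, None]]
population = [[100, 50],
              [25, 300]]
exposed = exposed_population(flood_depth_m, population)
```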

    To learn more about related concepts, there is a free short course available through the Open University on Infrastructure and Climate Resilience. This overview of the course has more details.

    These Python libraries may be a useful place to start analysis of the data in the packages produced by this workflow:

    • snkit helps clean network data
    • nismod-snail is designed to help implement infrastructure exposure, damage and risk calculations

    The open-gira repository contains a larger workflow for global-scale open-data infrastructure risk and resilience analysis.

    For a more developed example, some of these datasets were key inputs to a regional climate risk assessment of current and future flooding risks to transport networks in East Africa, which has a related online visualisation tool at https://east-africa.infrastructureresilience.org/ and is described in detail in Hickford et al (2023).

    References

    • Arderne, Christopher, Nicolas, Claire, Zorn, Conrad, & Koks, Elco E. (2020). Data from: Predictive mapping of the global power system using open data [Dataset]. In Nature Scientific Data (1.1.1, Vol. 7, Number Article 19). Zenodo. DOI: 10.5281/zenodo.3628142
    • Bloemendaal, Nadia; de Moel, H. (Hans); Muis, S; Haigh, I.D. (Ivan); Aerts, J.C.J.H. (Jeroen) (2020): STORM tropical cyclone wind speed return periods. 4TU.ResearchData. [Dataset]. DOI: 10.4121/12705164.v3
    • Bloemendaal, Nadia; de Moel, Hans; Dullaart, Job; Haarsma, R.J. (Reindert); Haigh, I.D. (Ivan); Martinez, Andrew B.; et al. (2022): STORM climate change tropical cyclone wind speed return periods. 4TU.ResearchData. [Dataset]. DOI: 10.4121/14510817.v3
    • Global Energy Observatory, Google, KTH Royal Institute of Technology in Stockholm, Enipedia, World Resources Institute. (2018) Global Power Plant Database. Published on Resource Watch and Google Earth Engine; resourcewatch.org/
    • Hickford et al (2023) Decision support systems for resilient strategic transport networks in low-income countries – Final Report. Available online: https://transport-links.com/hvt-publications/final-report-decision-support-systems-for-resilient-strategic-transport-networks-in-low-income-countries
    • Lange, S., Volkholz, J., Geiger, T., Zhao, F., Vega, I., Veldkamp, T., et al. (2020). Projecting exposure to extreme climate impact events across six event categories and three spatial scales. Earth's Future, 8, e2020EF001616. DOI: 10.1029/2020EF001616
    • Natural Earth (2023) Admin 0 Map Units, v5.1.1. [Dataset] Available online: www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-details
    • OpenStreetMap contributors, Russell T., Thomas F., nismod/datapkg contributors (2023) Road and Rail networks derived from OpenStreetMap. [Dataset] Available at global.infrastructureresilience.org
    • Pesaresi M., Politis P. (2023): GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) European Commission, Joint Research Centre (JRC) PID: data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea, doi:10.2905/9F06F36F-4B11-47EC-ABB0-4F8B7B1D72EA
    • Russell, T., Nicholas, C., & Bernhofen, M. (2023). Annual probability of extreme heat and drought events, derived from Lange et al 2020 (Version 2) [Dataset]. Zenodo. DOI: 10.5281/zenodo.8147088
    • Schiavina M., Freire S., Carioli A., MacManus K. (2023): GHS-POP R2023A - GHS population grid multitemporal (1975-2030). European Commission, Joint Research Centre (JRC) PID: data.europa.eu/89h/2ff68a52-5b5b-4a22-8f40-c41da8332cfe, doi:10.2905/2FF68A52-5B5B-4A22-8F40-C41DA8332CFE
    • Ward, P.J., H.C. Winsemius, S. Kuzma, M.F.P. Bierkens, A. Bouwman, H. de Moel, A. Díaz Loaiza, et al. (2020) Aqueduct Floods Methodology. Technical Note. Washington, D.C.: World Resources Institute. Available online at: www.wri.org/publication/aqueduct-floods-methodology.
  15. Code smells and quality attributes dataset

    • figshare.com
    zip
    Updated Nov 3, 2024
    Cite
    Ehsan Esmaili; Morteza Zakeri; Saeed Parsa (2024). Code smells and quality attributes dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24057336.v2
    Available download formats: zip
    Dataset updated
    Nov 3, 2024
    Dataset provided by
    figshare
    Authors
    Ehsan Esmaili; Morteza Zakeri; Saeed Parsa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1 Code smell dataset

    To create a high-quality code smell dataset, we merged five different datasets. These datasets are among the largest and most accurate in our paper “Predicting Code Quality Attributes Based on Code Smells”. Various software projects were analyzed automatically and manually to collect these labels. Table 1 shows the dataset details.

    Table 1. Merged datasets and their characteristics.

    Dataset | Samples | Projects | Code smells
    Palomba (2018) [1] | 40888 | 395 versions of 30 open-source projects | Large class, complex class, class data should be private, inappropriate intimacy, lazy class, middle man, refused bequest, spaghetti code, speculative generality, comments, long method, long parameter list, feature envy, message chains
    Madeyski [2] | 3291 | 523 open-source and industrial projects | Blob, data class
    Khomh [3] | _ | 54 versions of 4 open-source projects | Anti-singleton, swiss army knife
    Pecorelli [4] | 341 | 9 open-source projects | Blob
    Palomba (2017) [5] | _ | 6 open-source projects | Dispersed coupling, shotgun surgery

    Code smell datasets have been prepared at two levels: class and method. The class level has 15 different smells as labels and 81 software metrics as features. The method level has five smells and 31 metrics. This dataset contains samples of Java classes and methods. A sample can be identified by its longname, which contains the project-name, package-name, JavaFile-name, class-name, and method-name. The quantity of each smell ranges from 40 to 11000. The total number of samples is 37517, while the number of non-smells is nearly 3 million. As a result, our dataset is the largest in the study. You can see the details in Table 2.

    Table 2. The number of smells and non-smells at class and method levels.

    Level | Metrics | Smell | Samples | Total
    Class | 81 | Complex class | 1265 | 23438
    | | Class data should be private | 1839 |
    | | Inappropriate intimacy | 780 |
    | | Large class | 990 |
    | | Lazy class | 774 |
    | | Middle man | 193 |
    | | Refused bequest | 1985 |
    | | Spaghetti code | 3203 |
    | | Speculative generality | 2723 |
    | | Blob | 988 |
    | | Data class | 938 |
    | | Anti-singleton | 2993 |
    | | Swiss army knife | 4601 |
    | | Dispersed coupling | 41 |
    | | Shotgun surgery | 125 |
    | | Non-smell | 40506 [3] + 8334 [5] + 296854 [1] + 43862 [2] + 55214 [4] | 444770
    Method | 31 | Comments | 107 | 14079
    | | Feature envy | 525 |
    | | Long method | 11366 |
    | | Long parameter list | 1983 |
    | | Message chains | 98 |
    | | Non-smell | 2469176 | 2469176

    2 Quality dataset

    This dataset contains over 1000 Java project instances where, for each instance, the relative frequency of 20 code smells has been extracted along with the value of eight software quality attributes. The code quality dataset contains 20 smells as features and 8 quality attributes as labels: Coverageability, extendability, effectiveness, flexibility, functionality, reusability, testability, and understandability. The samples are Java projects identified by their name and version. Features are the ratio of smelly and non-smelly classes or methods in a software project. The quality attributes are a normalized score calculated by QMOOD metrics [6] and models extracted by [7], [8]. 1014 samples of small and large open-source and industrial projects are included in this dataset.

    The data samples are used to train machine learning models predicting software quality attributes based on code smells.

    References

    [1] F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia, “A large-scale empirical study on the lifecycle of code smell co-occurrences,” Inf Softw Technol, vol. 99, pp. 1–10, Jul. 2018, doi: 10.1016/J.INFSOF.2018.02.004.
    [2] L. Madeyski and T. Lewowski, “MLCQ: Industry-Relevant Code Smell Data Set,” in ACM International Conference Proceeding Series, Association for Computing Machinery, Apr. 2020, pp. 342–347. doi: 10.1145/3383219.3383264.
    [3] F. Khomh, M. Di Penta, Y. G. Guéhéneuc, and G. Antoniol, “An exploratory study of the impact of antipatterns on class change- and fault-proneness,” Empir Softw Eng, vol. 17, no. 3, pp. 243–275, Jun. 2012, doi: 10.1007/s10664-011-9171-y.
    [4] F. Pecorelli, F. Palomba, F. Khomh, and A. De Lucia, “Developer-Driven Code Smell Prioritization,” Proceedings - 2020 IEEE/ACM 17th International Conference on Mining Software Repositories, MSR 2020, pp. 220–231, 2020, doi: 10.1145/3379597.3387457.
    [5] F. Palomba, M. Zanoni, F. A. Fontana, A. De Lucia, and R. Oliveto, “Smells like teen spirit: Improving bug prediction performance using the intensity of code smells,” in Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, Institute of Electrical and Electronics Engineers Inc., Jan. 2017, pp. 244–255. doi: 10.1109/ICSME.2016.27.
    [6] J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE Transactions on Software Engineering, vol. 28, no. 1, pp. 4–17, Jan. 2002, doi: 10.1109/32.979986.
    [7] M. Zakeri-Nasrabadi and S. Parsa, “Learning to predict test effectiveness,” International Journal of Intelligent Systems, 2021, doi: 10.1002/INT.22722.
    [8] M. Zakeri-Nasrabadi and S. Parsa, “Testability Prediction Dataset,” Mar. 2021, doi: 10.5281/ZENODO.4650228.
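    The smell-frequency features of the quality dataset could be derived along the lines of the sketch below. It is an illustration only: the input layout and function name are hypothetical, and the feature is computed here as the share of classes carrying a smell (smelly classes divided by all classes), which is one reading of the description and an assumption.

```python
def smell_frequencies(class_smells, smell_names):
    """Relative frequency of each smell across the classes of one project.

    `class_smells` holds one set of smell labels per class (an empty set
    means the class is non-smelly). Frequency = smelly classes / all classes,
    which is an assumed interpretation of the dataset description.
    """
    total = len(class_smells)
    if total == 0:
        return {smell: 0.0 for smell in smell_names}
    return {
        smell: sum(smell in labels for labels in class_smells) / total
        for smell in smell_names
    }


# Hypothetical project with four classes.
class_smells = [{"Blob"}, set(), {"Blob", "Data class"}, set()]
features = smell_frequencies(class_smells, ["Blob", "Data class", "Lazy class"])
```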

  16. KU-BdSL: Khulna University Bengali Sign Language dataset

    • data.mendeley.com
    Updated Mar 2, 2023
    + more versions
    Cite
    Abdullah Al Jaid Jim (2023). KU-BdSL: Khulna University Bengali Sign Language dataset [Dataset]. http://doi.org/10.17632/scpvm2nbkm.2
    Dataset updated
    Mar 2, 2023
    Authors
    Abdullah Al Jaid Jim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Khulna
    Description

    KU-BdSL is a Bengali sign language dataset that includes three variants of the data: (i) Uni-scale Sign Language Dataset (USLD), (ii) Multi-scale Sign Language Dataset (MSLD), and (iii) Annotated Multi-scale Sign Language Dataset (AMSLD). The dataset consists of images representing single-hand gestures for BdSL alphabets. Several smartphones were used to capture images from 34 participants (26 males and 8 females). The 34 volunteers involved in creating the dataset were not offered any financial benefit. Each variant includes 30 classes that represent the 38 consonants ('shoroborno') of the Bengali alphabet. Each variant contains a total of 1,500 images in jpg format. The images were captured on flat surfaces at different times of the day to vary the brightness and contrast. For USLD and MSLD, class names are the Unicode values corresponding to the Bengali letters.

    Folder Names:
    2433 -> ‘Chandra Bindu’
    2434 -> ‘Anusshar’
    2435 -> ‘Bisharga’
    2453 -> ‘Ka’
    2454 -> ‘Kha’
    2455 -> ‘Ga’
    2456 -> ‘Gha’
    2457 -> ‘Uo’
    2458 -> ‘Ca’
    2459 -> ‘Cha’
    2460-2479 -> ‘Borgio Ja/Anta Ja’
    2461 -> ‘Jha’
    2462 -> ‘Yo’
    2463 -> ‘Ta’
    2464 -> ‘Tha’
    2465 -> ‘Da’
    2466 -> ‘Dha’
    2467-2472 -> ‘Murdha Na/Donto Na’
    2468-2510 -> ‘ta/Khanda ta’
    2469 -> ‘tha’
    2470 -> ‘da’
    2471 -> ‘dha’
    2474 -> ‘pa’
    2475 -> ‘fa’
    2476-2477 -> ‘Ba/Bha’
    2478 -> ‘Ma’
    2480-2524-2525 -> ‘Ba-y Ra/Da-y Ra/Dha-y Ra’
    2482 -> ‘La’
    2486-2488-2487 -> ‘Talobbo sha/Danta sa/Murdha Sha’
    2489 -> ‘Ha’
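The folder names above appear to be decimal Unicode code points (for example, 2453 is U+0995), with hyphens joining the code points of letters that share one sign class. Under that assumption, a minimal Python sketch for turning a folder name back into Bengali characters could look like this; the function name `folder_to_chars` is hypothetical and not part of the dataset.

```python
def folder_to_chars(folder_name: str) -> list[str]:
    """Convert a folder name such as '2453' or '2460-2479' to characters.

    Assumption: each hyphen-separated part is a decimal Unicode code point.
    """
    return [chr(int(part)) for part in folder_name.split("-")]


if __name__ == "__main__":
    # A few folder names copied from the list above.
    for name in ["2433", "2453", "2460-2479", "2486-2488-2487"]:
        chars = folder_to_chars(name)
        labels = " / ".join(f"{c} (U+{ord(c):04X})" for c in chars)
        print(f"{name} -> {labels}")
```

Keeping the result as a list preserves the grouping of merged classes, so a single folder can still be mapped to every letter it stands for.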

    USLD: All USLD images have a uniform size of 512*512 pixels. In most images, the hand is positioned in the middle of the frame.

    MSLD: MSLD stores the raw images so that researchers can make their own changes to the dataset. Because various smartphones were used, the image sizes vary widely.

    AMSLD: AMSLD contains multi-scale annotated data, which is suitable for tasks such as localization and classification. Of the many available annotation formats, the YOLO DarkNet format was selected. Each image has an annotation text file containing five numbers separated by white space. The first number is an integer, and the rest are floating-point numbers. The first number indicates the class ID corresponding to the label of that image. Class IDs are mapped in a separate text file named 'obj.names'. The second and third values are the beginning normalized coordinates, while the fourth and fifth define the bounding box's normalized width and height.
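The AMSLD annotation layout described above can be read with a few lines of Python. This is a sketch under stated assumptions, not the dataset authors' tooling: the sample line, the names `YoloBox`, `parse_yolo_line` and `to_pixel_corners` are all hypothetical, and the conversion assumes the usual Darknet convention in which the second and third values are the normalized centre of the box. If the files instead store a corner, the conversion step needs adjusting.

```python
from typing import NamedTuple


class YoloBox(NamedTuple):
    class_id: int   # index into obj.names
    x: float        # normalized x coordinate (assumed: box centre)
    y: float        # normalized y coordinate (assumed: box centre)
    width: float    # normalized box width
    height: float   # normalized box height


def parse_yolo_line(line: str) -> YoloBox:
    """Parse one annotation line: an integer followed by four floats."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 values, got {len(parts)}: {line!r}")
    return YoloBox(int(parts[0]), *map(float, parts[1:]))


def to_pixel_corners(box: YoloBox, img_w: int, img_h: int):
    """Return (left, top, right, bottom) in pixels, assuming centre coords."""
    cx, cy = box.x * img_w, box.y * img_h
    w, h = box.width * img_w, box.height * img_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


if __name__ == "__main__":
    box = parse_yolo_line("3 0.5 0.5 0.25 0.5")  # hypothetical sample line
    print(box)
    print(to_pixel_corners(box, 512, 512))
```

Because the coordinates are normalized, the same annotation works for any image size, which matters for the multi-scale variants where sizes differ between phones.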

    This dataset is supported by the Research and Innovation Center, Khulna University, Khulna-9208, Bangladesh, and all of its data is free to download, modify, and use. The previous version (Version 1) of this dataset was collected with the oral permission of the volunteers, whereas this version has the written consent of the participants. Therefore, we encourage researchers to use this version (Version 2) for research purposes.

  17. The ground-based MUSICA dataset: Tropospheric water vapour isotopologues...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Sabine Barthlott; Matthias Schneider; Frank Hase; Thomas Blumenstock; Gizaw Mengistu Tsidu; Michel Grutter de la Mora; Kim Strong; Justus Notholt; Emmanuel Mahieu; Nicholas Jones; Dan Smale (2020). The ground-based MUSICA dataset: Tropospheric water vapour isotopologues (H₂¹⁶O, H₂¹⁸O and HD¹⁶O) as obtained from NDACC/FTIR solar absorption spectra [Dataset]. http://doi.org/10.5281/zenodo.48902
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sabine Barthlott; Matthias Schneider; Frank Hase; Thomas Blumenstock; Gizaw Mengistu Tsidu; Michel Grutter de la Mora; Kim Strong; Justus Notholt; Emmanuel Mahieu; Nicholas Jones; Dan Smale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MUSICA (“MUlti-platform remote sensing of Isotopologues for investigating the Cycle of Atmospheric water”, http://www.imk-asf.kit.edu/english/musica.php) is a European Research Council (ERC) project. The project has developed tropospheric water vapour isotopologue retrievals (H2O and H2O-δD pairs) using ground-based FTIR spectra as well as thermal nadir spectra measured by the satellite sensor IASI. H2O-δD pairs allow studying tropospheric water transport pathways and in combination with models they can improve our understanding of important climate feedback mechanisms (see also WCRP Grand Challenges: http://www.wcrp-climate.org/grand-challenges).
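The description does not spell out what δD means. As a hedged reminder, the δ-notation commonly used for water vapour isotopologue ratios expresses the HD¹⁶O/H₂¹⁶O ratio relative to a reference standard, in per mil. The reference value below is the Standard Mean Ocean Water ratio as usually quoted; it is an assumption here and is not stated in this dataset record.

```latex
% delta-D in per mil: relative deviation of the HD16O/H2-16O ratio
% from the SMOW reference ratio (value assumed, not from this record).
\[
  \delta D \;=\; 1000 \times
  \left(
    \frac{[\mathrm{HD^{16}O}] \,/\, [\mathrm{H_2^{16}O}]}{R_{\mathrm{SMOW}}} - 1
  \right),
  \qquad
  R_{\mathrm{SMOW}} \approx 3.1152 \times 10^{-4}
\]
```

A value of δD = −200 per mil, for instance, means the sample's HD¹⁶O/H₂¹⁶O ratio is 20% below the reference ratio.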

    For MUSICA, the FTIR spectra have been analysed centrally at KIT using uniform and consistent retrieval settings, thereby guaranteeing ultimate consistency of the retrieval products generated for different FTIR stations. The FTIR products are H2O profiles for the lower, middle and upper troposphere as well as H2O-δD pairs for the lower and middle troposphere. The data have been produced for 12 FTIR stations and date back to 1996.

    The dataset has been extensively characterized and validated (theoretically and empirically). Furthermore, the spectra have been used to perform uniform retrievals of XCO2, which is then used for documenting the long-term stability of this kind of FTIR data. The data are provided in the form of two data types. The first type ("ftir.iso.h2o") is best suited for tropospheric water vapour distribution studies that disregard the different isotopologues (comparison with radiosonde data, analyses of water vapour variability and trends, etc.). The second type ("ftir.iso.post.h2o") is needed for analysing moisture pathways by means of H2O-δD pair distribution.

    The data format is HDF4 and the files have been generated in compliance with GEOMS (Generic Earth Observation Metadata Standard). The complete MUSICA NDACC/FTIR dataset is also publicly available via the NDACC database (ftp://ftp.cpc.ncep.noaa.gov/ndacc/MUSICA).

    Details on the characteristics of the dataset are described in the paper "Tropospheric water vapour isotopologue data (H₂¹⁶O, H₂¹⁸O and HD¹⁶O) as obtained from NDACC/FTIR solar absorption spectra", which was prepared for ESSD in the context of the special issue “25th anniversary of NDACC”.

  18. Data from: The association between disability and all-cause mortality in...

    • renedh-site-primario-mdhc.hub.arcgis.com
    Updated Aug 13, 2024
    Cite
    Ministério dos Direitos Humanos e da Cidadania (2024). The association between disability and all-cause mortality in low-income and middle-income countries: a systematic review and meta-analysis [Dataset]. https://renedh-site-primario-mdhc.hub.arcgis.com/datasets/the-association-between-disability-and-all-cause-mortality-in-low-income-and-middle-income-countries-a-systematic-review-and-meta-analysis
    Dataset updated
    Aug 13, 2024
    Dataset authored and provided by
    Ministério dos Direitos Humanos e da Cidadania
    Description

    Summary

    Background: There are 1·3 billion people with disabilities globally. On average, they have poorer health than their nondisabled peers, but the extent of increased risk of premature mortality is unknown. We aimed to systematically review the association between disability and mortality in low-income and middle-income countries (LMICs).

    Methods: We searched MEDLINE, Global Health, PsycINFO, and EMBASE from Jan 1, 1990 to Nov 14, 2022. Longitudinal epidemiological studies in any language with a comparator group that measured the association between disability and all-cause mortality in people of any age were eligible for inclusion. Two reviewers independently assessed study eligibility, extracted data, and assessed risk of bias. We used a random-effects meta-analysis to calculate the pooled hazard ratio (HR) for all-cause mortality by disability status. We then conducted meta-analyses separately for different impairment and age groups.

    Findings: We identified 6146 unique articles, of which 70 studies (81 cohorts) were included in the systematic review, from 22 countries. There was variability in the methods used to assess and report disability and mortality. The meta-analysis included 54 studies, representing 62 cohorts (comprising 270 571 people with disabilities). Pooled HRs for all-cause mortality were 2·02 (95% CI 1·77–2·30) for people with disabilities versus those without disabilities, with high heterogeneity between studies (τ²=0·23, I²=98%). This association varied by impairment type: from 1·36 (1·17–1·57) for visual impairment to 3·95 (1·60–9·74) for multiple impairments. The association was highest for children younger than 18 years (4·46 [3·01–6·59]) and lower in people aged 15–49 years (2·45 [1·21–4·97]) and people older than 60 years (1·97 [1·65–2·36]).

    Interpretation: People with disabilities had a two-fold higher mortality rate than people without disabilities in LMICs. Interventions are needed to improve the health of people with disabilities and reduce their higher mortality rate.

    Funding: UK National Institute for Health and Care Research; and UK Foreign, Commonwealth and Development Office.
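To make the pooling step in the summary more concrete, here is a minimal Python sketch of a random-effects meta-analysis of hazard ratios. It assumes the DerSimonian-Laird estimator for τ², which the summary does not name, and it recovers each study's standard error from its 95% CI on the log scale. The function names and the input numbers in the usage section are hypothetical illustrations, not data from the review.

```python
import math

Z_95 = 1.959963984540054  # 97.5th percentile of the standard normal


def log_hr_and_variance(hr, ci_low, ci_high):
    """Return (log HR, variance), deriving the SE from a 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return math.log(hr), se ** 2


def pool_random_effects(studies):
    """Pool (hr, ci_low, ci_high) tuples with DerSimonian-Laird weights."""
    ys, vs = zip(*(log_hr_and_variance(*s) for s in studies))
    w = [1 / v for v in vs]                      # fixed-effect weights
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(ys) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1 / (v + tau2) for v in vs]          # random-effects weights
    sw_re = sum(w_re)
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sw_re
    se_re = math.sqrt(1 / sw_re)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0    # I^2 as a fraction
    return {
        "hr": math.exp(y_re),
        "ci_low": math.exp(y_re - Z_95 * se_re),
        "ci_high": math.exp(y_re + Z_95 * se_re),
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical studies: (HR, lower 95% limit, upper 95% limit).
    toy = [(1.4, 1.1, 1.8), (2.6, 2.0, 3.4), (1.9, 1.2, 3.0)]
    print(pool_random_effects(toy))
```

Pooling on the log scale keeps the confidence interval symmetric around the estimate before it is transformed back to a hazard ratio.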

  19. X Ray Id Zfisb Fsod Kxxa Dataset

    • universe.roboflow.com
    zip
    Updated May 15, 2025
    Cite
    roboflow 20 VL FSOD (2025). X Ray Id Zfisb Fsod Kxxa Dataset [Dataset]. https://universe.roboflow.com/roboflow-20-vl-fsod-fa5i3/x-ray-id-zfisb-fsod-kxxa/dataset/1
    Available download formats: zip
    Dataset updated
    May 15, 2025
    Dataset provided by
    Roboflow: https://roboflow.com/
    Authors
    roboflow 20 VL FSOD
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    X Ray Id Zfisb Fsod Kxxa Kxxa Bounding Boxes
    Description

    Overview

    Introduction

    This dataset is designed for object detection within X-ray images, focusing on identifying specific parts of the hand and forearm. The goal is to annotate these regions to assist in medical analysis. The classes included in this dataset are:

    - Hand-Parts: General category for parts of the hand not specifically categorized under other class labels.
    - DIP: Distal Interphalangeal joints, the joints closest to the fingertips.
    - MCP: Metacarpophalangeal joints, the joints where fingers meet the hand.
    - PIP: Proximal Interphalangeal joints, the middle joints of the fingers.
    - Radius: One of the two large bones in the forearm, located on the thumb side.
    - Ulna: The second of the two large bones in the forearm, opposite to the radius.
    - Wrist: The area comprising the carpal bones connecting the hand to the forearm.

    Object Classes

    Hand-Parts

    Description

    This class includes various unclassified structures of the hand observed in X-ray images, excluding those attributed to the DIP, MCP, PIP, Radius, Ulna, and Wrist classes.

    Instructions

    Annotate all recognizable hand structures that do not fall into any specific categories like joints or bones. The annotation should cover visible hand features, ensuring all discernible parts are included except those clearly identified as separate classes.

    DIP

    Description

    The DIP joints are the joints located at the fingertips, immediately before the nail segments, appearing as small gaps or dark lines in X-ray images.

    Instructions

    Draw bounding boxes around the small, distinct gaps appearing near the fingertips. Be sure to include the joint space between the phalanges, and do not let the box extend to adjacent bones or non-joint areas.

    MCP

    Description

    MCP joints are located at the base of each finger where the fingers connect with the hand. These appear as prominent spaces on X-rays.

    Instructions

    Annotate the large joint spaces found at the base of each finger. Make sure the bounding box encapsulates the joint area without overlapping into the phalanges or hand body.

    PIP

    Description

    PIP joints are the intermediate finger joints, located between the proximal and middle phalanges. They appear as visible gaps in X-ray images.

    Instructions

    Carefully highlight the joint spaces located in the mid-segment of each finger. Avoid extending the annotation into adjacent bones or including any metacarpophalangeal or distal joints.

    Radius

    Description

    The radius is a significant forearm bone, visible as a long, thick bone on the thumb side in X-ray images.

    Instructions

    Mark the entire radius bone, ensuring the annotation captures its full extent from the wrist to the elbow without including adjacent bones or structures. The annotation should follow the bone's longitudinal shape.

    Ulna

    Description

    The ulna is the second long bone in the forearm, opposite the radius, and can be identified as running parallel to the radius.

    Instructions

    Enclose the ulna in a bounding box that captures its length from the wrist down to the elbow, distinct from the radius. Keep annotations strictly to the confines of the ulna without overlapping adjacent bones.

    Wrist

    Description

    The wrist region connects the forearm to the hand, comprising carpal bones visible as a cluster in the X-ray.

    Instructions

    Annotate the cluster of bones connecting the hand to the forearm. Be sure to cover the complex structure of the wrist without overlapping the forearm bones, and provide a clear boundary between the wrist and the hand.

  20. Data from: Supplementary Material for "Sonification for Exploratory Data...

    • search.datacite.org
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. http://doi.org/10.4119/unibi/2920448
    Dataset updated
    Feb 5, 2019
    Dataset provided by
    DataCite: https://www.datacite.org/
    Bielefeld University
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    #### Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented to give some examples for the framework of Model-Based Sonification, developed in Chapter 7. Sonification models determine the rendering of the sonification and possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    ##### 8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.

    * Table 8.2, page 87: Sound examples for Data Sonograms
      File:
        Iris dataset: started in plot (a) at S0 (b) at S1 (c) at S2
        10d noisy circle dataset: started in plot (c) at S0 (mean) (d) at S1 (edge)
        10d Gaussian: plot (d) started at S0
        3 clusters: Example 1
        3 clusters: invisible columns used as output variables: Example 2
      Description: Data Sonogram Sound examples for synthetic datasets and the Iris dataset
      Duration: about 5 s

    ##### 8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.

    * Sound example: page 93, PTSM-Ex-1 Audification of 1 particle in the potential of phi(x).
    * Sound example: page 93, PTSM-Ex-2 Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    * Sound example: page 94, PTSM-Ex-3 Audification of 25 particles simultaneous in a potential of a dataset with 2 clusters.
    * Sound example: page 94, PTSM-Ex-4 Audification of 25 particles simultaneous in a potential of a dataset with 1 cluster.
    * Sound example: page 95, PTSM-Ex-5 sigma-step sequence for a mixture of three Gaussian clusters
    * Sound example: page 95, PTSM-Ex-6 sigma-step sequence for a Gaussian cluster
    * Sound example: page 96, PTSM-Iris-1 Sonification for the Iris Dataset with 20 particles per step.
    * Sound example: page 96, PTSM-Iris-2 Sonification for the Iris Dataset with 3 particles per step.
    * Sound example: page 96, PTSM-Tetra-1 Sonification for a 4d tetrahedron clusters dataset.

    ##### 8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly concerning the modes of p by sound.

    * Sound Example: page 105, MCMC-Ex-1 McMC Sonification, stabilization of amplitudes.
    * Sound Example: page 106, MCMC-Ex-2 Trajectory Audification for 100 McMC steps in 3 cluster dataset
    * McMC Sonification for Cluster Analysis, dataset with three clusters, page 107
      * Stream 1 MCMC-Ex-3.1
      * Stream 2 MCMC-Ex-3.2
      * Stream 3 MCMC-Ex-3.3
      * Mix MCMC-Ex-3.4
    * McMC Sonification for Cluster Analysis, dataset with three clusters, T=0.002s, page 107
      * Stream 1 MCMC-Ex-4.1 (stream 1)
      * Stream 2 MCMC-Ex-4.2 (stream 2)
      * Stream 3 MCMC-Ex-4.3 (stream 3)
      * Mix MCMC-Ex-4.4
    * McMC Sonification for Cluster Analysis, density with 6 modes, T=0.008s, page 107
      * Stream 1 MCMC-Ex-5.1 (stream 1)
      * Stream 2 MCMC-Ex-5.2 (stream 2)
      * Stream 3 MCMC-Ex-5.3 (stream 3)
      * Mix MCMC-Ex-5.4
    * McMC Sonification for the Iris dataset, page 108
      * MCMC-Ex-6.1
      * MCMC-Ex-6.2
      * MCMC-Ex-6.3
      * MCMC-Ex-6.4
      * MCMC-Ex-6.5
      * MCMC-Ex-6.6
      * MCMC-Ex-6.7
      * MCMC-Ex-6.8

    ##### 8.4 Principal Curve Sonification

    Principal Curve Sonification represents data by synthesizing the soundscape while a virtual listener moves along the principal curve of the dataset through the model space.

    * Noisy Spiral dataset, PCS-Ex-1.1, page 113
    * Noisy Spiral dataset with variance modulation PCS-Ex-1.2, page 114
    * 9d tetrahedron cluster dataset (10 clusters) PCS-Ex-2, page 114
    * Iris dataset, class label used as pitch of auditory grains PCS-Ex-3, page 114

    ##### 8.5 Data Crystallization Sonification Model

    * Table 8.6, page 122: Sound examples for Crystallization Sonification for 5d Gaussian distribution
      File: DCS started at center, in tail, from far outside
      Description: DCS for dataset sampled from N{0, I_5} excited at different locations
      Duration: 1.4 s
    * Mixture of 2 Gaussians, page 122
      * DCS started at point A DCS-Ex1A
      * DCS started at point B DCS-Ex1B
    * Table 8.7, page 124: Sound examples for DCS on variation of the harmonics factor
      File: h_omega = 1, 2, 3, 4, 5, 6
      Description: DCS for a mixture of two Gaussians with varying harmonics factor
      Duration: 1.4 s
    * Table 8.8, page 124: Sound examples for DCS on variation of the energy decay time
      File: tau_(1/2) = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
      Description: DCS for a mixture of two Gaussians varying the energy decay time tau_(1/2)
      Duration: 1.4 s
    * Table 8.9, page 125: Sound examples for DCS on variation of the sonification time
      File: T = 0.2, 0.5, 1, 2, 4, 8
      Description: DCS for a mixture of two Gaussians on varying the duration T
      Duration: 0.2s -- 8s
    * Table 8.10, page 125: Sound examples for DCS on variation of model space dimension
      File: selected columns of the dataset: (x0) (x0,x1) (x0,...,x2) (x0,...,x3) (x0,...,x4) (x0,...,x5)
      Description: DCS for a mixture of two Gaussians varying the dimension
      Duration: 1.4 s
    * Table 8.11, page 126: Sound examples for DCS for different excitation locations
      File: starting point: C0, C1, C2
      Description: DCS for a mixture of three Gaussians in 10d space with different rank(S) = {2,4,8}
      Duration: 1.9 s
    * Table 8.12, page 126: Sound examples for DCS for the mixture of a 2d distribution and a 5d cluster
      File: condensation nucleus in (x0,x1)-plane at: (-6,0)=C1, (-3,0)=C2, (0,0)=C0
      Description: DCS for a mixture of a uniform 2d and a 5d Gaussian
      Duration: 2.16 s
    * Table 8.13, page 127: Sound examples for DCS for the cancer dataset
      File: condensation nucleus in (x0,x1)-plane at:
        benign 1, benign 2
        malignant 1, malignant 2
      Description: DCS for a mixture of a uniform 2d and a 5d Gaussian
      Duration: 2.16 s

    ##### 8.6 Growing Neural Gas Sonification

    * Table 8.14, page 133: Sound examples for GNGS Probing
      File:
        Cluster C0 (2d): a, b, c
        Cluster C1 (4d): a, b, c
        Cluster C2 (8d): a, b, c
      Description: GNGS for a mixture of 3 Gaussians in 10d space
      Duration: 1 s
    * Table 8.15, page 134: Sound examples for GNGS for the noisy spiral dataset
      File:
        (a) GNG with 3 neurons 1, 2
        (b) GNG with 20 neurons end, middle, inner end
        (c) GNG with 45 neurons outer end, middle, close to inner end, at inner end
        (d) GNG with 150 neurons outer end, in the middle, inner end
        (e) GNG with 20 neurons outer end, in the middle, inner end
        (f) GNG with 45 neurons outer end, in the middle, inner end
      Description: GNG probing sonification for 2d noisy spiral dataset
      Duration: 1 s
    * Table 8.16, page 136: Sound examples for GNG Process Monitoring Sonification for different data distributions
      File:
        Noisy spiral with 1 rotation: sound
        Noisy spiral with 2 rotations: sound
        Gaussian in 5d: sound
        Mixture of 5d and 2d distributions: sound
      Description: GNG process sonification examples
      Duration: 5 s

    #### Chapter 9: Extensions

    In this chapter, two extensions for Parameter Mapping
